Anomaly detection, with broad applications in medical diagnosis, network intrusion detection, credit card fraud detection, sensor network fault detection and numerous other fields, has recently received significant attention in the machine learning community. Given the scarcity and diversity of anomalous data, anomaly detection is usually modeled as an unsupervised learning or one-class classification problem, i.e., the training dataset contains only “normal” data and anomalous data is not available during training.
With the recent advances in deep neural networks, reconstruction-based methods [35, 1, 33] have shown great promise for anomaly detection. Most reconstruction-based methods adopt an autoencoder and assume that normal and anomalous samples lead to significantly different embeddings, so that the corresponding reconstruction errors can be leveraged to differentiate the two types of samples. However, this assumption may not always hold. Reconstruction-based methods can fail when the autoencoder captures mostly the patterns shared between normal and anomalous data and “generalizes” too well [12, 42], resulting in good reconstruction of both normal and anomalous data. To mitigate this drawback, Perera et al. proposed to explicitly constrain the latent space to exclusively represent the given normal class but not the anomalous class, which enlarges the gap in reconstruction errors between normal and anomalous data and suggests that it is important to control what information is embedded in the autoencoder.
As shown in Figure 1, we propose a transform-based method for anomaly detection, which leverages transformation as a way to control the information embedded in the autoencoder and further enlarge the gap in reconstruction errors between normal and anomalous data. Unlike prior work that controls the embedding by constraining the latent space, we apply a selected set of transformations (e.g., graying and random rotation) to erase certain targeted information from images, and then leverage the reconstruction capability of the autoencoder to embed the targeted information during the restoration of the original images. The inverse-transform autoencoder (named ITAE hereafter) is trained with normal samples only and thus embeds only the key information of the normal class.
For ITAE to succeed at anomaly detection, the transformations need to satisfy a vital criterion: the information erased by the transformations must be able to distinguish between normal and anomalous data; otherwise, the indistinguishable erased information leads to similar information embedding, resulting in similar restoration errors for both normal and anomalous data. Since anomaly detection is unsupervised in nature, it is difficult to know ahead of time which transformation would work. We thus propose to simultaneously adopt a simple set of universal transformations based on human priors. For example, in handwritten digit recognition, orientation is important for the normal class “6” based on human priors. Applying a random rotation transformation to the digits forces ITAE to embed orientation information during restoration. When an anomalous sample (such as the digit “9”) is fed to ITAE, it is likely to be restored to a “6”, because the model cannot tell whether it is a “9” or a rotated “6”, leading to large restoration errors.
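The intuition above can be made concrete with a tiny sketch (our own illustration, not code from the paper): under a random 90-degree rotation, an image and any rotated copy of it induce exactly the same set of possible inputs, so the orientation of the original cannot be recovered from the transformed input alone.

```python
import numpy as np

def rotation_orbit(img):
    """All 90-degree rotations of an image, as a hashable set."""
    return {np.rot90(img, k).tobytes() for k in range(4)}

# A tiny asymmetric stand-in for a digit "6"; rotating it by 180 degrees
# plays the role of the digit "9".
six = np.array([[1, 0],
                [1, 1]], dtype=np.uint8)
nine = np.rot90(six, 2)

# Both images produce exactly the same set of randomly rotated inputs,
# so a model seeing only the transformed input cannot tell them apart.
assert rotation_orbit(six) == rotation_orbit(nine)
```

Since ITAE is trained on normal data only, it resolves this ambiguity toward the normal class, which is precisely what produces the large restoration error on the anomaly.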
To validate the effectiveness of ITAE, we conduct extensive experiments on several benchmarks and compare ITAE with state-of-the-art methods. Our experimental results show that ITAE outperforms state-of-the-art methods in terms of model accuracy and stability across different tasks. To evaluate on more challenging tasks, we experiment with the large-scale dataset ImageNet  and show that ITAE improves the AUROC of the top-performing baseline by 10.1%. Experiments on the real-world anomaly detection dataset MVTec AD  and the recent video anomaly detection benchmark ShanghaiTech  show that ITAE is more adaptable to complex real-world environments.
2 Related Works
For anomaly detection on images or videos, a large variety of methods have been developed in recent years [7, 24, 25, 8, 28, 19]. Popular methods for anomaly detection in still images, which we study in this paper, can be grouped into three categories: statistics-based, reconstruction-based and classification-based approaches.
Statistics-based approaches: These methods tended to depict the normal data with statistical tools. Through training, a distribution function was fitted to the features extracted from the normal data to represent them in a shared latent space. During testing, samples mapped to different statistical representations were considered anomalous.
Reconstruction-based approaches: Many works approached the anomaly detection task through reconstruction [3, 36, 35, 42, 9]. Based on autoencoders, these methods compressed normal samples into a lower-dimensional representation space and then reconstructed higher-dimensional outputs; the normal and anomalous samples were distinguished by their reconstruction errors. Sabokrou et al.  and Akcay et al.  employed adversarial training to optimize the autoencoder and leveraged its discriminator to further enlarge the reconstruction error gap between normal and anomalous data. Furthermore, Akcay et al.  leveraged another encoder to embed the reconstruction results into the subspace in which the reconstruction error is calculated. Gong et al.  augmented the autoencoder with a memory module, developing a memory-augmented autoencoder that strengthens reconstruction errors on anomalies. Another line of work applied two adversarial discriminators and a classifier on a denoising autoencoder; by adding constraints and forcing each randomly drawn latent code to reconstruct examples resembling the normal data, it obtained high reconstruction errors for the anomalous data.
Classification-based approaches: Some approaches tackled anomaly detection through classification. Hendrycks et al.  observed that anomaly detection can still benefit from introducing extra data under a classification framework, even when the extra data is limited in quantity and weakly correlated with the normal data. Lee et al.  used Kullback-Leibler (KL) divergence to guide a GAN to generate anomalous data closer to the normal data, yielding a better training set for classification. Golan et al.  applied dozens of geometric image transforms and created a self-labeled dataset for transformation classification; the model tends to capture the patterns of the different transformations. In contrast, the transformations in our work are used to erase information from the input data.
3 Inverse-Transform AutoEncoder
3.1 Problem Statement
In this section, we first formulate the problem of anomaly detection on images. Let $\mathcal{X}$, $\mathcal{X}_n$ and $\mathcal{X}_a$ denote the entire dataset, the normal dataset and the anomalous dataset, respectively, where $\mathcal{X} = \mathcal{X}_n \cup \mathcal{X}_a$ and $\mathcal{X}_n \cap \mathcal{X}_a = \emptyset$. Given any image $x \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$ and $W$ denote the dimensions of image channels, height and width, the goal is to build a model that discriminates whether $x \in \mathcal{X}_n$ or $x \in \mathcal{X}_a$. We formulate this process from a probabilistic view: (1) draw an image $x$ from the dataset; (2) test whether this sample comes from the distribution of normal data or of anomalous data, i.e., $p_n(x)$ or $p_a(x)$. Thus, our model parameterizes the posterior and simultaneously tests whether the given sample belongs to the normal marginal distribution.
3.2 Model Architectures
In this section, we present the Inverse-Transform AutoEncoder (ITAE) in detail. ITAE is based on an encoder-decoder framework to restore the samples from some information-erasing transformations. The restoration errors are used for anomaly detection.
To achieve effective anomaly detection, we design efficient transformations based on human priors to erase crucial and distinctive information from the input samples. Generally, a transformation should satisfy two conditions:
The transformations need to erase information; otherwise, the transformations are linearly invertible and ITAE degenerates to learning the linear inverse transformations: the transformation and ITAE compose into an identity mapping, causing low restoration error for both normal and anomalous samples.
The transformations should erase distinctive information, which is the key to differentiating normal and anomalous data; otherwise, the model tends to restore only common information, leading to similar restoration errors for any data.
The proposed ITAE forces the autoencoder to embed the erased information in the model and restore the normal samples. Suppose we have a set of transformations $\mathcal{T} = \{T_1, \dots, T_M\}$, where $T_i$ denotes the $i$-th transformation and $M$ is the number of operations. In the training phase, given $x$ sampled from $\mathcal{X}_n$ and any transformation $T_i$, the transformed sample is $T_i(x)$. The proposed ITAE takes the transformed samples as inputs and attempts to inversely restore the original training samples. Mathematically, given $T_i(x)$, the restored sample $\hat{x}$ is formulated as
$$\hat{x} = D(E(T_i(x))),$$
where $E$ and $D$ indicate the encoder and decoder of ITAE. By minimizing the likelihood-based restoration loss, ITAE is forced to capture the specific pattern of the inverse transformation to recover the original normal training data. The corresponding structure is shown in Figure 2. Note that although ITAE is employed for the inverse transform, it differs from existing autoencoder-based anomaly detection methods in that its inputs and outputs are asymmetric, i.e., ITAE needs to restore the information erased by the transformations.
In the testing phase, both normal and anomalous data are fed into the model. We design a metric based on the restoration error to decide whether a sample belongs to the normal set. We expect the restorations of normal samples to show much smaller errors than those of anomalous samples due to the specific inverse-transform scheme.
3.3 Training Loss and Error Measurement
To train ITAE for effective anomaly detection, the $\ell_2$ loss is utilized to measure the distance between the restored samples and the targets, since it is smoother and places more penalty on the dimensions with larger generation errors. Let the target image be $x$; the training loss is formulated as
$$\mathcal{L} = \mathbb{E}_{x \sim p_n} \, \mathbb{E}_{T_i \in \mathcal{T}} \left\| D(E(T_i(x))) - x \right\|_2^2,$$
where $\|\cdot\|_2$ denotes the $\ell_2$ norm. We approximate the expectations via Monte Carlo estimation, averaging the costs over the samples and transformations in each mini-batch.
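The Monte Carlo estimate of this loss can be sketched in a few lines of NumPy (a toy illustration in our own notation; `model` and the transformations here are stand-ins, not the paper's network):

```python
import numpy as np

def l2_restoration_loss(model, transforms, batch):
    """Monte Carlo estimate of E_x E_T ||model(T(x)) - x||_2^2,
    averaged over the samples and transformations in a mini-batch."""
    losses = [np.sum((model(T(x)) - x) ** 2)
              for x in batch for T in transforms]
    return float(np.mean(losses))

rng = np.random.default_rng(0)
batch = [rng.random((4, 4)) for _ in range(3)]
identity_model = lambda z: z

# With no information-erasing transformation, an identity "autoencoder"
# restores perfectly; under rotation it must learn the inverse transform.
assert l2_restoration_loss(identity_model, [lambda x: x], batch) == 0.0
assert l2_restoration_loss(identity_model, [lambda x: np.rot90(x)], batch) > 0.0
```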
In the test phase, we calculate the restoration error of each input image for anomaly detection. We note that the $\ell_1$ loss is more suitable for measuring the distance between outputs and original images. For each $T_i$ in the transformation set $\mathcal{T}$, we first calculate the expected restoration error on the normal training data; we then use this error to normalize the test restoration error corresponding to $T_i$; finally, we calculate the expectation of the normalized errors across the different transformations. Let the test sample be $\bar{x}$; the restoration error is formulated as
$$s(\bar{x}) = \mathbb{E}_{T_i \in \mathcal{T}} \left[ \frac{\left\| D(E(T_i(\bar{x}))) - \bar{x} \right\|_1}{\mathbb{E}_{x \sim p_n} \left\| D(E(T_i(x))) - x \right\|_1} \right],$$
where $p_n$ indicates the distribution of normal data, which is consistent with the distribution of the training set, and $\|\cdot\|_1$ denotes the $\ell_1$ norm. We again approximate the expectations by averaging. A normal sample leads to a low restoration error; the higher $s(\bar{x})$, the higher the probability that the sample is anomalous.
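The normalized test-time score can be sketched as follows (a minimal sketch assuming the per-transformation $\ell_1$ errors have already been computed; the variable names and numbers are ours, for illustration only):

```python
import numpy as np

def anomaly_score(test_errors, train_mean_errors):
    """Average of per-transformation l1 restoration errors, each normalized
    by the mean training error under the same transformation.
    A normal sample scores near 1; higher scores suggest an anomaly."""
    t = np.asarray(test_errors, dtype=float)
    m = np.asarray(train_mean_errors, dtype=float)
    return float(np.mean(t / m))

train_mean = [0.10, 0.20]  # mean errors of normal training data per transform
print(anomaly_score([0.11, 0.19], train_mean))  # ~1.0: looks normal
print(anomaly_score([0.40, 0.90], train_mean))  # 4.25: restores poorly -> anomalous
```

Normalizing per transformation keeps one transformation with intrinsically large errors from dominating the averaged score.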
4 Experiments
In this section, we conduct substantial experiments to validate our method. ITAE is first evaluated on multiple commonly used benchmark datasets under unsupervised anomaly detection settings, as well as on the large-scale dataset ImageNet , which has rarely been examined in previous anomaly detection studies. Next, we conduct experiments on real anomaly detection datasets to evaluate performance in real-world environments. We then present the respective effects of different designs (e.g., different types of image-level transformation and loss function design) through an ablation study. Finally, the stability of our models is validated by monitoring performance fluctuation during training and comparing the final performance after convergence across multiple training attempts, all starting from random weights and using the same training configuration.
4.1 Experiments on Popular Benchmarks
4.1.1 Experimental Setups
Datasets. Our experiments involve five popular image datasets: MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100 and ImageNet. For all datasets, the default training and test partitions are kept. In addition, the pixel values of all images are normalized. We briefly introduce these five datasets as follows:
MNIST : consists of 70,000 handwritten grayscale digit images.
Fashion-MNIST : a relatively new dataset comprising grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per category.
CIFAR-10 : consists of 60,000 RGB images of 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images, divided in a uniform proportion across all classes.
CIFAR-100 : consists of 100 classes, each containing 600 RGB images. The 100 classes are grouped into 20 “superclasses”, which makes the experiments more concise and enlarges the data volume of each selected “normal” class.
ImageNet : we select a subset of categories, clustered into superclasses with LDA , a natural language processing method (see the appendix for more details). We note that little anomaly detection research has been conducted on ImageNet, since its images have higher resolution and more complex backgrounds.
Model configurations. We build ITAE on an encoder-decoder network and add skip-connections between some encoder layers and the corresponding decoder layers to facilitate gradient backpropagation and improve image restoration. We use stochastic gradient descent (SGD)  for training, scaling the number of epochs with the number of transformations used. The learning rate is initially set to 0.1 and is divided by 2 periodically during training. In our experiments, we use a transformation pipeline that contains two tandem operations:
Graying: This operation averages each pixel value along the channel dimension of images.
Random rotation: This operation rotates each image channel anticlockwise around its center. The rotation angle is randomly selected from a fixed set of angles.
The graying operation erases channel information, and the random rotation operation erases objects’ orientation. Both of them meet the conditions introduced in Section 3.2. For example, random rotation is not linearly invertible (first condition) because of its randomness. Meanwhile, it removes the orientation of objects, which may be an important characteristic of an object (second condition).
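The two operations can be sketched as follows for channel-first (C, H, W) images (our own minimal implementation; the concrete set of quarter-turn angles is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def graying(img):
    """Average pixel values along the channel axis: (C, H, W) -> (1, H, W)."""
    return img.mean(axis=0, keepdims=True)

def random_rotation(img, angles=(0, 90, 180, 270)):
    """Rotate every channel anticlockwise around the image center by an
    angle drawn at random from `angles` (quarter turns only here)."""
    k = int(rng.integers(len(angles)))
    return np.rot90(img, k=k, axes=(1, 2)).copy()

x = rng.random((3, 32, 32))      # a CHW RGB image
t = random_rotation(graying(x))  # the tandem transformation
assert t.shape == (1, 32, 32)    # channel information is gone, shape preserved
```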
Evaluation protocols. In our experiments, we quantify model performance using the area under the Receiver Operating Characteristic curve (AUROC). It is commonly adopted as the performance measure in anomaly detection tasks and eliminates the subjective choice of a threshold for dividing the “normal” samples from the anomalous ones.
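AUROC itself needs no threshold: it equals the probability that a randomly chosen anomalous sample receives a higher anomaly score than a randomly chosen normal one. A minimal sketch of this rank-statistic view (our own helper, not code from the paper):

```python
import numpy as np

def auroc(scores_normal, scores_anomalous):
    """AUROC as a rank statistic: P(anomalous score > normal score),
    counting ties as one half."""
    s_n = np.asarray(scores_normal, dtype=float)
    s_a = np.asarray(scores_anomalous, dtype=float)
    greater = (s_a[:, None] > s_n[None, :]).sum()
    ties = (s_a[:, None] == s_n[None, :]).sum()
    return (greater + 0.5 * ties) / (s_a.size * s_n.size)

print(auroc([0.1, 0.2, 0.3], [0.8, 0.9]))  # 1.0: perfect separation
print(auroc([0.1, 0.9], [0.2, 0.8]))       # 0.5: no better than chance
```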
4.1.2 Comparison with State-of-the-art Methods
For a dataset with $K$ classes, we conduct a batch of experiments in which each class is set as the “normal” class once. We then evaluate performance on an independent test set containing samples from all classes, both normal and anomalous. As all classes have equal numbers of samples in our selected datasets, the overall proportion of normal to anomalous samples is simply $1{:}(K-1)$.
In Table 1, we provide detailed results on MNIST, Fashion-MNIST, CIFAR-10 and CIFAR-100. Several popular methods are included in the comparison: VAE , DAGMM , DSEBM , AnoGAN , ADGAN , GANomaly , OCGAN , GeoTrans  and our baseline backbone AE. Results of VAE, AnoGAN and ADGAN are borrowed from ; results of DAGMM, DSEBM and GeoTrans are borrowed from . We use the officially released source code of GANomaly to fill in the results missing from  under our experimental settings. For RGB datasets, such as CIFAR-10 and CIFAR-100, we use the graying and random rotation operations in tandem, together with standard data augmentations (flipping/mirroring/shifting) that are widely used in [14, 16]. For grayscale datasets, such as MNIST and Fashion-MNIST, we only use the random rotation transformation, without any data augmentation.
Across all involved datasets, the experimental results show that the average AUROC of ITAE outperforms all other methods to varying extents. For each individual image class, ITAE also obtains competitive performance, demonstrating its effectiveness for anomaly detection. To further validate our method, we conduct experiments on a subset of the ILSVRC 2012 classification dataset . Table 1 also shows the performance of GANomaly, GeoTrans, the baseline AE and our method on ImageNet. As can be seen, our method significantly outperforms the other three and maintains stable performance on this more difficult dataset.
In addition, GeoTrans  requires more GPU memory and computation time. When testing on CIFAR-10 (10,000 images in total), GeoTrans needs 285.45s (35 fps, NVIDIA GTX 1080Ti, averaged over 10 runs) and 1389MB of GPU memory, whereas ITAE takes only 36.97s (270 fps in the same environment, about 7.7x faster than GeoTrans) and 713MB of GPU memory, thanks to its efficient pipeline and network structure. Figure 3 compares the frames per second (FPS), GPU memory usage and AUROC of various anomaly detection methods tested on CIFAR-10. ITAE occupies relatively little GPU memory while achieving a relatively high FPS.
4.2 Experiments on Real-world Anomaly Detection
Previous works [11, 9] experiment on multi-class classification datasets, since comprehensive real-world datasets for anomaly detection are lacking. By defining anomalous events as occurrences of different object classes and splitting the datasets under unsupervised settings, multi-class datasets can be used for anomaly detection experiments. However, real anomalous data does not necessarily fit these settings, e.g., damaged objects. In this section, we experiment on the recent real-world anomaly detection benchmark MVTec AD .
MVTec anomaly detection dataset. The MVTec Anomaly Detection (MVTec AD) dataset  contains 5,354 high-resolution color images of different object and texture categories. It contains normal images intended for training and images with anomalies intended for testing. The anomalies manifest themselves as over 70 different types of defects such as scratches, dents, and various structural changes. In this paper, we conduct image-level anomaly detection on MVTec AD to classify normal and anomalous objects.
Comparison with state-of-the-art methods. Table 2 shows that ITAE performs better than the baseline AE, GANomaly and GeoTrans. The advantage of ITAE over GeoTrans grows when moving from curated datasets to the real-world MVTec AD, leading us to conclude that ITAE is more adaptable to complex real-world environments.
4.3 Experiments on Video Anomaly Detection
Video anomaly detection, as distinguished from image-level anomaly detection, requires detecting anomalous objects and strenuous motions in video data. We experiment on the recent video anomaly detection benchmark ShanghaiTech , comparing our method with other state-of-the-art approaches.
ShanghaiTech. ShanghaiTech  contains scenes with complex lighting conditions and camera angles. It includes numerous anomalous events and a large number of training frames. In the dataset, objects other than pedestrians (e.g., vehicles) and strenuous motions (e.g., fighting and chasing) are treated as anomalies.
Comparison with state-of-the-art methods. Since ITAE is designed for image-level anomaly detection, unlike some state-of-the-art methods [23, 41, 12], we use single frames rather than stacks of neighboring frames as inputs. To apply the random rotation transformation, we resize all images to a square resolution. We use ResNet34  as our encoder. Following [13, 23, 12], we obtain the normality score of the $t$-th frame by normalizing the errors to the range $[0, 1]$:
$$s(t) = 1 - \frac{e(t) - \min_t e(t)}{\max_t e(t) - \min_t e(t)},$$
where $e(t)$ denotes the restoration error of the $t$-th frame in a video episode. A value of $s(t)$ closer to 0 indicates that the frame is more likely anomalous. Table 3 reports AUROC values on the ShanghaiTech dataset. The results show that ITAE outperforms all the state-of-the-art methods, including some temporally dependent ones [23, 41, 12].
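The per-episode normalization can be sketched as follows (a minimal sketch; `errors` holds the frame-wise restoration errors of one video episode, and the numbers are invented for illustration):

```python
import numpy as np

def normality_scores(errors):
    """Min-max normalize frame errors e(t) of one episode into normality
    scores s(t) = 1 - (e(t) - min e) / (max e - min e) in [0, 1].
    Scores near 0 mark likely anomalous frames."""
    e = np.asarray(errors, dtype=float)
    return 1.0 - (e - e.min()) / (e.max() - e.min())

s = normality_scores([0.2, 0.3, 1.0, 0.25])  # the third frame restores much worse
print(s)  # the worst-restored frame gets score 0.0, the best gets 1.0
```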
4.4 Ablation Study and Discussion
In this part, we study the contributions of the proposed components of ITAE independently. Table 4 shows the results of an ablation study on CIFAR-10: both the graying and random rotation operations improve performance significantly, especially random rotation. Table 5 shows an ablation on the choice of restoration loss: using the $\ell_2$ loss for training and the $\ell_1$ loss for calculating the restoration error performs best. Through the ablation study, we conclude that the image transformations, the network architecture and the loss functions all contribute independently to the model’s performance.
We use image scaling to study the degradation of ITAE under ill-chosen transformations. Downsampling removes part of the image information; however, the second condition is not met, since the removed pixel-level information can be inferred from neighboring pixels, and this holds equally for normal and anomalous data. Testing on CIFAR-10 with 0.5x scaling, ITAE obtains 58.8% AUROC while AE obtains 59.3%, showing that ITAE degenerates into a vanilla AE under ill-chosen transformations.
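The failure of the second condition can be checked numerically: for a locally smooth image, the pixels discarded by 0.5x downsampling are almost fully predictable from their neighbors (a toy illustration with a linear intensity ramp; the setup and numbers are ours, not the paper's):

```python
import numpy as np

# A smooth 16x16 "image": a linear intensity ramp.
img = np.linspace(0.0, 1.0, 256).reshape(16, 16)

# 0.5x scaling via 2x2 average pooling.
small = img.reshape(8, 2, 8, 2).mean(axis=(1, 3))

# Naive restoration: nearest-neighbour upsampling of the pooled image.
restored = np.repeat(np.repeat(small, 2, axis=0), 2, axis=1)

# The erased pixels are recovered almost exactly from their neighbours,
# regardless of whether the image is "normal" or "anomalous".
err = float(np.abs(restored - img).mean())
print(f"mean restoration error: {err:.3f}")  # small for any smooth image
assert err < 0.05
```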
4.5 Model Stability
Anomaly detection places higher demands on the stability of model performance than traditional classification tasks, because the lack of anomalous data makes validation during training impossible. Model stability is therefore especially important: without validation, there is no way to confidently select the best checkpoint during a training run.
The stability of model performance is mainly reflected in three aspects: 1) whether the model can stably reach convergence after an acceptable number of training epochs; 2) whether the model can reach a stable performance level across multiple independent training attempts under the same training configuration; 3) whether the model can stably achieve good performance across various datasets and training configurations. Figure 4 shows how the AUROC changes during one run, revealing that our model performs stably in the late training phase instead of fluctuating.
Thus, we can be confident of obtaining a robust model for this practically validation-free task after a training period. To test stability across multiple training runs, we rerun GeoTrans  and our method 10 times on MNIST. Table 6 shows that GeoTrans suffers larger performance fluctuations than our method. For the third aspect, the standard deviation (SD) across classes provides a good measure: the SD values in Table 1 show that our method is the most stable of this type.
4.6 Visualization Analysis
Anomaly detection on images. To demonstrate the effectiveness of transformations for anomaly detection in a simple and straightforward way, we visualize some restoration outputs from ITAE, compared with GANomaly, in Figure 5. All visualization results use the digit “6” as the normal class.
The first column, “Ori”, shows the original images; “I” denotes the images after transformation. Note that outputs should always be compared with the original images, not the inputs. Cases whose outputs are similar to “Ori” are considered normal, otherwise anomalous. The bottom row in Figure 5 shows a test on the digit “9”: the four outputs are far from “Ori” and are thus recognized as anomalous. Except for the digit “6”, the other digits receive wrongly oriented or ambiguous restoration outputs from ITAE, which enlarges the gap in restoration error between normal and anomalous data. In contrast, all outputs from GANomaly are similar to the ground truth, meaning it is less capable of distinguishing normal from anomalous data. We conclude that guiding information embedding via transformations is successful, since the outputs show that the model attempts to restore all images using the orientation distribution of the digits learned from the training set.
Anomaly detection on videos. Figure 6 shows the restoration error maps of AE and ITAE on an anomalous frame of ShanghaiTech, in which chasing is the anomalous event (red bounding box in Figure 6(a)). AE generalizes so “well” that it reconstructs the entire frame, including the chasing humans, and thus cannot correctly detect the anomalous event. ITAE significantly highlights the anomalous parts of the scene, which is why it outperforms the state-of-the-art on video anomaly detection.
5 Conclusion and Future Work
In this paper, we propose a novel technique named Inverse-Transform AutoEncoder (ITAE) for anomaly detection. Simple transformations are employed to erase certain information. The ITAE learns the inverse transform to restore the original data. The restoration error is expected to be a good indicator of anomalous data. We experiment with two simple transformations: graying and random rotation, and show that our method not only outperforms state-of-the-art methods but also achieves high stability. Notably, there are still more transformations to explore. These transformations, when added to the ITAE, are likely to further improve the performance for anomaly detection. We look forward to the addition of more transformations and the exploration of a more intelligent transformation selection strategy. In addition, this way of feature embedding can also be applied to more fields, opening avenues for future research.
-  (2018) GANomaly: Semi-Supervised Anomaly Detection via Adversarial Training. ACCV. Cited by: Appendix D, §1, §4.1.2, Table 1, Table 1, Table 2.
-  (2019) Skip-ganomaly: skip connected and adversarially trained encoder-decoder anomaly detection. arXiv preprint arXiv:1901.08954. Cited by: §2, §4.1.1.
-  (2015) Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE. Cited by: §2.
-  (2019) MVTec ad–a comprehensive real-world dataset for unsupervised anomaly detection. In CVPR, Cited by: §A.2, §1, §4.2, §4.2, Table 2.
-  (2003) Latent dirichlet allocation. Journal of machine Learning research. Cited by: §A.1, 5th item.
-  (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, Cited by: §4.1.1.
-  (2019) Deep learning for anomaly detection: a survey. Cited by: §2.
-  (2009) Anomaly detection: a survey. ACM computing surveys (CSUR). Cited by: §1, §2.
-  (2018) Anomaly detection with generative adversarial networks. Cited by: §2, §4.1.2, §4.2, Table 1, Table 1.
Anomaly detection over noisy data using learned probability distributions. In ICML, Cited by: §2.
-  (2018) Deep anomaly detection using geometric transformations. In NeurIPS, Cited by: §2, §4.1.2, §4.1.2, §4.2, §4.5, Table 1, Table 1, Table 2, Table 6.
-  (2019) Memorizing normality to detect anomaly: memory-augmented deep autoencoder for unsupervised anomaly detection. Cited by: §1, §2, §4.3, §4.3, Table 3.
-  (2016) Learning temporal regularity in video sequences. In CVPR, Cited by: §4.3, Table 3.
-  (2016) Deep residual learning for image recognition. In CVPR, Cited by: §4.1.2, §4.3.
Deep anomaly detection with outlier exposure. In ICLR, Cited by: §2.
-  (2017) Densely connected convolutional networks.. In CVPR, Cited by: §4.1.2.
-  (2017) Image-to-image translation with conditional adversarial networks. In CVPR, Cited by: §4.1.1.
-  (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §4.1.2, Table 1.
-  (2018) An overview of deep learning based methods for unsupervised and semi-supervised anomaly detection in videos. Journal of Imaging. Cited by: §2.
-  (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §A.3, Table 9, Figure 7, Appendix C, 3rd item, 4th item.
The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Cited by: Figure 8, Appendix D, 1st item.
-  (2018) Training confidence-calibrated classifiers for detecting out-of-distribution samples. In ICLR, Cited by: §2.
-  (2017) A revisit of sparse coding based anomaly detection in stacked rnn framework. In ICCV, Cited by: §1, §4.3, §4.3, §4.3, §4.3, Table 3.
-  (2003) Novelty detection: a review—part 1: statistical approaches. Signal Processing. Cited by: §2.
-  (2003) Novelty detection: a review—part 2: neural network based approaches. Signal Processing. Cited by: §2.
-  (2011) Stacked convolutional auto-encoders for hierarchical feature extraction. In International Conference on Artificial Neural Networks, Cited by: Appendix D, §1.
-  (2019) OCGAN: one-class novelty detection using gans with constrained latent representations. Cited by: §1, §1, §2, §4.1.2, Table 1.
-  (2014) A review of novelty detection. Signal Processing. Cited by: §2.
Coherence pursuit: fast, simple, and robust principal component analysis. IEEE Transactions on Signal Processing. Cited by: §2.
-  (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, Cited by: §4.1.1.
-  (2018) Deep one-class classification. In ICML, Cited by: §1.
-  (2015) Imagenet large scale visual recognition challenge. IJCV. Cited by: §A.1, §1, 5th item, §4.1.2, §4.
-  (2018) Adversarially learned one-class classifier for novelty detection. In CVPR, Cited by: §1, §2.
-  (2014) Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Mlsda Workshop on Machine Learning for Sensory Data Analysis, Cited by: §1.
-  (2017) Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, Cited by: §1, §2, §4.1.2, Table 1.
-  (2015) Learning discriminative reconstructions for unsupervised outlier removal. In ICCV, Cited by: §2.
-  (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §A.3, Table 9, Figure 9, Appendix D, 2nd item.
-  (2012) Robust pca via outlier pursuit. IEEE Transactions on Information Theory. Cited by: §2.
On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. Data Mining & Knowledge Discovery. Cited by: §2.
Deep structured energy based models for anomaly detection. Cited by: §4.1.2, Table 1, Table 1.
-  (2017) Spatio-temporal autoencoder for video anomaly detection. In Proceedings of the 25th ACM international conference on Multimedia, Cited by: §4.3, §4.3, Table 3.
Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In ICLR, Cited by: §1, §2, §4.1.2, Table 1, Table 1.
Appendix A Class Names and Index for Datasets
A.1 ImageNet 
Even though there is a class tree on the ImageNet website, subjectively picking classes from it is not fully convincing, and the fact that not all image labels sit at the same tree level brings additional difficulties. To make the division of samples into super-clusters as objective as possible, we used LDA , a popular language processing tool, to cluster the categories automatically, instead of subjectively cherry-picking. Due to limited computing resources, we randomly select 10 categories for our experiment. Table 7 shows the specific category indices.
|0||Snake||n01728920, n01728572, n01729322, n01734418, n01737021, n01740131|
|1||Finch||n01530575, n01531178, n01532829, n01534433, n01795545, n01796340|
|2||Spider||n01773157, n01773549, n01774384, n01775062, n01773797, n01774750|
|3||Big cat||n02128385, n02128925, n02129604, n02130308, n02128757, n02129165|
|4||Beetle||n02165105, n02165456, n02169497, n02177972, n02167151|
|5||Wading bird||n02007558, n02012849, n02013706, n02018795, n02006656|
|6||Monkey||n02486261, n02486410, n02488291, n02489166|
|7||Fungus||n12985857, n13037406, n13054560, n13040303|
|8||Cat||n02123045, n02123394, n02124075, n02123159|
|9||Dog||n02088364, n02105412, n02106030, n02106166, n02106662, n02106550, n02088466, n02093754, n02091635|
A.2 MVTec AD 
The MVTec AD dataset contains 5,354 high-resolution color images of different object and texture categories. It contains normal images intended for training and images with anomalies intended for testing. The anomalies manifest themselves as over 70 different types of defects such as scratches, dents, contaminations, and various structural changes. Table 8 shows the class names and anomalous types for each category.
|Index||Class Name||Anomalous Types|
|0||Bottle||broken large, broken small, contamination|
|1||Capsule||crack, faulty imprint, poke, scratch, squeeze|
|2||Grid||bent, broken, glue, metal contamination, thread|
|3||Leather||color, cut, fold, glue, poke|
|4||Pill||color, combined, contamination, crack, faulty imprint, pill type, scratch|
|5||Tile||crack, glue strip, gray stroke, oil, rough|
|6||Transistor||bent, cut, damaged, misplaced|
|7||Zipper||broken teeth, combined, fabric border, fabric interior, rough, split teeth, squeezed teeth|
|8||Cable||bent wire, cable swap, combined, cut inner insulation, cut outer insulation, missing cable, missing wire, poke insulation|
|9||Carpet||color, cut, hole, metal contamination, thread|
|10||Hazelnut||crack, cut, hole, print|
|11||Metal nut||bent, color, flip, scratch|
|12||Screw||manipulated front, scratch head, scratch neck, thread side, thread top|
|14||Wood||color, combined, hole, liquid, scratch|
A.3 Other Datasets
CIFAR-100 superclasses (excerpt from Table 9):
|4||Fruit and vegetables|
|5||Household electrical devices|
|9||Large man-made outdoor things|
|10||Large natural outdoor scenes|
|11||Large omnivores and herbivores|
Appendix B Model Structure of ITAE
Table 10 shows the model structure of ITAE. It is based on an encoder-decoder framework, with 4 blocks in the encoder and 4 blocks in the decoder. Each block has a max-pooling or an upsampling operation followed by two convolutional layers. Skip-connections are added to facilitate gradient backpropagation and improve image reconstruction.
Appendix C Visualization Analysis on CIFAR-10
To further demonstrate the effectiveness of transformations for anomaly detection, we visualize some restoration outputs of CIFAR-10  from ITAE in Figure 7. All visualization results use the class “horse” as the normal class. The first column, “Ori”, shows the original images; “I” denotes the images after transformation. Note that outputs should always be compared with the original images, not the inputs. Cases whose outputs are similar to “Ori”, i.e., with lower restoration loss, are considered normal, otherwise anomalous. ITAE enlarges the gap in restoration error between normal and anomalous data.
Appendix D Model Stability
We argue that our proposed method achieves more robust performance. The main challenge in anomaly detection is the lack of negative samples: without validation, model stability matters more than in traditional classification tasks. We train three models (ITAE, a traditional autoencoder  and GANomaly ) on each category of the MNIST  and Fashion-MNIST  datasets and test every 5 epochs during training, with the traditional autoencoder  and GANomaly  serving as baselines. The performance during training is shown in Figures 8 and 9, from which we can see that ITAE always converges at a high level; moreover, ITAE shows the highest performance stability at the end of the training process.