1. Introduction
Anomaly detection, a.k.a. outlier detection, is the process of detecting data instances that significantly deviate from the majority of data instances. It has been an active research area for several decades, with early exploration dating back to the 1960s (Grubbs, 1969). Due to the increasing demand and broader applications in domains such as risk management, compliance, security, financial surveillance, health and medical risk, and AI safety, anomaly detection plays an increasingly important role, highlighted in various communities including data mining, machine learning, computer vision and statistics. In recent years, deep learning has shown tremendous capabilities in learning expressive representations of complex data such as high-dimensional data, temporal data, spatial data and graph data, pushing the boundaries of different learning tasks. Deep learning for anomaly detection, deep anomaly detection for short, aims at learning feature representations or anomaly scores via neural networks for the sake of anomaly detection. In recent years, a large number of deep anomaly detection methods have been introduced, demonstrating significantly better performance than conventional anomaly detection in addressing challenging detection problems in a variety of real-world applications. This work aims to provide a comprehensive review of this area. We first discuss the problem nature and major challenges of anomaly detection, then systematically review the current deep anomaly detection methods and their capabilities in addressing these challenges, and finally present a number of future opportunities in this area.
Because anomaly detection is a popular area, a number of studies (Hodge and Austin, 2004; Chandola et al., 2009; Aggarwal, 2017; Zimek et al., 2012; Akoglu et al., 2015; Gupta et al., 2013; Boukerche et al., 2020) have been dedicated to the categorization and review of anomaly detection techniques. However, they all focus on conventional anomaly detection methods only. One work closely related to ours is (Chalapathy and Chawla, 2019). It presents a good summary of a number of real-world applications of deep anomaly detection, but only provides very high-level outlines of selected categories of the techniques, from which it is highly difficult, if not impossible, to gain a sense of the approaches taken by the current methods and the intuition behind them. By contrast, to answer why we need deep anomaly detection, this review delineates the formulation of current deep detection methods to gain key insights about their underlying intuitions, inherent capabilities and weaknesses in addressing some largely unsolved challenges in anomaly detection. This forms a deep understanding of the problem nature and the state of the art, and brings about genuine open opportunities.
In summary, this work makes the following five major contributions:

Problem nature and challenges. We discuss some unique problem complexities underlying anomaly detection and the resulting largely unsolved challenges.

Categorization and formulation. We formulate the current deep anomaly detection methods into three principled frameworks: deep learning for generic feature extraction, learning representations of normality, and end-to-end anomaly score learning. A hierarchical taxonomy is presented to categorize the methods based on 11 different modeling perspectives.

Comprehensive literature review. We review a large number of relevant studies in leading conferences and journals of several relevant communities, including machine learning, data mining, computer vision and artificial intelligence, to present a comprehensive literature review of the research progress. To provide an in-depth introduction, we delineate the basic assumptions, objective functions and key intuitions of all categories of the methods, together with their capabilities in addressing some of the aforementioned challenges.

Future opportunities. We further discuss a set of possible future opportunities and their implications for addressing relevant challenges.

Source code and datasets. We compile a collection of publicly accessible source code for nearly all categories of methods and a large number of real-world datasets with real anomalies to offer some empirical comparison benchmarks.
2. Anomaly Detection: Problem Complexities and Challenges
Owing to its unique nature, anomaly detection presents problem complexities distinct from those of most analytical and learning problems and tasks. This section summarizes such intrinsic complexities and unsolved detection challenges in complex anomaly data.
2.1. Major Problem Complexities
Unlike those problems and tasks on majority, regular or evident patterns, anomaly detection addresses minority, unpredictable/uncertain and rare events, leading to some unique complexities below that render general deep learning techniques ineffective.

Unknownness. Anomalies are associated with many unknowns, e.g., instances with unknown abrupt behaviors, data structures, and distributions. They remain unknown until they actually occur, such as novel terrorist attacks, frauds and network intrusions.

Heterogeneous anomaly classes. Anomalies are irregular, and thus, one class of anomalies may demonstrate completely different abnormal characteristics from another class of anomalies. For example, in video surveillance, the abnormal events robbery, traffic accidents and burglary are visually highly different.

Rarity and class imbalance. Anomalies are typically rare data instances, in contrast to normal instances, which often account for an overwhelming proportion of the data. Therefore, it is difficult, if not impossible, to collect a large amount of labeled abnormal instances. This results in the unavailability of large-scale labeled data in most applications. The class imbalance is also due to the fact that misclassification of anomalies is normally much more costly than that of normal instances.

Diverse types of anomaly. Three completely different types of anomaly have been explored (Chandola et al., 2009). Point anomalies are individual instances that are anomalous w.r.t. the majority of other individual instances, e.g., the abnormal health indicators of a patient. Conditional anomalies, a.k.a. contextual anomalies, also refer to individual anomalous instances but in a specific context, i.e., data instances are anomalous in the specific context and normal otherwise. The contexts can be highly different in real-world applications, e.g., a sudden temperature drop/increase in a particular temporal context, or rapid credit card transactions in unusual spatial contexts. Group anomalies, a.k.a. collective anomalies, are a subset of data instances that are anomalous as a whole w.r.t. the other data instances; the individual members of the collective anomaly may not be anomalies, e.g., exceptionally dense subgraphs formed by fake accounts in a social network are anomalies as a collection, but the individual nodes in those subgraphs can be as normal as real accounts.
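The difference between point and conditional anomalies can be illustrated with a minimal sketch. The temperature readings, the two seasonal contexts and the use of z-scores are all illustrative assumptions, not part of any method reviewed here: a reading of 25 is unremarkable over the whole year but stands out once it is compared only with other winter readings.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize values by their own mean and population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical (context, temperature) readings; index 5 is a warm winter day.
readings = [("winter", t) for t in [0, 1, -1, 0, 2, 25]] + \
           [("summer", t) for t in [24, 25, 26, 25, 24]]

# Point-anomaly view: compare every reading with all other readings.
global_z = zscores([t for _, t in readings])

# Conditional-anomaly view: compare each reading only within its own context.
contextual_z = [0.0] * len(readings)
for season in ("winter", "summer"):
    idx = [i for i, (s, _) in enumerate(readings) if s == season]
    for i, z in zip(idx, zscores([readings[j][1] for j in idx])):
        contextual_z[i] = z

print(round(global_z[5], 2), round(contextual_z[5], 2))
```

The same value receives a small global z-score and a large within-context z-score, which is why a detector built for point anomalies can miss it.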
2.2. Main Detection Challenges
The above complex problem nature leads to a number of detection challenges for traditional anomaly detection methods and widely used general deep learning methods. Some challenges, such as scalability w.r.t. data size, have been well addressed in recent years, while the following are largely unsolved, and deep anomaly detection can play an essential role in addressing them.

CH1: Low anomaly detection recall rate. Since anomalies are highly rare and heterogeneous, it is difficult to identify all of the anomalies. Many normal instances are wrongly reported as anomalies while true yet sophisticated anomalies are missed. Although a plethora of anomaly detection methods have been introduced over the years, the current state-of-the-art methods, especially unsupervised methods (e.g., (Breunig et al., 2000; Liu et al., 2012a)), still often incur high false positives on real-world datasets (Campos et al., 2016; Pang et al., 2019a). How to reduce false positives and enhance detection recall rates is one of the most important and yet difficult challenges, particularly given the significant expense of failing to spot anomalies.

CH2: Anomaly detection in high-dimensional and/or not-independent data. Anomalies often exhibit evident abnormal characteristics in a low-dimensional space yet become hidden and unnoticeable in a high-dimensional space. High-dimensional anomaly detection has been a long-standing problem (Zimek et al., 2012). Performing anomaly detection in a reduced lower-dimensional space spanned by a small subset of original features or newly constructed features is a straightforward solution, e.g., in subspace-based (Lazarevic and Kumar, 2005; Liu et al., 2012b; Keller et al., 2012; Pevnỳ, 2016) and feature selection-based methods (Pang et al., 2017; Azmandian et al., 2012; Pang et al., 2017, 2018b). However, identifying intricate (e.g., high-order, nonlinear and heterogeneous) feature interactions and couplings (Cao, 2015) may be essential in high-dimensional data yet remains a major challenge for anomaly detection. Further, guaranteeing that the new feature space preserves proper information for specific detection methods is critical to accurate downstream anomaly detection, but it is challenging due to the aforementioned unknowns and heterogeneities of anomalies. Also, it is challenging to detect anomalies from instances that may be dependent on each other, such as by temporal, spatial, graph-based and other interdependency relationships (Cao, 2015; Aggarwal, 2017; Akoglu et al., 2015; Gupta et al., 2013).

CH3: Data-efficient learning of normality/abnormality. Due to the difficulty and cost of collecting large-scale labeled anomaly data, fully supervised anomaly detection is often impractical as it assumes the availability of labeled training data with both normal and anomaly classes. In the last decade, major research efforts have been focused on unsupervised anomaly detection that does not require any labeled training data. However, unsupervised methods do not have any prior knowledge of true anomalies. They rely heavily on their assumptions about the distribution of anomalies and fail to work in datasets where those assumptions are violated. On the other hand, it is often not difficult to collect labeled normal data and some labeled anomaly data. In practice, it is often suggested to leverage such readily accessible labeled data as much as possible (Aggarwal, 2017). Thus, utilizing those labeled data to learn expressive representations of normality/abnormality is crucial for accurate anomaly detection. Semi-supervised anomaly detection, which assumes that there exists a set of labeled training data, is a research direction dedicated to this problem. (Some studies refer to methods trained with purely normal training data as unsupervised approaches. However, this setting is different from the general sense of an unsupervised setting. To avoid unnecessary confusion, following (Chandola et al., 2009; Aggarwal, 2017), these methods are referred to as semi-supervised methods hereafter.) Another research line is weakly-supervised anomaly detection, which assumes we have some labels for anomaly classes yet the class labels are partial/incomplete (i.e., they do not span the entire set of anomaly classes), inexact (i.e., coarse-grained labels), or inaccurate (i.e., some given labels can be incorrect). Two major challenges are how to learn expressive normality/abnormality representations with a small amount of labeled anomaly data and how to learn detection models that generalize to novel anomalies not covered by the given labeled anomaly data.

CH4: Noise-resilient anomaly detection. Many weakly/semi-supervised anomaly detection methods assume the given labeled training data is clean, which makes them highly vulnerable to noisy instances that are mistakenly labeled with the opposite class label. In such cases, we may use unsupervised methods instead, but this fails to utilize the genuine labeled data. Additionally, there often exists large-scale anomaly-contaminated unlabeled data. Noise-resilient models can further leverage those unlabeled data for more accurate detection. The main challenge is that the amount of noise can differ significantly across datasets and noisy instances may be irregularly distributed in the data space.

CH5: Detection of complex anomalies. Most existing methods are designed for point anomalies and cannot be used for conditional anomalies and group anomalies, since these exhibit completely different behaviors from point anomalies. One main challenge here is to incorporate the concept of conditional/group anomalies into anomaly measures/models. Also, current methods mainly focus on detecting anomalies from single data sources, while many applications require the detection of anomalies with multiple heterogeneous data sources, e.g., multidimensional data, graph, image, text and audio data. One main challenge is that some anomalies can be detected only when considering two or more data sources.

CH6: Anomaly explanation. In many critical domains there may be some major risks if anomaly detection models are directly used as black-box models. For example, the rare data instances reported as anomalies may lead to possible algorithmic bias against the minority groups presented in the data, such as under-represented groups in fraud detection and crime detection systems. An effective approach to mitigate this type of risk is to have anomaly explanation algorithms that provide straightforward clues about why a specific data instance is identified as an anomaly. Providing such explanation can be as important as detection accuracy in some applications. However, most existing anomaly detection studies focus on devising accurate detection models only, ignoring the capability of providing explanation of the identified anomalies. Deriving anomaly explanation from specific detection methods is still a largely unsolved problem, especially for complex models. Developing inherently interpretable anomaly detection models is also crucial, but balancing the model's interpretability and effectiveness remains a main challenge.
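The two error types named in CH1, missed anomalies and normal instances wrongly reported as anomalies, can be made concrete with a small sketch. The scores, labels and threshold below are made-up illustrative values, and the helper name `recall_and_fpr` is hypothetical.

```python
def recall_and_fpr(scores, labels, threshold):
    """Recall and false positive rate when instances scoring >= threshold are flagged.

    labels: 1 marks a true anomaly, 0 a normal instance.
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    tn = sum(1 for s, y in pairs if s < threshold and y == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return recall, fpr

# Illustrative detector output: two true anomalies among eight instances.
scores = [0.9, 0.8, 0.2, 0.7, 0.1, 0.3, 0.4, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 0]
recall, fpr = recall_and_fpr(scores, labels, threshold=0.6)
print(recall, fpr)
```

In this toy case one of the two anomalies receives a low score and is missed, while two normal instances are flagged. Lowering the threshold far enough to catch the missed anomaly would flag every normal instance scoring at least 0.2, which is the trade-off CH1 describes.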
3. Addressing the Challenges with Deep Anomaly Detection
3.1. Preliminaries
Deep neural networks leverage complex compositions of linear/nonlinear functions that can be represented by a computational graph to learn expressive representations (Goodfellow et al., 2016). Two basic building blocks of deep learning are activation functions and layers. Activation functions determine the output of computational graph nodes (i.e., neurons in neural networks) given some inputs. They can be linear or nonlinear functions. Some popular activation functions include linear, sigmoid, tanh, ReLU (Rectified Linear Unit) and its variants. A layer in neural networks refers to a set of neurons stacked in some form. Commonly used layers include fully connected, convolutional & pooling, and recurrent layers. These layers can be leveraged to build different popular neural networks. For example, multilayer perceptron (MLP) networks are composed of fully connected layers, convolutional neural networks (CNN) are characterized by varying groups of convolutional & pooling layers, and recurrent neural networks (RNN), e.g., vanilla RNN, gated recurrent units (GRU) and long short-term memory (LSTM), are built upon recurrent layers. See (Goodfellow et al., 2016) for a detailed introduction to these neural networks.
Given a dataset $\mathcal{X} = \{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N\}$ with $\mathbf{x}_i \in \mathbb{R}^D$, let $\mathcal{Z} \subseteq \mathbb{R}^K$ ($K \ll D$) be a representation space, then deep anomaly detection aims at learning a feature representation mapping function $\phi(\cdot): \mathcal{X} \mapsto \mathcal{Z}$ or an anomaly score learning function $\tau(\cdot): \mathcal{X} \mapsto \mathbb{R}$ in a way that anomalies can be easily differentiated from the normal data instances in the space yielded by the $\phi$ or $\tau$ function, where both $\phi$ and $\tau$ are neural network-enabled mapping functions with $H \in \mathbb{N}$ hidden layers and their weight matrices $\Theta = \{\mathbf{M}^1, \mathbf{M}^2, \cdots, \mathbf{M}^H\}$. In the case of learning the feature mapping $\phi(\cdot)$, an additional step is required to calculate the anomaly score of each data instance in the new representation space, while $\tau(\cdot)$ can directly infer the anomaly scores from raw data inputs.
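As a minimal sketch of the two building blocks, the snippet below defines a few activation functions and a fully connected layer, then stacks two layers into a small MLP that maps inputs to a lower-dimensional space. The sizes, the random weights and the helper name `dense` are illustrative assumptions; the weights are untrained.

```python
import numpy as np

def relu(a):
    """Rectified Linear Unit: keeps positive values, zeroes out the rest."""
    return np.maximum(a, 0.0)

def sigmoid(a):
    """Squashes values into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def dense(X, W, b, activation):
    """One fully connected layer: an affine map followed by an activation."""
    return activation(X @ W + b)

rng = np.random.default_rng(0)
N, D, K = 5, 6, 2                      # illustrative sizes only
X = rng.normal(size=(N, D))

# Two stacked fully connected layers form a small MLP mapping R^D to R^K.
W1, b1 = rng.normal(size=(D, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, K)), np.zeros(K)
Z = dense(dense(X, W1, b1, relu), W2, b2, np.tanh)
print(Z.shape)
```

Swapping `dense` for a convolutional or recurrent layer changes the architecture but not the overall pattern of composing layers and activations.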
3.2. Categorization of Deep Anomaly Detection
To have a thorough understanding of the area, we introduce a hierarchical taxonomy to classify existing deep anomaly detection methods into three main categories and 11 fine-grained categories from the modeling perspective. An overview of the taxonomy of the methods, together with the challenges they address, is shown in Fig. 1. Specifically, deep anomaly detection consists of three conceptual paradigms: Deep Learning for Feature Extraction, Learning Feature Representations of Normality, and End-to-end Anomaly Score Learning.
The procedure of each of these three frameworks is presented in Fig. 2. As shown in Fig. 2(a), deep learning and anomaly detection are fully separated in the first main category (Section 4), so deep learning techniques are used as independent feature extractors only. The two modules are dependent on each other in some form in the second main category (Section 5) presented in Fig. 2(b), with an objective of learning expressive representations of normality. This category of methods can be further divided into two subcategories based on whether traditional anomaly measures are incorporated into their objective functions. These two subcategories encompass seven fine-grained categories of methods, with each category taking a different approach to formulate its objective function. The two modules are fully unified in the third main category (Section 6) presented in Fig. 2(c), in which the methods are dedicated to learning anomaly scores via neural networks in an end-to-end fashion. These methods are further grouped into four categories based on the formulation of neural network-enabled anomaly scoring. In the following three sections we review the methods in each of these three categories in detail and discuss how they address some of the aforementioned challenges.
4. Deep Learning for Feature Extraction
This category of studies represents the most basic application of deep learning techniques to anomaly detection. It aims at leveraging deep learning to extract low-dimensional feature representations from high-dimensional and/or non-linearly separable data for downstream anomaly detection. The feature extraction and the anomaly scoring are fully disjointed and independent from each other. Thus, the deep learning components work purely as dimensionality reduction. Formally, the approach can be represented as
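$$\mathbf{z} = \phi(\mathbf{x}; \Theta), \tag{1}$$
where $\phi: \mathcal{X} \mapsto \mathcal{Z}$ is a deep neural network-based feature mapping function, with $\mathcal{X} \subseteq \mathbb{R}^D$, $\mathcal{Z} \subseteq \mathbb{R}^K$ and normally $D \gg K$. An anomaly scoring method $f$ that has no connection to the feature mapping $\phi$ is then applied onto the new space to calculate anomaly scores.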
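The decoupled pipeline can be sketched as follows. This is a minimal illustration under stated assumptions: a frozen two-layer network with random weights stands in for a pretrained feature extractor, and the distance to the k-th nearest neighbor stands in for an off-the-shelf anomaly scorer. Neither choice is prescribed by the methods reviewed here, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 32, 4                                   # illustrative dimensions, D >> K

# ASSUMPTION: frozen random weights stand in for a pretrained extractor.
W1 = rng.normal(scale=1.0 / np.sqrt(D), size=(D, 16))
W2 = rng.normal(scale=1.0 / np.sqrt(16), size=(16, K))

def extract_features(X):
    """Feature mapping: R^D -> R^K, never updated by the scorer."""
    return np.maximum(X @ W1, 0.0) @ W2

def knn_score(Z, k=5):
    """Anomaly score: Euclidean distance to the k-th nearest neighbor in Z."""
    diffs = Z[:, None, :] - Z[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    dists.sort(axis=1)                         # column 0 is the zero self-distance
    return dists[:, k]

# Synthetic data: 100 points around the origin plus 3 shifted points.
X = np.vstack([rng.normal(size=(100, D)), rng.normal(loc=4.0, size=(3, D))])
Z = extract_features(X)                        # step 1: feature extraction
scores = knn_score(Z)                          # step 2: independent scoring
print(Z.shape, scores.shape)
```

Because the scorer never influences the extractor, nothing guarantees that the projected space keeps the information that separates the shifted points, which is the main weakness of this category.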
Compared to the dimension reduction methods that are popular in anomaly detection, such as principal component analysis (PCA) (Schölkopf et al., 1997; Zou et al., 2006; Candès et al., 2011) and random projection (Li et al., 2006; Pevnỳ, 2016; Pang et al., 2018a), deep learning techniques have demonstrated substantially better capability in extracting semantic-rich features and non-linear feature relations (Bengio et al., 2013; Goodfellow et al., 2016).
Assumptions. The feature representations extracted by deep learning models preserve the discriminative information that helps separate anomalies from normal instances.
One research line is to directly use popular effective pretrained deep learning models, such as AlexNet (Krizhevsky et al., 2012), VGG (Simonyan and Zisserman, 2015) and ResNet (He et al., 2016), to extract low-dimensional features. This line is explored in anomaly detection in complex high-dimensional data such as image data and video data. One interesting work of this line is the unmasking framework for online anomaly detection (Tudor Ionescu et al., 2017). The key idea of the framework is to iteratively train a binary classifier to separate one set of video frames from its subsequent video frames in a sliding window, with the most discriminant features removed in each iteration step. This is analogous to an unmasking process. The framework assumes the first set of video frames to be normal and evaluates its separability from its subsequent video frames. Thus, the training classification accuracy is expected to be high if the subsequent video frames are abnormal, and low otherwise. The unmasking is an anomaly scoring process, with the change of the training accuracy used to define the anomaly scores. Clearly the power of the unmasking framework relies heavily on the quality of the features used to represent the video frames. The VGG model pretrained on the ILSVRC benchmark (Russakovsky et al., 2015) is shown to be effective in extracting expressive appearance features for this purpose (Tudor Ionescu et al., 2017). In (Liu et al., 2018a), the unmasking framework is formulated as a two-sample test task to understand its theoretical foundation. The authors also show that using features extracted from a dynamically updated sampling pool of video frames improves the performance of the framework. Additionally, similar to other tasks like classification, the feature representations extracted from the deep models pretrained on a source dataset can be transferred to fine-tune an anomaly detector on a target dataset.
As shown in (Andrews et al., 2016), one-class support vector machines (SVM) can be first initialized with the features extracted from the VGG models pretrained on the ILSVRC benchmark and then fine-tuned to improve anomaly classification on the MNIST data (LeCun et al., 1998).
Another research line in this category is to explicitly train a deep feature extraction model, rather than use a pretrained model, for the downstream anomaly scoring (Xu et al., 2015; Erfani et al., 2016; Ionescu et al., 2019; Yu et al., 2018). Particularly, in (Xu et al., 2015), three separate autoencoder networks are trained to learn low-dimensional features for respective appearance, motion, and appearance-motion joint representations for video anomaly detection. An ensemble of three one-class SVMs is independently trained on each of these learned feature representations to perform anomaly scoring. Similar to (Xu et al., 2015), a linear one-class SVM is used to enable anomaly detection on low-dimensional representations of high-dimensional tabular data yielded by deep belief networks (DBNs) (Erfani et al., 2016). Instead of one-class SVM, unsupervised classification approaches are used in (Ionescu et al., 2019) to enable anomaly scoring in the projected space. Specifically, they first cluster the low-dimensional features of video frames yielded by convolutional autoencoders and then treat the cluster labels as pseudo class labels and perform one-vs-the-rest classification to calculate the anomaly scores of frames. Similar approaches can also be found in graph anomaly detection (Yu et al., 2018), in which unsupervised clustering-based anomaly measures are used in the latent representation space to calculate the abnormality of graph vertices or edges. To learn expressive representations of graph vertices, the vertex representations are optimized by minimizing the autoencoder-based reconstruction loss and the pairwise distances of neighboring graph vertices in the representation space, with one-hot encoding of graph vertices as inputs.
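The unmasking procedure described earlier in this section can be sketched in a few lines. This is only an illustration under several assumptions that are not taken from the original work: random vectors stand in for pretrained frame features, a plain gradient-descent logistic regression is the binary classifier, two frame sets are compared once rather than over a sliding window, and the mean training accuracy across iterations is used as the score. The function names are hypothetical.

```python
import numpy as np

def train_logreg(X, y, lr=0.1, steps=200):
    """Logistic regression fitted by batch gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

def unmasking_accuracies(first, second, n_iters=5, n_remove=2):
    """Training accuracy per iteration while top-weighted features are removed."""
    X = np.vstack([first, second])
    y = np.concatenate([np.zeros(len(first)), np.ones(len(second))])
    keep = np.arange(X.shape[1])               # indices of features still in use
    accuracies = []
    for _ in range(n_iters):
        w, b = train_logreg(X[:, keep], y)
        pred = (X[:, keep] @ w + b) > 0
        accuracies.append(float((pred == y).mean()))
        top = np.argsort(-np.abs(w))[:n_remove]  # most discriminant features
        keep = np.delete(keep, top)
    return accuracies

rng = np.random.default_rng(0)
reference = rng.normal(size=(100, 16))           # frames assumed to be normal
same = rng.normal(size=(100, 16))                # later frames, same distribution
shifted = rng.normal(loc=3.0, size=(100, 16))    # later frames, shifted features

score_same = np.mean(unmasking_accuracies(reference, same))
score_shifted = np.mean(unmasking_accuracies(reference, shifted))
print(score_same, score_shifted)
```

When the later frames differ in many features, accuracy stays high even after the strongest features are removed, so the aggregated score is larger than for frames drawn from the same distribution.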
Advantages. The advantages of this group of methods are as follows. (i) A large number of state-of-the-art (pretrained) deep models and off-the-shelf anomaly detection methods are readily available. (ii) Deep feature extraction offers more powerful dimensionality reduction than popular linear methods. (iii) It is easy to implement such methods given the public availability of the deep models and detection methods.
Disadvantages. Their disadvantages are as follows. (i) The fully disjointed feature extraction and anomaly scoring often lead to suboptimal anomaly scores. (ii) Pretrained deep models are typically limited to specific types of data.
Challenges Targeted. This category of methods projects high-dimensional/non-independent data onto a substantially lower-dimensional space, enabling existing anomaly detection methods to work on a simpler data space. The lower-dimensional space often helps reveal hidden anomalies and reduces false positives (CH2). However, it should be noted that these methods may not preserve sufficient information in the projected space specifically for anomaly detection, as the data projection is fully decoupled from anomaly detection. In addition, this approach allows us to leverage multiple types of features and learn semantic-rich detection models (e.g., various predefined image/video features in (Xu et al., 2015; Tudor Ionescu et al., 2017; Ionescu et al., 2019)), which also helps reduce false positives (CH1).
5. Learning Feature Representations of Normality
This section reviews the models from the perspective of normality learning. The deep anomaly detection methods in this category couple feature learning with anomaly scoring in some way, unlike the methods in the last section, which fully decouple these two modules. These methods generally fall into two groups: generic feature learning and anomaly measure-dependent feature learning. Below we discuss these two types of approaches in detail.
5.1. Generic Normality Feature Learning
This category of methods learns the representations of data instances by optimizing a generic feature learning objective function that is not primarily designed for anomaly detection, but the learned representations can still empower the anomaly detection since they are forced to capture some key underlying data regularities. Formally, this framework can be represented as
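$$\{\Theta^{*}, \mathbf{W}^{*}\} = \arg\min_{\Theta, \mathbf{W}} \sum_{\mathbf{x} \in \mathcal{X}} \ell\big(\psi(\phi(\mathbf{x}; \Theta); \mathbf{W})\big), \tag{2}$$
$$s_{\mathbf{x}} = f(\mathbf{x}, \phi_{\Theta^{*}}, \psi_{\mathbf{W}^{*}}), \tag{3}$$
where $\phi$ maps the original data onto the representation space $\mathcal{Z}$, $\psi$ parameterized by $\mathbf{W}$ is a surrogate learning task that operates on the $\mathcal{Z}$ space and is dedicated to enforcing the learning of underlying data regularities, $\ell$ is a loss function relative to the underlying modeling approach, and $f$ is an anomaly scoring function that utilizes $\phi$ and $\psi$ with the trained parameters $\Theta^{*}$ and $\mathbf{W}^{*}$ to calculate the anomaly score $s$.
This approach includes methods that are driven by several perspectives, including data reconstruction, generative modeling, predictability modeling and self-supervised classification.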
5.1.1. Autoencoders
This type of approach aims to learn some low-dimensional feature representation space on which the given data instances can be well reconstructed. This is a widely used technique for data compression or dimension reduction (Hinton and Salakhutdinov, 2006; Jiang et al., 2014; Theis et al., 2017). The heuristic for using this technique in anomaly detection is that the learned feature representations are enforced to learn important regularities of the data to minimize reconstruction errors; anomalies are difficult to reconstruct from the resulting representations and thus have large reconstruction errors.
Assumptions. Normal data instances can be better reconstructed from the compressed feature space than anomalies.
Autoencoder (AE) networks are the commonly used techniques in this category. An AE is composed of an encoding network and a decoding network. The encoder maps the original data onto a low-dimensional feature space, while the decoder attempts to recover the data from the projected low-dimensional space. The parameters of these two networks are learned with a reconstruction loss function. A bottleneck network architecture is often used to obtain representations of lower dimensionality than the original data, which forces the model to retain the information that is important in reconstructing the data instances. To minimize the overall reconstruction error, the retained information is required to be as relevant as possible to the dominant instances, e.g., the normal instances. As a result, data instances such as anomalies, which deviate from the majority of the data, are poorly reconstructed. The data reconstruction error therefore serves well as an anomaly score. The basic formulation of this approach is given as follows.
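$$\mathbf{z} = \phi_e(\mathbf{x}; \Theta_e), \tag{4}$$
$$\hat{\mathbf{x}} = \phi_d(\mathbf{z}; \Theta_d), \tag{5}$$
$$\{\Theta_e^{*}, \Theta_d^{*}\} = \arg\min_{\Theta_e, \Theta_d} \sum_{\mathbf{x} \in \mathcal{X}} \Big\| \mathbf{x} - \phi_d\big(\phi_e(\mathbf{x}; \Theta_e); \Theta_d\big) \Big\|^2, \tag{6}$$
$$s_{\mathbf{x}} = \Big\| \mathbf{x} - \phi_d\big(\phi_e(\mathbf{x}; \Theta_e^{*}); \Theta_d^{*}\big) \Big\|^2, \tag{7}$$
where $\phi_e$ is the encoding network with the parameters $\Theta_e$ and $\phi_d$ is the decoding network with the parameters $\Theta_d$. The encoder and the decoder can share the same weight parameters to reduce the number of parameters and regularize the learning. $s_{\mathbf{x}}$ is a reconstruction error-based anomaly score of $\mathbf{x}$.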
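The reconstruction-based scoring above can be sketched with a small autoencoder written directly in NumPy. This is a minimal illustration under stated assumptions: a single tanh encoding layer, a linear decoding layer, full-batch gradient descent, and synthetic data in which most points lie near a two-dimensional subspace. The layer sizes, learning rate and number of steps are arbitrary choices, not values from any reviewed method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 points near a 2-D subspace of R^8, then 5 unstructured points.
regular = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 8))
regular += 0.05 * rng.normal(size=regular.shape)
unusual = 3.0 * rng.normal(size=(5, 8))
X = np.vstack([regular, unusual])
X = (X - X.mean(axis=0)) / X.std(axis=0)
N, D, K = X.shape[0], X.shape[1], 2            # bottleneck of size K < D

# Encoder (tanh layer) and decoder (linear layer) parameters.
W_e, b_e = 0.1 * rng.normal(size=(D, K)), np.zeros(K)
W_d, b_d = 0.1 * rng.normal(size=(K, D)), np.zeros(D)

def forward(X):
    Z = np.tanh(X @ W_e + b_e)                 # encode
    return Z, Z @ W_d + b_d                    # decode

lr, losses = 0.01, []
for _ in range(2000):
    Z, X_hat = forward(X)
    R = X_hat - X
    losses.append(float((R ** 2).sum(axis=1).mean()))   # mean reconstruction loss
    G = 2.0 * R / N                            # gradient w.r.t. the reconstruction
    dPre = (G @ W_d.T) * (1.0 - Z ** 2)        # backpropagate through tanh
    W_d -= lr * (Z.T @ G)
    b_d -= lr * G.sum(axis=0)
    W_e -= lr * (X.T @ dPre)
    b_e -= lr * dPre.sum(axis=0)

_, X_hat = forward(X)
scores = ((X - X_hat) ** 2).sum(axis=1)        # reconstruction error as anomaly score
print(losses[0], losses[-1])
```

Instances that do not follow the dominant structure are expected to reconstruct poorly and receive larger scores, although with a model this small the separation is not guaranteed.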
Several types of regularized autoencoders have been introduced to learn richer and more expressive feature representations (Makhzani and Frey, 2014; Vincent et al., 2010; Rifai et al., 2011; Doersch, 2016). Particularly, sparse AE is trained in a way that encourages sparsity in the activation units of the hidden layer, e.g., by keeping the top-$K$ most active units (Makhzani and Frey, 2014). Denoising AE (Vincent et al., 2010) aims at learning representations that are robust to small variations by learning to reconstruct data from some predefined corrupted data instances rather than the original data. Contractive AE (Rifai et al., 2011) takes a step further to learn feature representations that are robust to small variations of the instances around their neighbors, which is achieved by adding a penalty term based on the Frobenius norm of the Jacobian matrix of the encoder's activations. Variational AE (Doersch, 2016) instead introduces regularization into the representation space by encoding data instances using a prior distribution over the latent space, preventing overfitting and ensuring some good properties of the learned space for enabling generation of meaningful data instances.
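As a hedged illustration, the objectives of the denoising and contractive variants can be written in a common form. The notation is an assumption made here for readability: $\phi_e$ and $\phi_d$ denote the encoder and decoder with parameters $\Theta_e$ and $\Theta_d$, $\tilde{\mathbf{x}}$ is a corrupted copy of $\mathbf{x}$ drawn from a chosen corruption process $q$, $J_{\phi_e}(\mathbf{x})$ is the Jacobian of the encoder output with respect to its input, and $\lambda > 0$ is a penalty weight.

```latex
\begin{align*}
\text{Denoising AE:}\quad
  & \min_{\Theta_e, \Theta_d} \sum_{\mathbf{x} \in \mathcal{X}}
    \mathbb{E}_{\tilde{\mathbf{x}} \sim q(\tilde{\mathbf{x}} \mid \mathbf{x})}
    \Big\| \mathbf{x} - \phi_d\big(\phi_e(\tilde{\mathbf{x}}; \Theta_e); \Theta_d\big) \Big\|^2 \\
\text{Contractive AE:}\quad
  & \min_{\Theta_e, \Theta_d} \sum_{\mathbf{x} \in \mathcal{X}}
    \Big\| \mathbf{x} - \phi_d\big(\phi_e(\mathbf{x}; \Theta_e); \Theta_d\big) \Big\|^2
    + \lambda \big\| J_{\phi_e}(\mathbf{x}) \big\|_F^2
\end{align*}
```

In both cases the reconstruction target remains the clean instance $\mathbf{x}$; the variants differ only in how they discourage the representation from reacting to small input perturbations.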
AEs are easy to implement and have straightforward intuition in detecting anomalies. As a result, they have been widely explored in the literature. Replicator neural network (Hawkins et al., 2002) is the first work to explore the idea of data reconstruction to detect anomalies, with experiments focused on static multidimensional/tabular data. The Replicator network is built upon a feed-forward multi-layer perceptron with three hidden layers. It uses parameterized hyperbolic tangent activation functions to obtain different activation levels for different input values, which helps discretize the intermediate representations into some predefined bins. As a result, the hidden layers naturally cluster the data instances into a number of groups, enabling the detection of clustered anomalies. After this work there have been a number of studies dedicated to further enhancing the performance of AEs in anomaly detection. For instance, RandNet (Chen et al., 2017) further enhances the basic AEs by learning an ensemble of AEs. In RandNet, a set of independent AEs are trained, with each AE having some randomly selected constant dropout connections. An adaptive sampling strategy is used by exponentially increasing the sample size of the mini-batches. RandNet is focused on tabular data. The idea of autoencoder ensembles is extended to time series data in (Kieu et al., 2019). Motivated by robust principal component analysis (RPCA), RDA (Zhou and Paffenroth, 2017) attempts to improve the robustness of AEs by iteratively decomposing the original data into two subsets, a normal instance set and an anomaly set. This is achieved by adding a sparsity penalty or grouped penalty into its RPCA-alike objective function to regularize the coefficients of the anomaly set.
AEs are also widely leveraged to detect anomalies in data other than tabular data, such as sequence data (Lu et al., 2017), graph data (Ding et al., 2019) and image/video data (Xu et al., 2015). In general, there are two types of adaptations of AEs to those complex data. The most straightforward way is to follow the same procedure as the conventional use of AEs, with the exception that a particular network architecture tailored for a specific type of data is required to learn effective low-dimensional feature representations, such as CNN-AE (Hasan et al., 2016; Zhang et al., 2019), LSTM-AE (Malhotra et al., 2016), Conv-LSTM-AE (Luo et al., 2017) and GCN (graph convolutional network)-AE (Ding et al., 2019). This type of AE embeds the encoder-decoder scheme into the full procedure of these methods. Another type of AE-based approach is to first use AEs to learn low-dimensional representations of the complex data and then learn to predict these learned representations. The learning of AEs and the representation prediction are often two separate steps. These approaches are different from the first type of approaches in that the prediction of representations is wrapped around the low-dimensional representations yielded by AEs. For example, in (Lu et al., 2017), denoising AE is combined with RNNs to learn normal patterns of multivariate sequence data, in which a denoising AE with two hidden layers is first used to learn representations of multidimensional data inputs in each time step and an RNN with a simple single hidden layer is then trained to predict the representations yielded by the denoising AE. A similar approach is also used for detecting acoustic anomalies (Marchi et al., 2015), in which a more complex RNN, bidirectional LSTMs, is used.
Advantages. The advantages of data reconstruction-based methods are as follows. (i) The idea of AEs is straightforward and generic to different types of data. (ii) Different types of powerful AE variants can be leveraged to perform anomaly detection.
Disadvantages. Their disadvantages are as follows. (i) The learned feature representations can be biased by infrequent regularities and the presence of outliers or anomalies in the training data. (ii) The objective function of the data reconstruction is designed for dimension reduction or data compression, rather than anomaly detection. As a result, the resulting representations are a generic summarization of underlying regularities, which are not optimized for detecting irregularities.
Challenges Targeted. Different types of neural network layers and architectures can be used under the AE framework, allowing us to detect anomalies in high-dimensional data, as well as non-independent data such as attributed graph data (Ding et al., 2019) and multivariate sequence data (Marchi et al., 2015; Lu et al., 2017) (CH2). These methods may reduce false positives over traditional methods built upon handcrafted features if the learned representations are more expressive (CH1). AEs are generally vulnerable to noise present in the training data, as they can be trained to memorize that noise, leading to severe overfitting and small reconstruction errors of anomalies. The idea of RPCA may be incorporated into AEs to train more robust detection models (Zhou and Paffenroth, 2017) (CH4).
5.1.2. Generative Adversarial Networks
GAN-based anomaly detection emerged quickly as one of the popular deep anomaly detection approaches after its early use in (Schlegl et al., 2017). This approach generally aims to learn a latent feature space of the generative network so that the latent space well captures the normality underlying the given data. Some form of residual between the real instance and the generated instance is then defined as the anomaly score.
Assumption. Normal data instances can be better generated than anomalies from the latent feature space of the generative network in GANs.
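One of the early methods is AnoGAN (Schlegl et al., 2017). The key intuition is that, given any data instance $\mathbf{x}$, it aims to search for an instance $\mathbf{z}$ in the learned latent feature space of the generative network $G$ so that the corresponding generated instance $G(\mathbf{z})$ and $\mathbf{x}$ are as similar as possible. Since the latent space is enforced to capture the underlying distribution of training data, anomalies are expected to be less likely to have highly similar generated counterparts than normal instances. Specifically, a GAN is first trained with the following conventional objective:
$$\min_{G}\max_{D} V(D,G) = \mathbb{E}_{\mathbf{x}\sim p_{\mathcal{X}}}\big[\log D(\mathbf{x})\big] + \mathbb{E}_{\mathbf{z}\sim p_{\mathcal{Z}}}\big[\log\big(1 - D(G(\mathbf{z}))\big)\big], \tag{8}$$
where $G$ and $D$ are respectively the generator and discriminator networks parameterized by $\Theta_G$ and $\Theta_D$ (the parameters are omitted for brevity), and $V$ is the value function of the two-player minimax game. After that, for each $\mathbf{x}$, to find its best $\mathbf{z}$, two loss functions, namely residual loss and discrimination loss, are used to guide the search. The residual loss is defined as
$$\ell_R(\mathbf{x}, \mathbf{z}_\gamma) = \|\mathbf{x} - G(\mathbf{z}_\gamma)\|_1, \tag{9}$$
while the discrimination loss is defined based on the feature matching technique (Salimans et al., 2016):
$$\ell_{fm}(\mathbf{x}, \mathbf{z}_\gamma) = \|h(\mathbf{x}) - h(G(\mathbf{z}_\gamma))\|_1, \tag{10}$$
where $\gamma$ is the index of the search iteration step and $h$ is a feature mapping from an intermediate layer of the discriminator. The search starts with a randomly sampled $\mathbf{z}$, followed by updating $\mathbf{z}$ based on the gradients derived from the overall loss $(1-\alpha)\ell_R(\mathbf{x}, \mathbf{z}_\gamma) + \alpha\ell_{fm}(\mathbf{x}, \mathbf{z}_\gamma)$, where $\alpha$ is a hyperparameter. Throughout this search process, the parameters of the trained GAN are fixed; the loss is only used to update the coefficients of $\mathbf{z}$ for the next iteration. The anomaly score is accordingly defined upon the similarity between $\mathbf{x}$ and $\mathbf{z}$ obtained at the last step $\gamma^*$:
$$s_{\mathbf{x}} = (1-\alpha)\ell_R(\mathbf{x}, \mathbf{z}_{\gamma^*}) + \alpha\ell_{fm}(\mathbf{x}, \mathbf{z}_{\gamma^*}). \tag{11}$$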
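The latent search and scoring of Eqs. (9)-(11) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the AnoGAN implementation: the generator `G` is a fixed linear map and the feature mapping `h` is an element-wise `tanh`, both hypothetical stand-ins for a trained GAN; gradients are approximated by finite differences instead of backpropagation; and the best loss seen during the search is returned, a small deviation from using the last step.

```python
import numpy as np

def anogan_score(x, G, h, z0, alpha=0.1, lr=0.05, steps=200, eps=1e-4):
    """Search the latent space for z and return (score, z), following Eqs. (9)-(11).

    G and h stay fixed during the search; only z is updated.
    """
    def loss(z):
        gz = G(z)
        l_r = np.abs(x - gz).sum()          # residual loss, Eq. (9)
        l_fm = np.abs(h(x) - h(gz)).sum()   # feature matching loss, Eq. (10)
        return (1 - alpha) * l_r + alpha * l_fm

    z = np.asarray(z0, dtype=float).copy()
    best_loss, best_z = loss(z), z.copy()
    basis = np.eye(len(z))
    for _ in range(steps):
        # Central finite differences stand in for backpropagated gradients.
        grad = np.array([(loss(z + eps * e) - loss(z - eps * e)) / (2 * eps)
                         for e in basis])
        z = z - lr * grad
        current = loss(z)
        if current < best_loss:             # assumption: keep the best z seen
            best_loss, best_z = current, z.copy()
    return best_loss, best_z


# Hypothetical stand-ins for a trained generator and discriminator features.
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
G = lambda z: W @ z
h = np.tanh

x_normal = W @ np.array([1.0, -0.5])                           # lies on the generator's manifold
x_anomaly = x_normal + 2.0 * np.array([-2.0, 0.0, 1.0, 1.0])   # pushed off the manifold

s_normal, _ = anogan_score(x_normal, G, h, z0=np.zeros(2))
s_anomaly, _ = anogan_score(x_anomaly, G, h, z0=np.zeros(2))
```

In this toy setup the off-manifold instance cannot be matched by any latent code, so its score stays well above that of the on-manifold instance. The per-instance loop also makes the cost of the iterative search visible.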
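One main issue with AnoGAN is the computational inefficiency in the iterative search of $\mathbf{z}$. One effective way to address this issue is to add an extra network that learns the mapping from data instances onto the latent space, i.e., an inverse of the generator, resulting in methods like EBGAN (Zenati et al., 2018a) and fast AnoGAN (Schlegl et al., 2019). These two methods share the same spirit. Here we focus on EBGAN, which is built upon the bi-directional GAN (BiGAN) (Donahue et al., 2017). Particularly, in addition to the generator $G$ and discriminator $D$, BiGAN has an encoder $E$ to map $\mathbf{x}$ to $\mathbf{z}$ in the latent space, and simultaneously learns the parameters of $G$, $D$ and $E$. Instead of discriminating $\mathbf{x}$ and $G(\mathbf{z})$, BiGAN aims to discriminate the pair of instances $(\mathbf{x}, E(\mathbf{x}))$ from the pair $(G(\mathbf{z}), \mathbf{z})$:
$$\min_{G,E}\max_{D}\; \mathbb{E}_{\mathbf{x}\sim p_{\mathcal{X}}}\Big[\mathbb{E}_{\mathbf{z}\sim p_E(\cdot\mid\mathbf{x})}\log D(\mathbf{x},\mathbf{z})\Big] + \mathbb{E}_{\mathbf{z}\sim p_{\mathcal{Z}}}\Big[\mathbb{E}_{\mathbf{x}\sim p_G(\cdot\mid\mathbf{z})}\log\big(1 - D(\mathbf{x},\mathbf{z})\big)\Big]. \tag{12}$$
After the training, inspired by Eq. (11) in AnoGAN, EBGAN defines the anomaly score as:
$$s_{\mathbf{x}} = (1-\alpha)\ell_G(\mathbf{x}) + \alpha\ell_D(\mathbf{x}), \tag{13}$$
where $\ell_G(\mathbf{x}) = \|\mathbf{x} - G(E(\mathbf{x}))\|_1$ and $\ell_D(\mathbf{x}) = \|h(\mathbf{x}, E(\mathbf{x})) - h(G(E(\mathbf{x})), E(\mathbf{x}))\|_1$. This eliminates the need to iteratively search $\mathbf{z}$ in AnoGAN. EBGAN is extended to a method called ALAD (Zenati et al., 2018b) by adding two more discriminators, with one discriminator trying to discriminate the pair $(\mathbf{x}, \mathbf{x})$ from $(\mathbf{x}, G(E(\mathbf{x})))$ and another one trying to discriminate the pair $(\mathbf{z}, \mathbf{z})$ from $(\mathbf{z}, E(G(\mathbf{z})))$.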
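To make the contrast with the iterative search concrete, the encoder-based score of Eq. (13) can be sketched as a single forward pass. The names below are illustrative assumptions: `G` is a linear map, `E` is its pseudo-inverse, and `h` is a `tanh` over the concatenated pair, standing in for a trained generator, encoder and intermediate discriminator layer.

```python
import numpy as np

def encoder_based_score(x, G, E, h, alpha=0.1):
    """Anomaly score of Eq. (13): one encoder pass replaces the latent search."""
    z = E(x)
    x_rec = G(z)
    l_g = np.abs(x - x_rec).sum()                # reconstruction term l_G
    l_d = np.abs(h(x, z) - h(x_rec, z)).sum()    # discriminator-feature term l_D
    return (1 - alpha) * l_g + alpha * l_d


# Hypothetical stand-ins for the trained networks.
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
W_pinv = np.linalg.pinv(W)
G = lambda z: W @ z
E = lambda x: W_pinv @ x
h = lambda x, z: np.tanh(np.concatenate([x, z]))

x_normal = W @ np.array([1.0, -0.5])                     # reproducible by G
x_anomaly = x_normal + np.array([-2.0, 0.0, 1.0, 1.0])   # component G cannot produce

s_normal = encoder_based_score(x_normal, G, E, h)
s_anomaly = encoder_based_score(x_anomaly, G, E, h)
```

Because the encoder maps both inputs to the same latent code here, the whole score difference comes from the part of the anomalous input that the generator cannot reproduce.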
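GANomaly (Akcay et al., 2018) further improves the generator over the previous work by changing the generator network to an encoder-decoder-encoder network and adding two extra loss functions. The generator can be conceptually represented as $\mathbf{x} \xrightarrow{G_E} \mathbf{z} \xrightarrow{G_D} \hat{\mathbf{x}} \xrightarrow{E} \hat{\mathbf{z}}$, in which $G$ is a composition of the encoder $G_E$ and the decoder $G_D$. In addition to the commonly used feature matching loss:
$$\ell_{fm} = \mathbb{E}_{\mathbf{x}\sim p_{\mathcal{X}}}\|h(\mathbf{x}) - h(G(\mathbf{x}))\|_2, \tag{14}$$
the generator includes a contextual loss and an encoding loss to generate more realistic instances:
$$\ell_{con} = \mathbb{E}_{\mathbf{x}\sim p_{\mathcal{X}}}\|\mathbf{x} - G(\mathbf{x})\|_1, \tag{15}$$
$$\ell_{enc} = \mathbb{E}_{\mathbf{x}\sim p_{\mathcal{X}}}\|G_E(\mathbf{x}) - E(G(\mathbf{x}))\|_2. \tag{16}$$
The contextual loss in Eq. (15) enforces the generator to consider the contextual information of the input $\mathbf{x}$ when generating $\hat{\mathbf{x}}$. The encoding loss in Eq. (16) helps the generator to learn how to encode the features of the generated instances from the training data. The overall loss of the generator is then defined as
$$\ell = \alpha\ell_{fm} + \beta\ell_{con} + \gamma\ell_{enc}, \tag{17}$$
where $\alpha$, $\beta$ and $\gamma$ are the hyperparameters that determine the weight of each individual loss. Since the training data contains mainly normal instances, the encoders $G_E$ and $E$ are optimized towards the encoding of normal instances, and thus, the anomaly score can be defined as
$$s_{\mathbf{x}} = \|G_E(\mathbf{x}) - E(G(\mathbf{x}))\|_1, \tag{18}$$
in which $s_{\mathbf{x}}$ is expected to be large if $\mathbf{x}$ is an anomaly.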
There have been a number of other GANs introduced over the years, such as Wasserstein GAN (Arjovsky et al., 2017) and Cycle GAN (Zhu et al., 2017). They may be used to further enhance the anomaly detection performance of the above methods, e.g., by replacing the standard GAN with Wasserstein GAN (Schlegl et al., 2019). Another relevant research line is to adversarially learn end-to-end one-class classification models, which is categorized into the end-to-end anomaly score learning framework and discussed in Section 6.4.
Advantages. The advantages of these methods are as follows. (i) GANs have demonstrated superior capability in generating realistic instances, especially on image data, empowering the detection of abnormal instances that are poorly reconstructed from the latent space. (ii) A large number of existing GAN-based models and theories (Creswell et al., 2018) may be adapted for anomaly detection.
Disadvantages. Their disadvantages are as follows. (i) The training of GANs can suffer from multiple problems, such as failure to converge and mode collapse (Metz et al., 2017), which makes GAN-based anomaly detection models difficult to train. (ii) The generator network can be misled and generate data instances outside the manifold of normal instances, especially when the true distribution of the given dataset is complex and/or the training data contains unexpected outliers. (iii) The GAN-based anomaly scores can be suboptimal since they are built upon the generator network, whose objective is designed for data synthesis rather than anomaly detection.
Challenges Targeted. Similar to AEs, GAN-based anomaly detection is able to detect high-dimensional anomalies by examining the reconstruction from the learned low-dimensional latent space (CH2). When the latent space preserves important anomaly discrimination information, it helps improve detection accuracy over that in the original data space (CH1).
5.1.3. Predictability Modeling
Predictability modeling-based anomaly detection methods learn feature representations by predicting the current data instances using the representations of the previous instances within a temporal window as the context. In this section, data instances refer to individual elements in a sequence, e.g., video frames in a video sequence. This technique is widely used for sequence representation learning and prediction (Sutskever et al., 2014; Mathieu et al., 2016; Hsieh et al., 2018; Liao et al., 2018b). To achieve accurate predictions, the feature representations are enforced to capture the temporal/sequential and recurrent dependence within a given sequence length. Normal instances generally adhere to such dependencies and can be well predicted, whereas anomalies often violate those dependencies and are unpredictable. Therefore, the prediction errors, e.g., measured by mean squared errors or likelihood values, can be used to define the anomaly scores.
Assumption. Normal instances are more predictable than anomalies given some temporally dependent contexts.
This research line is popular in video anomaly detection (Liu et al., 2018b; Ye et al., 2019; Abati et al., 2019). Video sequences involve complex high-dimensional spatial-temporal features. Different constraints over appearance and motion features are needed in the prediction objective function to ensure a faithful prediction of video frames. This deep anomaly detection approach is initially explored in (Liu et al., 2018b). Formally, given a video sequence with $t$ consecutive frames $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_t$, the learning task is to use all these frames to generate a future frame $\hat{\mathbf{x}}_{t+1}$ so that $\hat{\mathbf{x}}_{t+1}$ is as close as possible to the ground truth $\mathbf{x}_{t+1}$. Its general objective function can be formulated as
$$\alpha\,\ell_{pred}\big(\hat{\mathbf{x}}_{t+1}, \mathbf{x}_{t+1}\big) + \beta\,\ell_{adv}\big(\hat{\mathbf{x}}_{t+1}\big), \tag{19}$$
where $\hat{\mathbf{x}}_{t+1} = \psi\big(\phi(\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_t; \Theta); \mathbf{W}\big)$, $\ell_{pred}$ is the frame prediction loss measured by mean squared errors, and $\ell_{adv}$ is an adversarial loss. The popular network architecture named U-Net (Ronneberger et al., 2015) is used to instantiate the $\psi$ function for the frame generation. $\ell_{pred}$ is composed of three separate losses that respectively enforce the closeness between $\hat{\mathbf{x}}_{t+1}$ and $\mathbf{x}_{t+1}$ in three key image feature descriptors: intensity, gradient and optical flow. $\ell_{adv}$ is due to the use of adversarial training to enhance the image generation. After training, for a given video frame $\mathbf{x}$, a normalized Peak Signal-to-Noise Ratio (Mathieu et al., 2016) based on the prediction difference $\|\mathbf{x} - \hat{\mathbf{x}}\|_2$ is used to define the anomaly score. Under the same framework, an additional autoencoder-based reconstruction network is added in (Ye et al., 2019) to further refine the predicted frame quality, which helps to enlarge the anomaly score difference between normal and abnormal frames.
Another research line in this direction is based on the autoregressive models (Gregor et al., 2014) that assume each element in a sequence is linearly dependent on the previous elements. The autoregressive models are leveraged in (Abati et al., 2019) to estimate the density of training samples in a latent space, which helps avoid the assumption of a specific family of distributions. Specifically, given $\mathbf{x}$ and its latent space representation $\mathbf{z} = \phi(\mathbf{x}; \Theta)$, the autoregressive model factorizes $p(\mathbf{z})$ as
$$p(\mathbf{z}) = \prod_{j=1}^{K} p\big(z_j \mid z_{1:j-1}\big), \tag{20}$$
where $z_{1:j-1} = \{z_1, z_2, \cdots, z_{j-1}\}$, $p(z_j \mid z_{1:j-1})$ represents the probability mass function of $z_j$ conditioned on all the previous instances $z_{1:j-1}$ and $K$ is the dimensionality size of the latent space. The objective in (Abati et al., 2019) is to jointly learn an autoencoder and a density estimation network $h(\cdot; \mathbf{W})$ equipped with autoregressive network layers. The overall loss can be represented as
$$L = \mathbb{E}_{\mathbf{x}}\Big[\big\|\mathbf{x} - \phi_d\big(\phi_e(\mathbf{x}; \Theta_e); \Theta_d\big)\big\|_2 - \lambda\log\big(h(\mathbf{z}; \mathbf{W})\big)\Big], \tag{21}$$
where $\phi_e$ and $\phi_d$ are the encoder and decoder of the autoencoder with $\mathbf{z} = \phi_e(\mathbf{x}; \Theta_e)$, the first term is a reconstruction error measured by MSE, and the second term is an autoregressive loss measured by the log-likelihood of the representation under an estimated conditional probability density prior. Minimizing this loss enables the learning of the features that are common and easily predictable. At the evaluation stage, the reconstruction error and the log-likelihood are combined to define the anomaly score.
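As a rough illustration of the factorization in Eq. (20) and of how a reconstruction error and a log-likelihood can be combined into one score, the sketch below uses a linear-Gaussian autoregressive model. This is an assumption made for brevity, not the density estimator of (Abati et al., 2019): the coefficient matrix `coefs`, the noise scale `sigma` and the weight `lam` are hypothetical, and continuous Gaussian conditionals replace the probability mass functions of the original method.

```python
import numpy as np

def ar_log_likelihood(z, coefs, sigma=1.0):
    """log p(z) = sum_j log p(z_j | z_{1:j-1}) with linear-Gaussian conditionals."""
    z = np.asarray(z, dtype=float)
    total = 0.0
    for j in range(len(z)):
        # The conditional mean depends only on the earlier dimensions z_{1:j-1}.
        mean = float(coefs[j, :j] @ z[:j])
        total += -0.5 * np.log(2 * np.pi * sigma ** 2) \
                 - 0.5 * ((z[j] - mean) / sigma) ** 2
    return total

def anomaly_score(x, x_rec, z, coefs, lam=1.0):
    """Reconstruction error minus the weighted log-likelihood of the latent code."""
    rec_err = float(np.linalg.norm(np.asarray(x) - np.asarray(x_rec)))
    return rec_err - lam * ar_log_likelihood(z, coefs)


# Hypothetical model in which each latent dimension repeats the previous one.
K = 4
coefs = np.zeros((K, K))
for j in range(1, K):
    coefs[j, j - 1] = 1.0

z_smooth = np.array([0.5, 0.5, 0.5, 0.5])    # follows the learned dependency
z_jumpy = np.array([0.5, 3.0, -3.0, 3.0])    # violates it

x = np.ones(3)
x_rec = np.ones(3)                           # same reconstruction for both codes
s_smooth = anomaly_score(x, x_rec, z_smooth, coefs)
s_jumpy = anomaly_score(x, x_rec, z_jumpy, coefs)
```

With identical reconstructions, the code that violates the learned dependency receives the lower likelihood and hence the larger score.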
Advantages. The advantages of this category of methods are as follows. (i) A number of sequence learning techniques can be adapted and incorporated into this approach. (ii) This approach enables the learning of different types of temporal and spatial dependencies.
Disadvantages. Their disadvantages are as follows. (i) This approach is limited to anomaly detection in sequence data. (ii) The sequential predictions can be computationally expensive. (iii) The learned representations may be suboptimal for anomaly detection, as the underlying objective is designed for sequential prediction rather than anomaly detection.
Challenges Targeted. This approach is particularly designed to learn expressive temporally-dependent low-dimensional representations, which helps address the false positives of anomaly detection in high-dimensional and/or temporal datasets (CH1 & CH2). The prediction here is conditioned on some elapsed temporal instances, so this category of methods is able to detect temporal context-based conditional anomalies (CH5).
5.1.4. Self-supervised Classification
This approach learns representations of normality by building self-supervised classification models and identifies instances that are inconsistent with the classification models as anomalies. This approach is rooted in the cross-feature analysis or feature model-based anomaly detection (Huang et al., 2003; Noto et al., 2012; Tenenboim-Chekina et al., 2013). These studies evaluate the normality of data instances by their consistency/agreement with a set of predictive (classification/regression) models, with each model learning to predict one feature based on the rest of the features. The consistency of a given test instance can be measured by the average number of correct predictions or average prediction probability (Huang et al., 2003), the log loss-based surprisal (Noto et al., 2012), or the majority voting of binary decisions (Tenenboim-Chekina et al., 2013) given the classification/regression models across all features. Unlike these studies, which focus on tabular data and build the feature models using the original data, deep consistency-based anomaly detection focuses on image data and builds the predictive models by using feature transformation-based augmented data. To effectively discriminate the transformed instances, the classification models are enforced to learn features that are highly important to describe the underlying patterns of the instances presented in the training data. Therefore, normal instances generally have stronger agreement with the classification models.
Assumption. Normal instances are more consistent with augmented self-supervised predictive models than anomalies.
This approach is initially explored in (Golan and El-Yaniv, 2018). To build the predictive models, different compositions of geometric feature transformation operations, including horizontal flipping, translations, and rotations, are first applied to a set of normal training images. A deep multi-class classification model is trained on the augmented data, treating data instances augmented with a specific transformation operation as coming from the same class, i.e., a synthetic class. At the evaluation stage, test instances are augmented with each of the transformation compositions, and their normality score is defined by an aggregation of all softmax classification scores on the transformed versions of a given test instance. Its loss function is defined as
$$L_{cons} = CE\big(\eta(\mathbf{z}_{T_j}; \mathbf{W}), \mathbf{y}_{T_j}\big), \tag{22}$$
where $\mathbf{z}_{T_j} = \phi\big(T_j(\mathbf{x}); \Theta\big)$ is a low-dimensional feature representation of instance $\mathbf{x}$ augmented by the transformation operation type $T_j$, $\eta(\cdot)$ is a multi-class classifier parameterized with $\mathbf{W}$, $\mathbf{y}_{T_j}$ is a one-hot encoding of the synthetic class assigned to instances that are augmented using the transformation operation $T_j$, and $CE$ is a standard cross-entropy loss function.
By minimizing Eq. (22), we obtain the representations that are optimized for the classifier $\eta$. We can then apply the feature learner $\phi(\cdot; \Theta^*)$ and the classifier $\eta(\cdot; \mathbf{W}^*)$ to obtain a classification score for each test instance augmented with a transformation operation $T_j$. The classification scores of each test instance w.r.t. different $T_j$ are then aggregated to compute the anomaly score. To achieve that, the classification scores conditioned on each $T_j$ are assumed to follow a Dirichlet distribution in (Golan and El-Yaniv, 2018) to estimate the consistency of the test instance with the classification model $\eta$. In fact, as shown in (Golan and El-Yaniv, 2018), a simple average of the classification scores associated with different $T_j$ works similarly well as the Dirichlet-based anomaly score.
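The simple-average variant of this scoring can be sketched as follows. The setup is a toy assumption: the transformations are the four 90-degree rotations only, and `toy_classifier` is a hand-written stand-in for the trained pair of feature learner and classifier that reads the four image corners instead of using learned features.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def normality_score(x, transforms, classifier):
    """Average softmax probability assigned to the transformation actually applied."""
    probs = [softmax(classifier(t(x)))[j] for j, t in enumerate(transforms)]
    return float(np.mean(probs))


# Four synthetic classes: rotation by 0, 90, 180 and 270 degrees.
transforms = [lambda img, k=k: np.rot90(img, k) for k in range(4)]

def toy_classifier(img):
    """Hypothetical classifier: logits read from the corners that a bright
    top-left pixel reaches under each of the four rotations."""
    corners = np.array([img[0, 0], img[-1, 0], img[-1, -1], img[0, -1]])
    return 5.0 * corners

normal_img = np.zeros((4, 4))
normal_img[0, 0] = 1.0                   # has the pattern the classifier relies on
anomalous_img = np.full((4, 4), 0.5)     # rotation-invariant, so rotations are indistinguishable

n_normal = normality_score(normal_img, transforms, toy_classifier)
n_anomalous = normality_score(anomalous_img, transforms, toy_classifier)
```

The anomalous image gives the classifier no cue about which rotation was applied, so its average probability falls to chance level (one over the number of transformations), while the normal image is classified confidently.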
A semi-supervised setting, i.e., training data contains normal instances only, is assumed in (Golan and El-Yaniv, 2018). A similar idea is explored in the unsupervised setting in (Wang et al., 2019), with the transformation sets containing four transformation operations, i.e., rotation, flipping, shifting and patch re-arranging. Two key insights revealed in (Wang et al., 2019) are that (i) the gradient magnitude induced by normal instances is normally substantially larger than that induced by outliers during the training of such self-supervised multi-class classification models; and (ii) the network updating direction is also biased towards normal instances. As a result of these two properties, normal instances often have stronger agreement with the classification model than anomalies. Three strategies of using the classification scores to define the anomaly scores are evaluated, including average prediction probability, maximum prediction probability, and negative entropy across all prediction probabilities (Wang et al., 2019). Their results show that the negative entropy-based anomaly scores generally perform better than the other two strategies.
Advantages. The advantages of deep consistency-based methods are as follows. (i) They work well in both the unsupervised and semi-supervised settings. (ii) Anomaly scoring is grounded in some intrinsic properties of the gradient magnitude and its updating.
Disadvantages. Their disadvantages are as follows. (i) The feature transformation operations are often data-dependent. The above transformation operations are applicable to image data only. Different transformation operations need to be explored to adapt this approach to other types of data. (ii) Although the classification model is trained in an end-to-end manner, the consistency-based anomaly scores are derived from the classification scores rather than being an integrated unit in the optimization, and thus they may be suboptimal.
Challenges Targeted. The expressive low-dimensional representations of normality this approach learns help detect anomalies better than in the original high-dimensional space (CH1 & CH2). Due to some intrinsic differences between anomalies and normal instances presented in the self-supervised classifiers, this approach is also able to work in an unsupervised setting (Wang et al., 2019), demonstrating good robustness to anomaly contamination in the training data (CH4).
5.2. Anomaly Measure-dependent Feature Learning
Anomaly measure-dependent feature learning aims at learning feature representations that are specifically optimized for one particular existing anomaly measure. Formally, the framework for this group of methods can be represented as
$$\{\Theta^*, \mathbf{W}^*\} = \arg\min_{\Theta,\,\mathbf{W}} \sum_{\mathbf{x}\in\mathcal{X}} \ell\Big(f\big(\phi(\mathbf{x}; \Theta); \mathbf{W}\big)\Big), \tag{23}$$
$$s_{\mathbf{x}} = f\big(\phi(\mathbf{x}; \Theta^*); \mathbf{W}^*\big), \tag{24}$$
where $f$ is an existing anomaly scoring measure operating on the representation space. Note that whether $f$ involves trainable parameters $\mathbf{W}$ or not depends on the anomaly measure used. Different from the generic feature learning approach as in Eqs. (2)-(3), which calculates anomaly scores based on some heuristics after obtaining the learned representations, this research line directly incorporates an existing anomaly measure $f$ into the feature learning objective function to optimize the feature representations specifically for $f$. Below we review representation learning specifically designed for three types of popular anomaly measures, including distance-based measure, one-class classification measure and clustering-based measure.
5.2.1. Distance-based Measure
Deep distance-based anomaly detection aims to learn feature representations that are specifically optimized for a specific type of distance-based anomaly measure. Distance-based methods are straightforward and easy to implement. A number of effective distance-based anomaly measures have been introduced, e.g., DB outliers (Knorr and Ng, 1999; Knorr et al., 2000), nearest neighbor distance (Ramaswamy et al., 2000a, b), average nearest neighbor distance (Angiulli and Pizzuti, 2002), relative distance (Zhang et al., 2009), and random nearest neighbor distance (Sugiyama and Borgwardt, 2013; Pang et al., 2015). One major limitation of these traditional distance-based anomaly measures is that they fail to work effectively in high-dimensional data due to the curse of dimensionality. Since deep distance-based anomaly detection techniques project data onto a low-dimensional space before applying the distance measures, they can overcome this limitation.
Assumption. Anomalies are distributed far from their closest neighbors while normal instances are located in dense neighborhoods.
Deep distance-based anomaly detection is first explored in (Pang et al., 2018a), in which the random nearest neighbor distance-based anomaly measure (Sugiyama and Borgwardt, 2013; Pang et al., 2015) is leveraged to drive the learning of low-dimensional representations out of ultrahigh-dimensional data. Particularly, the key idea is that the representations are optimized so that the nearest neighbor distance of pseudo-labeled anomalies in random subsamples is substantially larger than that of pseudo-labeled normal instances. The pseudo labels are generated by some off-the-shelf anomaly detection methods. Let $\mathcal{S} \subset \mathcal{X}$ be a subset of data instances randomly sampled from the dataset $\mathcal{X}$, and let $\mathcal{A}$ and $\mathcal{N}$ respectively be the pseudo-labeled anomaly and normal instance sets, with $\mathcal{X} = \mathcal{A} \cup \mathcal{N}$ and $\emptyset = \mathcal{A} \cap \mathcal{N}$. Its loss function is built upon the hinge loss function (Rosasco et al., 2004) and can be represented as
$$L_{query} = \frac{1}{|\mathcal{X}|}\sum_{\mathbf{x}\in\mathcal{A},\,\mathbf{x}'\in\mathcal{N}} \max\big\{0,\; m + f(\mathbf{x}', \mathcal{S}; \Theta) - f(\mathbf{x}, \mathcal{S}; \Theta)\big\}, \tag{25}$$
where $m$ is a predefined constant for the margin between two distances yielded by $f(\mathbf{x}, \mathcal{S}; \Theta)$, which is a random nearest neighbor distance function operated in the representation space:
$$f(\mathbf{x}, \mathcal{S}; \Theta) = \min_{\mathbf{x}'\in\mathcal{S}} \big\|\phi(\mathbf{x}; \Theta) - \phi(\mathbf{x}'; \Theta)\big\|_2. \tag{26}$$
$L_{query}$ is a hinge loss function augmented by the random nearest neighbor distance-based anomaly measure defined in Eq. (26). Minimizing the loss in Eq. (25) encourages the random nearest neighbor distance of anomalies to be at least $m$ greater than that of normal instances in the $\phi$-based representation space. At the evaluation stage, the random distance in Eq. (26) is used directly to obtain the anomaly score for each test instance. Following this approach, we might also derive similar representation learning tailored for other distance-based measures by replacing Eq. (26) with the other measures, such as the nearest neighbor distance (Ramaswamy et al., 2000b) or the average nearest neighbor distance (Angiulli and Pizzuti, 2002). However, these measures are significantly more computationally costly than the random nearest neighbor distance. Thus, one major challenge for such adaptations would be the prohibitively high computational cost.
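The random nearest neighbor distance of Eq. (26) and the pairwise hinge loss of Eq. (25) can be sketched as below. This is a minimal sketch under assumptions: `phi_good` (keep the first two coordinates) and `phi_collapsed` (map everything to one point) are hypothetical representation functions, the pseudo labels are given rather than produced by an off-the-shelf detector, and the subsample `S` is drawn from the pseudo-normal instances only to keep the example deterministic in spirit.

```python
import numpy as np

def rnn_distance(x, S, phi):
    """Eq. (26): distance from x to its nearest neighbor in the random subsample S."""
    zx = phi(x)
    return min(float(np.linalg.norm(zx - phi(s))) for s in S)

def query_loss(anomalies, normals, S, phi, m=1.0):
    """Eq. (25): hinge loss over (pseudo-anomaly, pseudo-normal) pairs,
    normalized by the total number of instances."""
    total = 0.0
    for a in anomalies:
        for n in normals:
            total += max(0.0, m + rnn_distance(n, S, phi) - rnn_distance(a, S, phi))
    return total / (len(anomalies) + len(normals))


rng = np.random.default_rng(0)
normals = rng.normal(0.0, 0.1, size=(20, 5))             # pseudo-labeled normal instances
anomalies = np.array([[3.0] * 5, [-3.0] * 5])            # pseudo-labeled anomalies
S = normals[rng.choice(len(normals), size=8, replace=False)]

phi_good = lambda x: x[:2]                # hypothetical representation that keeps the separation
phi_collapsed = lambda x: 0.0 * x[:2]     # degenerate representation mapping all points together

loss_good = query_loss(anomalies, normals, S, phi_good)
loss_collapsed = query_loss(anomalies, normals, S, phi_collapsed)
score_anomaly = rnn_distance(anomalies[0], S, phi_good)
score_normal = rnn_distance(normals[0], S, phi_good)
```

A representation that preserves the margin incurs zero loss, whereas the collapsed one is penalized for every anomaly-normal pair. At test time `rnn_distance` itself serves as the anomaly score.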
Compared to (Pang et al., 2018a), which requires querying the nearest neighbor distances in random data subsets, a simpler idea explored in (Wang et al., 2020) and inspired by (Burda et al., 2019b) uses the distance between optimized representations and randomly projected representations of the same instances to guide the representation learning. The objective of the method is as follows:
$$\Theta^* = \arg\min_{\Theta} \sum_{\mathbf{x}\in\mathcal{X}} f\big(\phi(\mathbf{x}; \Theta), \phi'(\mathbf{x})\big), \tag{27}$$
where $\phi'$ is a random mapping function that is instantiated by the neural network used in $\phi$ with fixed random weights, and $f$ is a measure of distance between the two representations of the same data instance. As discussed in (Burda et al., 2019b), solving Eq. (27) is equivalent to performing knowledge distillation from a random neural network and helps learn the frequency of different underlying patterns in the data. However, Eq. (27) ignores the relative proximity between data instances and is sensitive to the anomalies present in the data. To address these two issues, an additional loss function that aims to predict the distance between random instance pairs is added to Eq. (27), and a boosting process is used during training to iteratively filter potential anomalies and retrain the representation learning model. At the evaluation stage, $f\big(\phi(\mathbf{x}; \Theta^*), \phi'(\mathbf{x})\big)$ is used to compute the anomaly scores.
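The distillation idea behind Eq. (27) can be illustrated with linear maps. This is a deliberately simplified sketch, not the method of (Wang et al., 2020): both the learned mapping and the frozen random mapping are linear (hypothetical matrices `A_T` and `B`), the fit is done in closed form by least squares instead of gradient-based training, and the additional pairwise-distance loss and boosting step are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 3

# Normal training data lies on a 2-D subspace of the 5-D input space.
basis = rng.normal(size=(2, d))
X_train = rng.normal(size=(200, 2)) @ basis

# Frozen random mapping phi' (weights are never updated).
B = rng.normal(size=(k, d))

# Learned mapping phi: minimum-norm least-squares fit of phi(x) to phi'(x)
# on the training data, standing in for minimizing Eq. (27).
A_T, *_ = np.linalg.lstsq(X_train, X_train @ B.T, rcond=1e-10)

def distillation_score(x):
    """Distance between the learned and the random representation of x."""
    return float(np.linalg.norm(x @ A_T - x @ B.T))

x_normal = np.array([0.3, -1.2]) @ basis       # same subspace as the training data
x_anomaly = x_normal + rng.normal(size=d)      # leaves the subspace

s_normal = distillation_score(x_normal)
s_anomaly = distillation_score(x_anomaly)
```

The learned mapping only matches the random one on patterns seen during training, so an instance with an unseen component is mapped differently by the two and receives a larger score.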
Advantages. The advantages of this category of methods are as follows. (i) The distance-based anomalies are straightforward and well defined with rich theoretical support in the literature. Thus, deep distance-based anomaly detection methods can be well grounded due to the strong foundation built in previous relevant work. (ii) They work in low-dimensional representation spaces and can effectively deal with high-dimensional data that traditional distance-based anomaly measures fail to handle. (iii) They are able to learn representations specifically tailored for themselves.
Disadvantages. Their disadvantages are as follows. (i) The extensive computation involved in most distance-based anomaly measures may be an obstacle to incorporating them into the representation learning process. (ii) Their capabilities may be limited by the inherent weaknesses of the distance-based anomaly measures.
Challenges Targeted. This approach is able to learn low-dimensional representations tailored for existing distance-based anomaly measures, addressing the notorious curse of dimensionality in distance-based detection (Zimek et al., 2012) (CH1 & CH2). As shown in (Pang et al., 2018a), an adapted triplet loss can be devised to utilize a few labeled anomaly examples to learn more effective normality representations (CH3). Benefiting from pseudo anomaly labeling, the methods (Pang et al., 2018a; Wang et al., 2020) are also robust to potential anomaly contamination and work effectively in the fully unsupervised setting (CH4).
5.2.2. One-class Classification-based Measure
This category of methods aims to learn feature representations that are customized to subsequent one-class classification-based anomaly detection measures. One-class classification refers to the problem of learning a description of a set of data instances to detect whether new instances conform to the training data or not. It is one of the most popular approaches for anomaly detection (Moya et al., 1993; Schölkopf et al., 2001; Tax and Duin, 2004; Roth, 2005). Most one-class classification models are inspired by Support Vector Machines (SVM) (Cortes and Vapnik, 1995), such as the two widely-used one-class models: one-class SVM (or v-SVC) (Schölkopf et al., 2001) and Support Vector Data Description (SVDD) (Tax and Duin, 2004). One main research line here is to learn representations that are specifically optimized for these traditional one-class classification models such as one-class SVM and SVDD. This is the focus of this section. Another line is to learn an end-to-end adversarial one-class classification model, which will be discussed in Section 6.4.
Assumption. All normal instances come from a single (abstract) class and can be summarized by a compact model, to which anomalies do not conform.
There are a number of studies dedicated to combining one-class SVM with neural networks (Wu et al., 2019; Nguyen and Vien, 2018; Chalapathy et al., 2018). Conventional one-class SVM learns a hyperplane that maximizes the margin between training data instances and the origin. The key idea of deep one-class SVM is to learn the one-class hyperplane from the neural network-enabled low-dimensional representation space rather than the original input space. Let $\mathbf{z} = \phi(\mathbf{x}; \Theta)$, then a generic formulation of the key ideas in (Wu et al., 2019; Nguyen and Vien, 2018; Chalapathy et al., 2018) can be represented as
$$\min_{r,\,\Theta,\,\mathbf{w}}\; \frac{1}{2}\|\Theta\|^2 + \frac{1}{vN}\sum_{i=1}^{N}\max\big\{0,\; r - \mathbf{w}^{\top}\mathbf{z}_i\big\} - r, \tag{28}$$
where $r$ is the margin parameter, $\Theta$ are the parameters of a representation network, and $\mathbf{w}^{\top}\mathbf{z}$ (i.e., $\mathbf{w}^{\top}\phi(\mathbf{x}; \Theta)$) replaces the original dot product $\langle\mathbf{w}, \Phi(\mathbf{x})\rangle$ that satisfies $k(\mathbf{x}_i, \mathbf{x}_j) = \langle\Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j)\rangle$. Here $\Phi$ is a RKHS (Reproducing Kernel Hilbert Space) associated mapping and $k(\cdot,\cdot)$ is a kernel function; $v$ is a hyperparameter that can be seen as an upper bound of the fraction of the anomalies in the training data. Any instances that have $r - \mathbf{w}^{\top}\mathbf{z}_i > 0$ can be reported as anomalies.
This formulation brings two main benefits: (i) it can leverage (pretrained) deep neural networks to learn more expressive features for downstream anomaly detection, and (ii) it also helps remove the computationally expensive pairwise distance computation in the kernel function. As shown in (Wu et al., 2019; Nguyen and Vien, 2018), the reconstruction loss in AEs can be added into Eq. (28) to enhance the expressiveness of the learned representations $\mathbf{z}$. As shown in (Rahimi and Recht, 2008), many kernel functions can be approximated with random Fourier features. Motivated by this, before $\mathbf{w}^{\top}\mathbf{z}$, a further mapping $h$ may be applied to $\mathbf{z}$ to generate random Fourier features, resulting in $\mathbf{w}^{\top}h(\mathbf{z})$, which may help achieve a better one-class SVM model.
Another research line studies deep learning models for SVDD (Ruff et al., 2018, 2020). SVDD aims to learn a minimum hypersphere characterized by a center $\mathbf{c}$ and a radius $r$ so that the sphere contains all training data instances, i.e.,
$$\min_{r,\,\mathbf{c},\,\boldsymbol{\xi}}\; r^2 + \frac{1}{vN}\sum_{i=1}^{N}\xi_i, \tag{29}$$
$$\text{s.t.}\;\; \|\Phi(\mathbf{x}_i) - \mathbf{c}\|^2 \le r^2 + \xi_i,\;\; \xi_i \ge 0,\;\; \forall i. \tag{30}$$
Similar to deep one-class SVM, Deep SVDD (Ruff et al., 2018) also aims to leverage neural networks to map data instances into the sphere of minimum volume, and then employs the hinge loss function to guarantee the margin between the sphere center and the projected instances. The feature learning and the SVDD objective can then be jointly trained by minimizing the following loss:
$$\min_{r,\,\Theta}\; r^2 + \frac{1}{vN}\sum_{i=1}^{N}\max\big\{0,\; \|\phi(\mathbf{x}_i; \Theta) - \mathbf{c}\|^2 - r^2\big\} + \frac{\lambda}{2}\|\Theta\|^2. \tag{31}$$
This assumes the training data contains a small proportion of anomaly contamination in the unsupervised setting. In the semi-supervised setting, the loss function can be simplified as
$$\min_{\Theta}\; \frac{1}{N}\sum_{i=1}^{N}\|\phi(\mathbf{x}_i; \Theta) - \mathbf{c}\|^2 + \frac{\lambda}{2}\|\Theta\|^2, \tag{32}$$
which directly minimizes the mean distance between the representations of training data instances and the center $\mathbf{c}$. Note that including $\mathbf{c}$ as trainable parameters in Eq. (32) can lead to trivial solutions. It is shown in (Ruff et al., 2018) that $\mathbf{c}$ can be fixed as the mean of the feature representations yielded by performing a single initial forward pass. Deep SVDD can also be further extended to address another semi-supervised setting where a small number of both labeled normal instances and anomalies are available (Ruff et al., 2020). The key idea is to minimize the distance of labeled normal instances to the center while at the same time maximizing the distance of known anomalies to the center. This can be achieved by adding $\frac{\eta}{m}\sum_{j=1}^{m}\big(\|\phi(\mathbf{x}'_j; \Theta) - \mathbf{c}\|^2\big)^{y_j}$ into Eq. (32), where $\mathbf{x}'_j$ is one of the $m$ labeled instances, $\eta$ is a weighting hyperparameter, and $y_j = +1$ when $\mathbf{x}'_j$ is a normal instance and $y_j = -1$ otherwise.
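The two Deep SVDD losses and the fixed-center initialization described above can be sketched on precomputed representations. This is a minimal sketch, not the reference implementation: `Z` stands for the network outputs $\phi(\mathbf{x}_i; \Theta)$, `theta` for the flattened network weights, and scoring a test instance by its squared distance to the center is the usual choice for Deep SVDD but is an assumption here, since the text above does not spell it out.

```python
import numpy as np

def init_center(Z_init):
    """Fix c as the mean representation from one initial forward pass."""
    return Z_init.mean(axis=0)

def sq_dist(Z, c):
    """Squared Euclidean distance of each representation to the center."""
    return ((Z - c) ** 2).sum(axis=1)

def soft_boundary_loss(Z, c, r, theta, nu, lam):
    """Eq. (31): radius, hinge penalty on points outside the sphere, weight decay."""
    hinge = np.maximum(0.0, sq_dist(Z, c) - r ** 2)
    return r ** 2 + hinge.sum() / (nu * len(Z)) + 0.5 * lam * np.sum(theta ** 2)

def one_class_loss(Z, c, theta, lam):
    """Eq. (32): mean squared distance to the fixed center plus weight decay."""
    return sq_dist(Z, c).mean() + 0.5 * lam * np.sum(theta ** 2)

def svdd_score(z, c):
    """Assumed anomaly score: squared distance of a representation to the center."""
    return float(((np.asarray(z) - c) ** 2).sum())


Z = np.array([[0.0, 0.0], [2.0, 0.0]])   # toy representations of two training instances
theta = np.zeros(3)                      # placeholder network weights
c = init_center(Z)

loss_one_class = one_class_loss(Z, c, theta, lam=0.1)
loss_soft_r1 = soft_boundary_loss(Z, c, r=1.0, theta=theta, nu=0.5, lam=0.1)
loss_soft_r0 = soft_boundary_loss(Z, c, r=0.0, theta=theta, nu=0.5, lam=0.1)
```

Keeping `c` fixed rather than trainable mirrors the remark above: if the center were optimized together with the weights, mapping every instance onto the center would trivially minimize Eq. (32).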
Advantages. The advantages of this category of methods are as follows. (i) The one-class classification-based anomalies are well studied in the literature and provide a strong foundation for deep one-class classification-based methods. (ii) The representation learning and one-class classification models can be unified to learn tailored and more optimal representations. (iii) They free the users from manually choosing suitable kernel functions in traditional one-class models.
Disadvantages. Their disadvantages are as follows. (i) The one-class models may work ineffectively in datasets with complex distributions within the normal class. (ii) The detection performance is dependent on the one-class classification-based anomaly measures.
Challenges Targeted. This category of methods enhances detection accuracy by learning a lower-dimensional representation space optimized for one-class classification models (CH1 & CH2). A small number of labeled normal and abnormal data can be leveraged by these methods (Ruff et al., 2020) to learn more effective one-class description models, which can detect not only known anomalies but also novel classes of anomaly (CH3).
5.2.3. Clustering-based Measure
Deep clustering-based anomaly detection aims at learning representations so that anomalies clearly deviate from the clusters in the newly learned representation space. Clustering and anomaly detection are naturally tied to each other, so a large number of studies have used clustering results to define anomalies, e.g., cluster size (Jiang et al., 2001), distance to cluster centers (He et al., 2003), distance between cluster centers (Jiang et al., 2006), and cluster membership (Schubert et al., 2017). Gaussian mixture model-based anomaly detection (Mahadevan et al., 2010; Emmott et al., 2013) is also included in this category due to some of its intrinsic relations to clustering, e.g., the likelihood fit in the Gaussian mixture model (GMM) corresponds to an aggregation of the distances of data instances to the centers of the Gaussian clusters/components (Aggarwal, 2017).
Assumptions. Normal instances have stronger adherence to clusters than anomalies.
Deep clustering, which aims to learn feature representations tailored for a specific clustering algorithm, is the most critical component of this anomaly detection method. A number of studies have explored this problem in recent years (Tian et al., 2014; Xie et al., 2016; Yang et al., 2016; Dilokthanakul et al., 2017; Ghasedi Dizaji et al., 2017; Caron et al., 2018; Yang et al., 2019). The main motivation is that the performance of clustering methods is highly dependent on the input data. Learning feature representations specifically tailored for a clustering algorithm helps secure its performance across different datasets (Aljalbout et al., 2018). In general, there are two key intuitions here: (i) good representations enable better clustering, and good clustering results can provide effective supervisory signals for representation learning; and (ii) representations that are optimized for one clustering algorithm are not necessarily useful for other clustering algorithms, due to the differences in the underlying assumptions made by the clustering algorithms.
Deep clustering methods typically consist of two modules: performing clustering in the forward pass and learning representations using the cluster assignment as pseudo class labels in the backward pass. The loss function is often the most critical part, which can be generally formulated as
$$\alpha\,\ell_{clu}\big(f(\phi(\mathbf{x};\Theta);\mathbf{W}),\,y_{\mathbf{x}}\big)+\beta\,\ell_{aux}(\mathcal{X}), \tag{33}$$
where $\ell_{clu}$ is a clustering loss function, within which $\phi$ is the feature learner parameterized by $\Theta$, $f$ is a clustering assignment function parameterized by $\mathbf{W}$ and $y_{\mathbf{x}}$ represents pseudo class labels yielded by the clustering; $\ell_{aux}$ is a non-clustering loss function used to enforce additional constraints on the learned representations; and $\alpha$ and $\beta$ are two hyperparameters to control the importance of the two losses. $\ell_{clu}$ can be instantiated with a $k$-means loss (Xie et al., 2016; Caron et al., 2018), a spectral clustering loss (Tian et al., 2014; Yang et al., 2019), an agglomerative clustering loss (Yang et al., 2016), or a GMM loss (Dilokthanakul et al., 2017), enabling the representation learning for the specific targeted clustering algorithm. $\ell_{aux}$ is often instantiated with an autoencoder-based reconstruction loss (Ghasedi Dizaji et al., 2017; Yang et al., 2019) to learn robust and/or local structure preserved representations, or to prevent collapsing clusters.
After the deep clustering, the cluster assignments in the resulting function $f$ can then be utilized to compute anomaly scores based on (Jiang et al., 2001; He et al., 2003; Jiang et al., 2006; Schubert et al., 2017). However, it should be noted that the deep clustering may be biased by anomalies if the training data is anomaly-contaminated. Therefore, the above methods can be applied to the semi-supervised setting where the training data is composed of normal instances only. In the unsupervised setting, some extra constraints are required in $\ell_{clu}$ and/or $\ell_{aux}$ to eliminate the impact of potential anomalies.
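To make the last step concrete, the sketch below turns cluster assignments into anomaly scores using one of the measures listed earlier, the distance to the nearest cluster center. It assumes the representations have already been learned; the 2-D points, the initial centers and the plain Lloyd iterations in `kmeans` are hypothetical stand-ins for a deep clustering model.

```python
import math

def nearest_center(z, centers):
    """Return (index, distance) of the cluster center closest to z."""
    dists = [math.dist(z, c) for c in centers]
    k = min(range(len(centers)), key=dists.__getitem__)
    return k, dists[k]

def kmeans(points, centers, n_iter=10):
    """Plain Lloyd iterations; stands in for the clustering module."""
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for z in points:
            groups[nearest_center(z, centers)[0]].append(z)
        # Keep the old center if a cluster received no points.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
    return centers

# Hypothetical learned representations: two tight groups plus one stray point.
reps = [[0.0, 0.1], [0.1, 0.0], [-0.1, 0.0],
        [5.0, 5.1], [5.1, 5.0], [4.9, 5.0],
        [2.5, -3.0]]
centers = kmeans(reps, centers=[[0.0, 0.0], [5.0, 5.0]])

# Anomaly score: distance to the nearest cluster center.
scores = [nearest_center(z, centers)[1] for z in reps]
most_anomalous = max(range(len(scores)), key=scores.__getitem__)
```

The stray point still receives the largest score, but it also drags its cluster center toward itself, which is a small-scale picture of the contamination bias noted above.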
The aforementioned deep clustering methods focus on learning optimal clustering results. Although these clustering results are applicable to anomaly detection, the learned representations may not capture the abnormality of anomalies well. It is important to utilize clustering techniques to learn representations so that anomalies have clearly weaker adherence to clusters than normal instances. Some promising results for this type of approach are shown in (Zong et al., 2018; Liao et al., 2018a), which aim to learn representations for a GMM-based model with the representations optimized to highlight anomalies. The general formulation of these two studies is similar to Eq. (33), with the clustering loss $\ell_{clu}$ and the non-clustering loss $\ell_{aux}$ respectively specified as a GMM loss and an autoencoder-based reconstruction loss. However, to learn deviated representations of anomalies, they concatenate some handcrafted features based on the reconstruction errors from the autoencoders with the learned features of the autoencoder and optimize the combined features together. Since the reconstruction error-based handcrafted features capture the data normality, the resulting representations are better suited to anomaly detection than those yielded by other deep clustering methods.
Advantages. The advantages of deep clustering-based methods are as follows. (i) A number of deep clustering methods and theories can be utilized to support the effectiveness and theoretical foundation of anomaly detection. (ii) Compared to traditional clustering-based methods, deep clustering-based methods learn specifically optimized representations that help spot anomalies more easily than in the original data, especially when dealing with intricate data sets.
Disadvantages. Their disadvantages are as follows. (i) The performance of anomaly detection is heavily dependent on the clustering results. (ii) The clustering process may be biased by contaminated anomalies in the training data, which in turn leads to less effective representations.
Challenges Targeted. The clustering-based anomaly measures are applied to newly learned low-dimensional representations of data inputs; when the new representation space preserves sufficient discrimination information, the deep methods can achieve better detection accuracy than that in the original data space (CH1 & CH2). Some clustering algorithms are sensitive to outliers, so the deep clustering and the subsequent anomaly detection can be largely misled when the given data is contaminated by anomalies, but as shown in (Zong et al., 2018), deep clustering using handcrafted features from the reconstruction errors of deep autoencoders may help learn more robust detection models w.r.t. anomaly contamination (CH4).
6. End-to-end Anomaly Score Learning
This research line aims at learning scalar anomaly scores in an end-to-end fashion. Compared to anomaly measure-dependent feature learning, the anomaly scoring in this type of approach is not dependent on existing anomaly measures; it has a neural network that directly learns the anomaly scores. Novel loss functions are often required to drive the anomaly scoring network. Formally, this category of methods aims at learning an end-to-end anomaly score learning network $\tau(\cdot;\Theta):\mathcal{X}\mapsto\mathbb{R}$. The underlying framework can be represented as
$$\Theta^{*}=\arg\min_{\Theta}\sum_{\mathbf{x}\in\mathcal{X}}\ell\big(\tau(\mathbf{x};\Theta)\big), \tag{34}$$
$$s_{\mathbf{x}}=\tau(\mathbf{x};\Theta^{*}). \tag{35}$$
Below we review four main approaches to fulfill the goal in Eqs. (34)-(35): ranking models, prior-driven models, softmax models and end-to-end one-class classification models. The key to this framework is to incorporate order or discriminative information into the anomaly scoring network. These four approaches represent four different perspectives to design this network.
6.1. Ranking Models
This group of methods aims to directly learn a ranking model, such that data instances can be sorted based on an observable ordinal variable associated with the absolute/relative ordering relation of the abnormality. The anomaly scoring neural network is driven by the observable ordinal variable.
Assumptions. There exists an observable ordinal variable that captures some data abnormality.
One research line of this approach is to devise ordinal regression-based loss functions to drive the anomaly scoring neural network (Pang et al., 2019b, 2020). In (Pang et al., 2020), a self-trained deep ordinal regression model is introduced to directly optimize the anomaly scores for unsupervised video anomaly detection. Particularly, it assumes an observable ordinal variable $y\in\{c_1,c_2\}$ with $c_1>c_2$. Let $\tau(\mathbf{x};\Theta)=\eta(\psi(\mathbf{x};\Theta_t);\Theta_s)$, let $\mathcal{A}$ and $\mathcal{N}$ respectively be the pseudo anomaly and normal instance sets, and let $\mathcal{G}=\mathcal{A}\cup\mathcal{N}$; then the objective function is formulated as
$$\arg\min_{\Theta}\sum_{\mathbf{x}\in\mathcal{G}}\ell\big(\tau(\mathbf{x};\Theta),\,y_{\mathbf{x}}\big), \tag{36}$$
where $\ell(\cdot,\cdot)$ is a MSE/MAE-based loss function, $y_{\mathbf{x}}=c_1,\ \forall\mathbf{x}\in\mathcal{A}$ and $y_{\mathbf{x}}=c_2,\ \forall\mathbf{x}\in\mathcal{N}$. Here $y$ takes two scalar ordinal values only, so it is a two-class ordinal regression.
The end-to-end anomaly scoring network takes $\mathcal{A}$ and $\mathcal{N}$ as inputs and learns to optimize the anomaly scores such that the data inputs with behaviors similar to those in $\mathcal{A}$ ($\mathcal{N}$) receive large (small) scores as close to $c_1$ ($c_2$) as possible, resulting in larger anomaly scores assigned to anomalous frames than normal frames.
Due to its superior capability of capturing appearance features of image data, ResNet-50 (He et al., 2016) is used to specify the feature network $\psi$, followed by the anomaly scoring network $\eta$ built with a fully connected two-layer neural network. $\eta$ consists of a hidden layer with 100 units and an output layer with a single linear unit. Similar to (Pang et al., 2018a), $\mathcal{A}$ and $\mathcal{N}$ are initialized by some existing anomaly measures. The anomaly scoring model is then iteratively updated and enhanced in a self-training manner. The MAE-based loss function is employed in Eq. (36) to reduce the negative effects brought by false pseudo labels in $\mathcal{A}$ and $\mathcal{N}$.
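The two-class ordinal regression with an MAE loss can be illustrated with a toy scorer. This is only a sketch under strong simplifications: a one-dimensional linear `score` function replaces the ResNet-50 feature network and the two-layer scoring network, the ordinal targets `C_ANOMALY` and `C_NORMAL` are hypothetical values, and the self-training loop that refreshes the pseudo labels is omitted.

```python
C_ANOMALY, C_NORMAL = 1.0, 0.0  # hypothetical ordinal targets, C_ANOMALY > C_NORMAL

def score(x, w, b):
    """Linear stand-in for the feature network plus scoring network."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def mae_loss(batch, w, b):
    """Mean absolute error between scores and pseudo ordinal labels."""
    return sum(abs(score(x, w, b) - y) for x, y in batch) / len(batch)

def train_step(batch, w, b, lr=0.05):
    """One subgradient step on the MAE loss."""
    grad_w, grad_b = [0.0] * len(w), 0.0
    for x, y in batch:
        diff = score(x, w, b) - y
        sign = (diff > 0) - (diff < 0)  # subgradient of |diff|
        grad_w = [g + sign * xi for g, xi in zip(grad_w, x)]
        grad_b += sign
    n = len(batch)
    return [wi - lr * g / n for wi, g in zip(w, grad_w)], b - lr * grad_b / n

# Pseudo labels would come from an existing anomaly measure; these are made up.
pseudo_anomalies = [[2.0], [2.2]]
pseudo_normals = [[0.0], [0.1], [-0.1], [0.2]]
batch = ([(x, C_ANOMALY) for x in pseudo_anomalies]
         + [(x, C_NORMAL) for x in pseudo_normals])

w, b = [0.0], 0.0
initial_loss = mae_loss(batch, w, b)
for _ in range(200):
    w, b = train_step(batch, w, b)

score_anomalous = score([2.1], w, b)
score_normal = score([0.0], w, b)
```

Because the MAE subgradient has constant magnitude per instance, a single wrong pseudo label pulls on the parameters no harder than a correct one, which is one way to read the robustness argument above.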
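Different from (Pang et al., 2020), which addresses an unsupervised setting, a weakly-supervised setting is assumed in (Pang et al., 2019b; Sultani et al., 2018). A very small number of labeled anomalies, together with large-scale unlabeled data, is assumed to be available during training in (Pang et al., 2019b). To leverage the known anomalies, the anomaly detection problem is formulated as a pairwise relation prediction task. Specifically, a two-stream ordinal regression network is devised to learn the relation of randomly sampled pairs of data instances, i.e., to discriminate whether the instance pair contains two labeled anomalies, one labeled anomaly, or just unlabeled data instances. Let $\mathcal{A}$ be the small labeled anomaly set, $\mathcal{U}$ be the large unlabeled dataset and $\mathcal{X}=\mathcal{A}\cup\mathcal{U}$; a set $\mathcal{P}=\big\{\{\mathbf{x}_i,\mathbf{x}_j,y_{\mathbf{x}_i\mathbf{x}_j}\}\mid\mathbf{x}_i,\mathbf{x}_j\in\mathcal{X}\ \text{and}\ y_{\mathbf{x}_i\mathbf{x}_j}\in\mathbb{N}\big\}$ is first generated. Here $\mathcal{P}$ is a set of random instance pairs with synthetic ordinal class labels, where $y=\{y_{\mathbf{x}_{a_i}\mathbf{x}_{a_j}},\,y_{\mathbf{x}_{a_i}\mathbf{x}_{u_i}},\,y_{\mathbf{x}_{u_i}\mathbf{x}_{u_j}}\}$ is an ordinal variable. The synthetic label $y_{\mathbf{x}_{a_i}\mathbf{x}_{u_i}}$ means an ordinal value for any instance pair with the instances $\mathbf{x}_{a_i}$ and $\mathbf{x}_{u_i}$ respectively sampled from $\mathcal{A}$ and $\mathcal{U}$. The ordering $y_{\mathbf{x}_{a_i}\mathbf{x}_{a_j}}>y_{\mathbf{x}_{a_i}\mathbf{x}_{u_i}}>y_{\mathbf{x}_{u_i}\mathbf{x}_{u_j}}$ is predefined such that the pairwise prediction task is equivalent to anomaly score learning. The method can then be formally framed as
$$\Theta^{*}=\arg\min_{\Theta}\frac{1}{|\mathcal{P}|}\sum_{\{\mathbf{x}_i,\mathbf{x}_j,y_{\mathbf{x}_i\mathbf{x}_j}\}\in\mathcal{P}}\Big|y_{\mathbf{x}_i\mathbf{x}_j}-\tau\big((\mathbf{x}_i,\mathbf{x}_j);\Theta\big)\Big|, \tag{37}$$
which is trainable in an end-to-end fashion. By minimizing Eq. (37), the model is optimized to learn larger anomaly scores for the input pairs that contain two anomalies than the pairs with one anomaly or none. At the evaluation stage, each test instance is paired with instances from $\mathcal{A}$ or $\mathcal{U}$ to obtain the anomaly scores.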
The weakly-supervised setting in (Sultani et al., 2018) addresses frame-level video anomaly detection, but only a set of video-level class labels is available during training, i.e., a video is normal or contains abnormal frames somewhere, but we do not know which specific frames are anomalies. A multiple instance learning (MIL)-based ranking model is introduced in (Sultani et al., 2018) to harness the high-level class labels to directly learn the anomaly score for each video segment (i.e., a small number of consecutive video frames). Its key objective is to guarantee that the maximum anomaly score for the segments in a video that contains anomalies somewhere is greater than the counterpart in a normal video. To achieve this, each video is treated as a bag of instances in MIL, the videos that contain anomalies are treated as positive bags, and the normal videos are treated as negative bags. Each video segment is an instance in the bag. The ordering information of the anomaly scores is enforced as a relative pairwise ranking order via the hinge loss function. The overall objective function is defined as
$$\arg\min_{\Theta}\sum_{\mathcal{B}_p,\mathcal{B}_n\in\mathcal{X}}\max\Big\{0,\,1-\max_{\mathbf{x}\in\mathcal{B}_p}\tau(\mathbf{x};\Theta)+\max_{\mathbf{x}\in\mathcal{B}_n}\tau(\mathbf{x};\Theta)\Big\}+\lambda_1\sum_{i=1}^{|\mathcal{B}_p|-1}\big(\tau(\mathbf{x}_i;\Theta)-\tau(\mathbf{x}_{i+1};\Theta)\big)^{2}+\lambda_2\sum_{\mathbf{x}\in\mathcal{B}_p}\tau(\mathbf{x};\Theta), \tag{38}$$
where $\mathbf{x}$ is a video segment, $\mathcal{B}$ contains a bag of video segments, and $\mathcal{B}_p$ and $\mathcal{B}_n$ respectively represent positive and negative bags. The first term is to guarantee the relative anomaly score order, i.e., the anomaly score of the most abnormal video segment in the positive instance bag is greater than that in the negative instance bag. The second and the last terms are two extra optimization constraints, in which the former enforces score smoothness between consecutive video segments while the latter enforces anomaly sparsity, i.e., each video contains only a few abnormal segments.
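The three terms of the MIL ranking objective (hinge ranking between bag maxima, temporal smoothness, sparsity) can be computed for one pair of bags as follows. This is a sketch of the loss only, assuming per-segment scores are already available; the function name and the default values of `lambda1` and `lambda2` are illustrative assumptions, not values taken from the text.

```python
def mil_ranking_loss(pos_scores, neg_scores, lambda1=8e-5, lambda2=8e-5):
    """Loss for one positive bag and one negative bag of segment scores.

    pos_scores: scores of consecutive segments in a video with anomalies.
    neg_scores: scores of consecutive segments in a normal video.
    """
    # Ranking term: the top positive segment should outscore the top
    # negative segment by a margin of 1.
    hinge = max(0.0, 1.0 - max(pos_scores) + max(neg_scores))
    # Smoothness term: neighbouring segments should have similar scores.
    smoothness = sum((a - b) ** 2 for a, b in zip(pos_scores, pos_scores[1:]))
    # Sparsity term: only a few segments in a video should score high.
    sparsity = sum(pos_scores)
    return hinge + lambda1 * smoothness + lambda2 * sparsity

pos_bag = [0.1, 0.9, 0.2]   # hypothetical segment scores, anomalous video
neg_bag = [0.1, 0.2, 0.1]   # hypothetical segment scores, normal video

ranking_only = mil_ranking_loss(pos_bag, neg_bag, lambda1=0.0, lambda2=0.0)
full_loss = mil_ranking_loss(pos_bag, neg_bag)
```

Only the maxima of the two bags enter the ranking term, which is what lets video-level labels supervise segment-level scores.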
Advantages. The advantages of deep ranking model-based methods are as follows. (i) The anomaly scores can be optimized directly with adapted loss functions. (ii) They are generally free from the definitions of anomalies by imposing a weak assumption of the ordinal order between anomaly and normal instances. (iii) This approach may build upon well-established ranking techniques and theories from areas like learning to rank (Liu et al., 2009; Joachims et al., 2017; Liu et al., 2018c).
Disadvantages. Their disadvantages are as follows. (i) At least some form of labeled anomalies is required in these methods, which may not be applicable to applications where such labeled anomalies are not available. The method in (Pang et al., 2020) is fully unsupervised and obtains some promising performance, but there is still a large gap compared to semi-supervised methods. (ii) Since the models are exclusively fitted to detect the few labeled anomalies, they may not be able to generalize to unseen anomalies that exhibit different abnormal features from the labeled anomalies.
Challenges Targeted: Using weak supervision such as pseudo labels or noisy class labels provides some important knowledge of suspicious anomalies, enabling the learning of a more expressive low-dimensional representation space and better detection accuracy (CH1, CH2). The MIL scheme (Sultani et al., 2018) and the pairwise relation prediction (Pang et al., 2019b) provide an easy way to incorporate coarse-grained/limited anomaly labels into detection model learning (CH3). More importantly, the end-to-end anomaly score learning offers straightforward anomaly explanation by backpropagating the activation weights or the gradient of anomaly scores to locate the features that are responsible for large anomaly scores (Pang et al., 2020) (CH6). In addition, the methods in (Pang et al., 2019b, 2020) also work well in data with anomaly contamination or noisy labels (CH4).
6.2. Prior-driven Models
This approach uses a prior distribution to encode and drive the anomaly score learning. Since the anomaly scores are learned in an end-to-end manner, the prior may be imposed on either the internal module or the learning output (i.e., anomaly scores) of the score learning function $\tau$.
Assumptions. The imposed prior captures the underlying (ab)normality of the dataset.
The incorporation of the prior into the internal anomaly scoring function is exemplified by a recent study on Bayesian inverse reinforcement learning-based sequential anomaly detection (Oh and Iyengar, 2019). The key intuition is that, given an agent that takes a set of sequential data as input, the agent's normal behavior can be understood by its latent reward function, and thus a test sequence is identified as an anomaly if the agent assigns a low reward to the sequence. Inverse reinforcement learning (IRL) approaches (Ng and Russell, 2000) are used to infer the reward function. To learn the reward function more efficiently, a sample-based IRL approach is used. Specifically, the IRL problem is formulated as the following posterior optimization problem
$$\max_{\Theta}\ \mathbb{E}_{\mathbf{s}\sim\mathcal{S}}\big[\log p(\mathbf{s}\mid\Theta)+\log p(\Theta)\big], \tag{39}$$
where $p(\mathbf{s}\mid\Theta)=\frac{1}{Z}\exp\big(\sum_{(\mathbf{o},\mathbf{a})\in\mathbf{s}}\tau_{\Theta}(\mathbf{o},\mathbf{a})\big)$, $\tau_{\Theta}(\mathbf{o},\mathbf{a})$ is a latent reward function parameterized by $\Theta$, $(\mathbf{o},\mathbf{a})$ is a pair of state and action in the sequence $\mathbf{s}$, $Z$ represents the partition function which is the integral of $\exp\big(\sum_{(\mathbf{o},\mathbf{a})\in\mathbf{s}}\tau_{\Theta}(\mathbf{o},\mathbf{a})\big)$ over all the sequences consistent with the underlying Markov decision process dynamics, $p(\Theta)$ is a prior distribution of $\Theta$, and $\mathcal{S}$ is a set of observed sequences. Since the inverse of the reward yielded by $\tau$ is used as the anomaly score, maximizing Eq. (39) is equivalent to directly learning the anomaly scores.
At the training stage, a Gaussian prior distribution over the weight parameters of the reward function learning network is assumed, i.e., $\Theta\sim\mathcal{N}(0,\sigma^{2})$. The partition function $Z$ is estimated using a set of sequences generated by a sample-generating policy $\pi$,
(40) 
The policy $\pi$ is also represented as a neural network. The reward function $\tau$ and the policy $\pi$ are alternately optimized, i.e., the reward function $\tau$ is optimized with a fixed policy $\pi$, and $\pi$ is then optimized with the updated reward function $\tau$. Note that $\tau$ is instantiated with a bootstrap neural network with multiple output heads in (Oh and Iyengar, 2019); Eq. (39) presents a simplified $\tau$ for brevity.
The idea of enforcing a prior on the anomaly scores is explored in (Pang et al., 2019a). Motivated by the extensive empirical results in (Kriegel et al., 2011) that show the anomaly scores in a variety of real-world data sets fit a Gaussian distribution very well, the work uses a Gaussian prior to encode the anomaly scores and enable the direct optimization of the scores. That is, it is assumed that the anomaly scores of normal instances are clustered together while those of anomalies deviate far away from this cluster. The prior is leveraged to define the following loss function, called deviation loss, which is built upon the well-known contrastive loss (Hadsell et al., 2006):
$$dev(\mathbf{x})=\frac{\tau(\mathbf{x};\Theta)-\mu_b}{\sigma_b}, \tag{41}$$
$$\ell\big(\tau(\mathbf{x};\Theta),\mu_b,\sigma_b\big)=(1-y_{\mathbf{x}})\,\big|dev(\mathbf{x})\big|+y_{\mathbf{x}}\max\big(0,\,m-dev(\mathbf{x})\big), \tag{42}$$
where $\mu_b$ and $\sigma_b$ are respectively the estimated mean and the standard deviation of the Gaussian prior $\mathcal{N}(\mu,\sigma)$, $y_{\mathbf{x}}=1$ if $\mathbf{x}$ is an anomaly and $y_{\mathbf{x}}=0$ if $\mathbf{x}$ is a normal object, and $m$ is equivalent to a Z-score confidence interval parameter. $\mu_b$ and $\sigma_b$ are estimated using a set of values $\{r_1,r_2,\dots,r_l\}$ drawn from $\mathcal{N}(\mu,\sigma)$ for each batch of instances to learn robust representations of normality and abnormality.
The detection model is driven by the deviation loss to push the anomaly scores of normal instances as close as possible to $\mu_b$ while guaranteeing at least $m$ standard deviations between $\mu_b$ and the anomaly scores of anomalies. When $\mathbf{x}$ is an anomaly and it has a negative $dev(\mathbf{x})$, the loss would be particularly large, resulting in large positive deviations for all anomalies. As a result, the deviation loss is equivalent to enforcing a statistically significant deviation of the anomaly score of the anomalies from that of normal instances in the upper tail. In addition to the end-to-end anomaly score learning, this Gaussian prior-driven loss also results in well interpretable anomaly scores, i.e., given any anomaly score $\tau(\mathbf{x})$, we can use the Z-score confidence interval to explain the abnormality of the instance $\mathbf{x}$. This is an important and very practical property that existing methods do not have.
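The behaviour of the deviation loss described above can be checked numerically with a small sketch. The standard normal prior, the 5000 reference draws and the margin of 5 are assumptions made for this illustration, and `deviation_loss` is a hypothetical helper name; the scores passed in stand for outputs of a scoring network.

```python
import random
import statistics

def deviation_loss(score, label, reference_scores, margin=5.0):
    """Deviation loss for one instance.

    label is 1 for a labeled anomaly and 0 for a normal (or unlabeled) instance.
    reference_scores are draws from the Gaussian prior.
    """
    mu = statistics.fmean(reference_scores)
    sigma = statistics.stdev(reference_scores)
    dev = (score - mu) / sigma          # Z-score of the anomaly score
    if label == 1:
        # Anomalies must sit at least `margin` standard deviations above mu.
        return max(0.0, margin - dev)
    # Normal instances are pulled toward mu.
    return abs(dev)

random.seed(0)
reference = [random.gauss(0.0, 1.0) for _ in range(5000)]  # assumed N(0, 1) prior

loss_normal_near_mean = deviation_loss(0.0, 0, reference)
loss_anomaly_far_above = deviation_loss(8.0, 1, reference)
loss_anomaly_below_mean = deviation_loss(-2.0, 1, reference)
```

An anomaly scored well above the margin contributes no loss, while an anomaly with a negative deviation is penalized by more than the margin itself, matching the upper-tail behaviour described above.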
Advantages. The advantages of prior-driven models are as follows. (i) The anomaly scores can be directly optimized w.r.t. a given prior. (ii) It provides a flexible framework for incorporating different prior distributions into the anomaly score learning. Different Bayesian deep learning techniques (Wang and Yeung, 2016) may be adapted for anomaly detection. (iii) The prior can also result in more interpretable anomaly scores than the other methods.
Disadvantages. Their disadvantages are as follows. (i) It is difficult, if not impossible, to design a universally effective prior for different anomaly detection application scenarios. (ii) The models may work less effectively if the prior does not fit the underlying distribution well.
Challenges Targeted: The prior empowers the models to learn informed low-dimensional representations of different complex data such as high-dimensional data and sequential data (CH1 & CH2). By imposing a prior over anomaly scores, the deviation network method (Pang et al., 2019a) shows promising performance in leveraging a limited amount of labeled anomaly data to enhance the representations of normality and abnormality, substantially boosting the detection recall (CH1 & CH3). The detection models here are driven by a prior distribution w.r.t. the anomaly scoring function and work well in data with anomaly contamination in the training data (CH4).
6.3. Softmax Models
This type of approach aims at learning anomaly scores by maximizing the likelihood of events in the training data. Since anomaly and normal instances respectively correspond to rare and frequent patterns, from the probabilistic perspective, normal instances are presumed to be high-probability events whereas anomalies are prone to be low-probability events. Therefore, the negative of the event likelihood can be naturally defined as the anomaly score.
Assumptions. Anomalies and normal instances are low- and high-probability events, respectively.
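The idea of learning anomaly scores by directly modeling the event likelihood is introduced in (Chen et al., 2016). Particularly, the problem is framed as
$$\Theta^{*}=\arg\max_{\Theta}\sum_{\mathbf{x}\in\mathcal{X}}\log p(\mathbf{x};\Theta), \tag{43}$$
where $p(\mathbf{x};\Theta)$ is the probability of the instance $\mathbf{x}$ (i.e., an event in the event space) with the parameters $\Theta$ to be learned. To ease the optimization, $p(\mathbf{x};\Theta)$ is modeled with a softmax function:
$$p(\mathbf{x};\Theta)=\frac{\exp\big(\tau(\mathbf{x};\Theta)\big)}{\sum_{\mathbf{x}'\in\mathcal{X}}\exp\big(\tau(\mathbf{x}';\Theta)\big)}, \tag{44}$$
where $\tau(\mathbf{x};\Theta)$ is an anomaly scoring function designed to capture pairwise feature interactions:
$$\tau(\mathbf{x};\Theta)=\sum_{i,j\in\{1,2,\dots,K\}}w_{ij}\,\mathbf{z}_i^{\top}\mathbf{z}_j, \tag{45}$$
where $\mathbf{z}_i$ is a low-dimensional embedding of the $i$th feature value of $\mathbf{x}$ in the representation space $\mathcal{Z}$, and $w_{ij}$ is the weight added to the interaction and is a trainable parameter. Since $\sum_{\mathbf{x}'\in\mathcal{X}}\exp\big(\tau(\mathbf{x}';\Theta)\big)$ is a normalization term, learning the likelihood function $p$ is equivalent to directly optimizing the anomaly scoring function $\tau$. Because the computation of this explicit normalization term is prohibitively costly, the well-established noise-contrastive estimation (NCE) (Gutmann and Hyvärinen, 2010) is used in (Chen et al., 2016) to learn the following approximated likelihood
$$\log p(d=1\mid\mathbf{x};\Theta)+\sum_{j=1}^{k}\log p(d=0\mid\mathbf{x}'_j;\Theta), \tag{46}$$
where, for any instance $\mathbf{u}$, $p(d=1\mid\mathbf{u};\Theta)=\frac{\exp(\tau(\mathbf{u};\Theta))}{\exp(\tau(\mathbf{u};\Theta))+kQ(\mathbf{u})}$ and $p(d=0\mid\mathbf{u};\Theta)=\frac{kQ(\mathbf{u})}{\exp(\tau(\mathbf{u};\Theta))+kQ(\mathbf{u})}$; for each instance $\mathbf{x}$, $k$ noise samples $\mathbf{x}'_{1,\dots,k}\sim Q$ are generated from some synthetic known 'noise' distribution $Q$. In (Chen et al., 2016), a context-dependent method is used to generate the negative samples by univariate extrapolation of the observed instance $\mathbf{x}$.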
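The pairwise-interaction score and its softmax normalization can be sketched on a toy event space. The explicit normalization below is affordable only because the event space has two events; everything here (the embeddings, the unit interaction weights, restricting the sum to unordered pairs, the helper names) is an assumption for illustration rather than the setup of (Chen et al., 2016).

```python
import math
from itertools import combinations

def interaction_score(embeddings, weights):
    """Weighted sum of inner products over unordered pairs of embeddings."""
    total = 0.0
    for i, j in combinations(range(len(embeddings)), 2):
        dot = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
        total += weights[(i, j)] * dot
    return total

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

# Hypothetical events, each described by three 2-D feature-value embeddings.
typical_event = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]   # mutually compatible values
odd_event = [[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0]]      # one conflicting value
weights = {pair: 1.0 for pair in combinations(range(3), 2)}

scores = [interaction_score(e, weights) for e in (typical_event, odd_event)]
probabilities = softmax(scores)

# The negative event likelihood serves as the anomaly score.
anomaly_scores = [-p for p in probabilities]
```

The nested pair loop is also where the cost noted for this family of methods comes from: the number of interactions grows quickly with the number of features and the interaction order.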
The method is primarily designed to detect anomalies in categorical data (Chen et al., 2016). Motivated by this application, a similar objective function is adapted to detect abnormal events in heterogeneous attributed bipartite graphs (Fan et al., 2018). The problem in (Fan et al., 2018) is to detect anomalous paths that span both partitions of the bipartite graph. Therefore, the instance $\mathbf{x}$ in Eq. (45) is a graph path containing a set of heterogeneous graph nodes, with $\mathbf{z}_i$ and $\mathbf{z}_j$ being the representations of every pair of the nodes in the path. To map attributed nodes into the representation space $\mathcal{Z}$, multilayer perceptron networks and autoencoders are respectively applied to the node features and the graph topology.
Advantages. The advantages of softmax model-based methods are as follows. (i) Different types of interactions can be incorporated into the anomaly score learning process. (ii) The anomaly scores are faithfully optimized w.r.t. the specific abnormal interactions we aim to capture.
Disadvantages. Their disadvantages are as follows. (i) The computation of the interactions can be very costly when the number of features/elements in each data instance is large, i.e., we have $O(D^{n})$ time complexity per instance for $n$th-order interactions of $D$ features/elements. (ii) The anomaly score learning is heavily dependent on the quality of the generation of negative samples.
Challenges Targeted: The formulation in this category of methods provides a promising way to learn low-dimensional representations of datasets with heterogeneous data sources (CH2 & CH5). The learned representations often capture more normality/abnormality information from different data sources and thus enable better detection than traditional methods (CH1).
6.4. End-to-end One-class Classification
This category of approach aims to train a one-class classifier that learns to discriminate whether a given instance is normal or not in an end-to-end manner. Different from the methods in Section 5.2.2, this group of methods does not rely on any existing one-class classification measures such as one-class SVM or SVDD; it directly learns a one-class discrimination model. This approach emerges mainly due to the marriage of GANs and the concept of one-class classification, i.e., adversarially learned one-class classification. The key idea is to learn a one-class discriminator of the normal instances so that it discriminates them well from some adversarially generated pseudo anomalies. Note that this approach is also very different from the GAN-based methods in Section 5.1.2 due to two key differences. First, the GAN-based methods aim to learn a generative distribution to maximally approximate the real data distribution, achieving a generative model that captures the normality of the training normal instances well, while the methods in this section aim to optimize a discriminative model to separate training normal instances from adversarially generated fringe instances. Second, the GAN-based methods define the anomaly scores based on the residual between the real instances and the corresponding generated instances, whereas the methods here directly use the discriminative model to classify anomalies, i.e., the discriminator acts as the scoring function $\tau$ in Eq. (34). This section is separated from Sections 5.1.2 and 5.2.2 to highlight the above differences.
Assumptions. Two basic assumptions of this approach are as follows. (i) Data instances that approximate anomalies can be effectively synthesized. (ii) All normal instances can be summarized by a discriminative one-class model.
The idea of adversarially learned one-class classification (ALOCC) is first studied in (Sabokrou et al., 2018). The key idea is to train two deep networks, with one network trained as the one-class model to separate normal instances from anomalies while the other network is trained to enhance the normal instances and generate distorted outliers. The two networks are instantiated and optimized through the GANs approach. The one-class model is built upon the discriminator network $D$ and the generator network is based on a denoising AE (Vincent et al., 2010). The objective of the AE-empowered GAN is defined as
$$\min_{AE}\max_{D}V(D,AE)=\mathbb{E}_{\mathbf{x}\sim p_{\mathcal{X}}}\big[\log D(\mathbf{x})\big]+\mathbb{E}_{\hat{\mathbf{x}}\sim p_{\hat{\mathcal{X}}}}\Big[\log\big(1-D(AE(\hat{\mathbf{x}}))\big)\Big], \tag{47}$$
where $p_{\hat{\mathcal{X}}}$ denotes a data distribution of $\mathcal{X}$ corrupted by a Gaussian noise, i.e., $\hat{\mathbf{x}}=\mathbf{x}+\mathbf{n}$ with $\mathbf{n}\sim\mathcal{N}(\mathbf{0},\sigma^{2}\mathbf{I})$. This objective is jointly optimized with the following data reconstruction error in the AE:
$$\ell_{ae}=\big\|\mathbf{x}-AE(\hat{\mathbf{x}})\big\|^{2}. \tag{48}$$
The intuition in Eq. (47) is that the AE can reconstruct (and even enhance) normal instances well, but it can be confused by input outliers and consequently generates distorted outliers. Through the minimax optimization, the discriminator $D$ learns to discriminate normal instances from the outliers better than when using the original data instances. Thus, $D(AE(\hat{\mathbf{x}}))$ can be directly used to detect anomalies. In (Sabokrou et al., 2018) the outliers are randomly drawn from some classes other than the classes where the normal instances come from.
However, reference outliers beyond the given training data, as used in (Sabokrou et al., 2018), may be unavailable in many domains. Instead of taking random outliers from other datasets, we can generate fringe data instances based on the given training data and use them as negative reference instances to enable the training of the one-class discriminator. This idea is explored in (Zheng et al., 2019; Ngo et al., 2019). One-class adversarial networks (OCAN) is introduced in (Zheng et al., 2019) to leverage the idea of bad GANs (Dai et al., 2017) to generate fringe instances based on the distribution of the normal training data. Unlike conventional generators in GANs, the generator network in bad GANs is trained to generate data instances that are complementary, rather than matching, to the training data. The objective of the complement generator is as follows
$$\min_{\Theta_g}\ -\mathcal{H}(p_{\mathcal{Z}})+\mathbb{E}_{\hat{\mathbf{z}}\sim p_{\mathcal{Z}}}\Big[\log p_{\mathcal{X}}(\hat{\mathbf{z}})\,\mathbb{I}\big[p_{\mathcal{X}}(\hat{\mathbf{z}})>\epsilon\big]\Big]+\Big\|\mathbb{E}_{\hat{\mathbf{z}}\sim p_{\mathcal{Z}}}h(\hat{\mathbf{z}})-\mathbb{E}_{\mathbf{z}\sim p_{\mathcal{X}}}h(\mathbf{z})\Big\|_{2}, \tag{49}$$
where $\mathcal{H}(\cdot)$ is the entropy, $\mathbb{I}[\cdot]$ is an indicator function, $\epsilon$ is a threshold hyperparameter, and $h$ is a feature mapping derived from an intermediate layer of the discriminator. The first two terms are devised to generate low-density instances in the original feature space. However, it is computationally infeasible to obtain the probability distribution of the training data. Instead, the density estimation $p_{\mathcal{X}}(\hat{\mathbf{z}})$ is approximated by the discriminator of a regular GAN. The last term is the widely-used feature matching loss that helps better generate data instances within the original data space. The objective of the discriminator in OCAN is enhanced with an extra conditional entropy term to enable the detection with high confidence:
$$\max_{\Theta_d}\ \mathbb{E}_{\mathbf{x}\sim p_{\mathcal{X}}}\big[\log D(\mathbf{x})\big]+\mathbb{E}_{\hat{\mathbf{z}}\sim p_{\mathcal{Z}}}\big[\log\big(1-D(\hat{\mathbf{z}})\big)\big]+\mathbb{E}_{\mathbf{x}\sim p_{\mathcal{X}}}\big[D(\mathbf{x})\log D(\mathbf{x})\big]. \tag{50}$$
In (Ngo et al., 2019), Fence GAN is introduced with the objective to generate data instances tightly lying at the boundary of the distribution of the training data. This is achieved by introducing two loss functions into the generator that enforce the generated instances to be evenly distributed along a sphere boundary of the training data. Formally, the objective of the generator is defined as
$$\min_{\Theta_g}\ \mathbb{E}_{\mathbf{z}\sim p_{\mathcal{Z}}}\Big[\log\big|\alpha-D(G(\mathbf{z}))\big|\Big]+\beta\,\frac{1}{\mathbb{E}_{\mathbf{z}\sim p_{\mathcal{Z}}}\big\|G(\mathbf{z})-\boldsymbol{\mu}\big\|_{2}}, \tag{51}$$
where $\alpha\in(0,1)$ is a hyperparameter used as a discrimination reference score for the generator to generate the boundary instances and $\boldsymbol{\mu}$ is the center of the generated data instances. The first term is called encirclement loss that enforces the generated instances to have the same discrimination score, ideally resulting in instances tightly enclosing the training data. The second term is called dispersion loss that enforces the generated instances to evenly cover the whole boundary.
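The two generator terms described above, encirclement and dispersion, might be computed for a batch of generated instances as in the sketch below. The exact functional form used here (a log of the absolute gap between each discriminator score and the reference score, plus a weighted reciprocal of the mean distance to the batch center) is an assumption chosen to match that description; `fence_generator_loss`, `eps` and the toy batches are hypothetical.

```python
import math

def fence_generator_loss(generated, disc_scores, alpha=0.5, beta=1.0, eps=1e-8):
    """Encirclement plus dispersion loss for one batch of generated instances.

    generated: list of generated points.
    disc_scores: discriminator outputs in [0, 1] for those points.
    alpha: reference discriminator score that defines the boundary.
    """
    n = len(generated)
    # Encirclement: every generated point should score close to alpha.
    encirclement = sum(math.log(abs(alpha - d) + eps) for d in disc_scores) / n
    # Dispersion: points should spread out around their own center.
    center = [sum(col) / n for col in zip(*generated)]
    mean_dist = sum(math.dist(g, center) for g in generated) / n
    dispersion = 1.0 / (mean_dist + eps)
    return encirclement + beta * dispersion

spread_batch = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
tight_batch = [[0.1, 0.0], [-0.1, 0.0], [0.0, 0.1], [0.0, -0.1]]

on_boundary = fence_generator_loss(spread_batch, [0.5, 0.5, 0.5, 0.5])
off_boundary = fence_generator_loss(spread_batch, [0.9, 0.1, 0.9, 0.1])
collapsed = fence_generator_loss(tight_batch, [0.9, 0.1, 0.9, 0.1])
```

With these toy inputs the loss is lowest when all discriminator scores equal the reference score and the batch is spread out, and it rises when the batch collapses toward its center.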
Some other methods have been introduced to effectively generate the reference instances. For example, uniformly distributed instances can be generated to enforce the normal instances to be distributed uniformly across the latent space (Perera et al., 2019); an ensemble of generators is used in (Liu et al., 2019), with each generator synthesizing boundary instances for one specific cluster of normal instances.
Advantages. The advantages of this category of methods are as follows. (i) Its anomaly classification model is adversarially optimized in an end-to-end fashion. (ii) It can be developed and supported by the rich techniques and theories of adversarial learning and one-class classification.
Disadvantages. Their disadvantages are as follows. (i) It is difficult to guarantee that the generated reference instances resemble the unknown anomalies well. (ii) The instability of GANs may result in generated instances of varying quality, leading to unstable discriminator-based anomaly classification performance. This issue is recently studied in (Zaheer et al., 2020), which shows that the performance of this type of anomaly detector can fluctuate drastically across training steps. (iii) Its applications are limited to semi-supervised anomaly detection scenarios.
Challenges Targeted: The adversarially learned one-class classifiers learn to generate realistic fringe/boundary instances, enabling the learning of expressive low-dimensional normality representations (CH1 & CH2).
7. Algorithms and Datasets
7.1. Representative Algorithms
To gain a more in-depth understanding of methods in this area, we present a summary of key characteristics of representative algorithms from each category of approach in Table 1. Some main observations include: (i) most methods operate in an unsupervised or semi-supervised mode; (ii) deep learning tricks like data augmentation, dropout and pre-training are under-explored; (iii) the network architecture used in most of the methods is not that deep, with a majority of the methods having no more than five network layers; (iv) ReLU or leaky ReLU is the most popular activation function; and (v) deep learning can be leveraged to detect anomalies in different types of input data. The source code of most of these representative algorithms is publicly accessible. We summarize the sources of the code in Table 2 to facilitate access.
| Method | Ref. | Section | Supervision | Objective | DA | DP | PT | Architecture | Activation | # Layers | Loss | Data |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OADA | (Ionescu et al., 2019) | 4 | Semi | Reconstruction | Yes | No | No | AE | ReLU | 3 | MSE | Video |
| Replicator | (Hawkins et al., 2002) | 5.1.1 | Unsup. | Reconstruction | No | No | No | AE | Tanh | 2 | MSE | Tabular |
| RandNet | (Chen et al., 2017) | 5.1.1 | Unsup. | Reconstruction | No | Yes | Yes | AE | ReLU | 3 | MSE | Tabular |
| RDA | (Zhou and Paffenroth, 2017) | 5.1.1 | Semi | Reconstruction | No | No | No | AE | Sigmoid | 2 | MSE | Tabular |
| UODA | (Lu et al., 2017) | 5.1.1 | Semi | Reconstruction | No | No | Yes | AE & RNN | Sigmoid | 4 | MSE | Sequence |
| AnoGAN | (Schlegl et al., 2017) | 5.1.2 | Semi | Generative | No | No | No | Conv. | ReLU | 4 | MAE | Image |
| EBGAN | (Zenati et al., 2018a) | 5.1.2 | Semi | Generative | No | No | No | Conv. & MLP | ReLU/lReLU | 3–4 | GAN | Image & Tabular |
| FFP | (Liu et al., 2018b) | 5.1.3 | Semi | Predictive | Yes | No | Yes | Conv. | ReLU | 10 | MAE/MSE | Video |
| LSA | (Abati et al., 2019) | 5.1.3 | Semi | Predictive | No | No | No | Conv. | lReLU | 4–7 | MSE & KL | Video |
| GT | (Golan and El-Yaniv, 2018) | 5.1.4 | Semi | Classification | Yes | Yes | No | Conv. | ReLU | 10–16 | CE | Image |
| EOutlier | (Wang et al., 2019) | 5.1.4 | Semi | Classification | Yes | Yes | No | Conv. | ReLU | 10 | CE | Image |
| REPEN | (Pang et al., 2018a) | 5.2.1 | Unsup. | Distance | No | No | No | MLP | ReLU | 1 | Hinge | Tabular |
| RDP | (Wang et al., 2020) | 5.2.1 | Unsup. | Distance | No | No | No | MLP | lReLU | 1 | MSE | Tabular |
| AE-1SVM | (Nguyen and Vien, 2018) | 5.2.2 | Unsup. | One-class | No | No | No | AE & Conv. | Sigmoid | 2–5 | Hinge | Tabular & Image |
| DeepOC | (Wu et al., 2019) | 5.2.2 | Semi | One-class | No | No | No | 3D Conv. | ReLU | 5 | Hinge | Video |
| Deep SVDD | (Ruff et al., 2018) | 5.2.2 | Semi | One-class | No | No | Yes | Conv. | lReLU | 3–4 | Hinge | Image |
| Deep SAD | (Ruff et al., 2020) | 5.2.2 | Semi | One-class | No | No | Yes | Conv. & MLP | lReLU | 3–4 | Hinge | Image & Tabular |
| DEC | (Xie et al., 2016) | 5.2.3 | Unsup. | Clustering | No | Yes | Yes | MLP | ReLU | 4 | KL | Image & Tabular |
| DAGMM | (Zong et al., 2018) | 5.2.3 | Unsup. | Clustering | No | Yes | No | AE & MLP | Tanh | 4–6 | Likelihood | Tabular |
| SDOR | (Pang et al., 2020) | 6.1 | Unsup. | Anomaly scores | No | No | Yes | ResNet & MLP | ReLU | 50 + 2 | MAE | Video |
| PReNet | (Pang et al., 2019b) | 6.1 | Weak | Anomaly scores | Yes | No | No | MLP | ReLU | 2–4 | MAE | Tabular |
| MIL | (Sultani et al., 2018) | 6.1 | Weak | Anomaly scores | No | Yes | Yes | 3D Conv. & MLP | ReLU | 18/34 + 3 | Hinge | Video |
| PUP | (Oh and Iyengar, 2019) | 6.2 | Unsup. | Anomaly scores | No | No | No | MLP | ReLU | 3 | Likelihood | Sequence |
| DevNet | (Pang et al., 2019a) | 6.2 | Weak | Anomaly scores | No | No | No | MLP | ReLU | 2–4 | Deviation | Tabular |
| APE | (Chen et al., 2016) | 6.3 | Unsup. | Anomaly scores | No | No | No | MLP | Sigmoid | 3 | Softmax | Tabular |
| AEHE | (Fan et al., 2018) | 6.3 | Unsup. | Anomaly scores | No | No | No | AE & MLP | ReLU | 4 | Softmax | Graph |
| ALOCC | (Sabokrou et al., 2018) | 6.4 | Semi | Anomaly scores | Yes | No | No | AE & CNN | lReLU | 5 | GANs | Image |
| OCAN | (Zheng et al., 2019) | 6.4 | Semi | Anomaly scores | No | No | Yes | LSTM-AE & MLP | ReLU | 4 | GANs | Sequence |
| Fence GAN | (Ngo et al., 2019) | 6.4 | Semi | Anomaly scores | No | Yes | No | Conv. & MLP | lReLU/Sigmoid | 4–5 | GANs | Image & Tabular |
| OCGAN | (Perera et al., 2019) | 6.4 | Semi | Anomaly scores | No | No | No | Conv. | ReLU/Tanh | 3 | GANs | Image |
Table 1. Key characteristics of representative algorithms. DA, DP and PT denote data augmentation, dropout and pre-training, respectively; lReLU denotes leaky ReLU.
| Method | API | Link | Section |
|---|---|---|---|
| RDA (Zhou and Paffenroth, 2017) | TensorFlow | https://git.io/JfYG5 | 5.1.1 |
| AnoGAN (Schlegl et al., 2017) | TensorFlow | https://git.io/JfGgc | 5.1.2 |
| Fast AnoGAN (Schlegl et al., 2019) | TensorFlow | https://git.io/JfZRn | 5.1.2 |
| EBGAN (Zenati et al., 2018a) | Keras | https://git.io/JfGgG | 5.1.2 |
| ALAD (Zenati et al., 2018b) | Keras | https://git.io/JfZ8v | 5.1.2 |
| GANomaly (Akcay et al., 2018) | PyTorch | https://git.io/JfGgn | 5.1.2 |
| FFP (Liu et al., 2018b) | TensorFlow | https://git.io/Jf4pc | 5.1.3 |
| LSA (Abati et al., 2019) | Torch | https://git.io/Jf4pW | 5.1.3 |
| GT (Golan and El-Yaniv, 2018) | Keras | https://git.io/JfZRW | 5.1.4 |
| EOutlier (Wang et al., 2019) | PyTorch | https://git.io/Jf4pl | 5.1.4 |
| REPEN (Pang et al., 2018a) | Keras | https://git.io/JfZRg | 5.2.1 |
| RDP (Wang et al., 2020) | PyTorch | https://git.io/RDP | 5.2.1 |
| AE-1SVM (Nguyen and Vien, 2018) | TensorFlow | https://git.io/JfGgl | 5.2.2 |
| OC-NN (Chalapathy et al., 2018) | Keras | https://git.io/JfGgZ | 5.2.2 |
| Deep SVDD (Ruff et al., 2018) | TensorFlow | https://git.io/JfZRR | 5.2.2 |
| Deep SAD (Ruff et al., 2020) | PyTorch | https://git.io/JfOkr | 5.2.2 |
| DAGMM (Zong et al., 2018) | PyTorch | https://git.io/JfZR0 | 5.2.3 |
| MIL (Sultani et al., 2018) | Keras | https://git.io/JfZRz | 6.1 |
| DevNet (Pang et al., 2019a) | Keras | https://git.io/JfZRw | 6.2 |
| ALOCC (Sabokrou et al., 2018) | TensorFlow | https://git.io/Jf4p4 | 6.4 |
| OCAN (Zheng et al., 2019) | TensorFlow | https://git.io/JfYGb | 6.4 |
| Fence GAN (Ngo et al., 2019) | Keras | https://git.io/Jf4pR | 6.4 |
| OCGAN (Perera et al., 2019) | MXNet | https://git.io/Jf4p0 | 6.4 |
Table 2. Links to access 23 open-source algorithms.
7.2. Datasets with Real Anomalies
One main obstacle to the development of anomaly detection is the lack of real-world datasets with real anomalies. For this reason, many studies (e.g., (Zhou and Paffenroth, 2017; Zenati et al., 2018a; Akcay et al., 2018; Golan and El-Yaniv, 2018; Wang et al., 2019; Ruff et al., 2018; Ngo et al., 2019)) evaluate their methods on datasets converted from popular classification data. Such evaluation may fail to reflect the performance of the methods in real-world anomaly detection applications, as the characteristics of anomalies in the converted data can differ from those of real anomalies in practice. We summarize a collection of 21 publicly available real-world datasets with real anomalies in Table 3 to promote performance evaluation on these datasets. The datasets cover a wide range of popular application domains and a variety of data types. Only large-scale and/or high-dimensional complex datasets are included here to provide challenging testbeds for deep anomaly detection methods.
| Domain | Data | Size | Dimension | Anomaly (%) | Type | Reference |
|---|---|---|---|---|---|---|
| Intrusion detection | KDD Cup 99 (Bache and Lichman, 2013) | 4,091–567,497 | 41 | 0.30%–7.70% | Tabular | (Hawkins et al., 2002; Zong et al., 2018; Nguyen and Vien, 2018; Ngo et al., 2019) |
| Intrusion detection | UNSW-NB15 (Moustafa and Slay, 2015) | 257,673 | 49 | 9.71% | Streaming | (Pang et al., 2019b, a) |
| Excitement prediction | KDD Cup 14 | 619,326 | 10 | 6.00% | Tabular | (Pang et al., 2019b, a) |
| Dropout prediction | KDD Cup 15 | 35,091 | 27 | 0.10%–0.40% | Sequence | (Lu et al., 2017) |
| Malicious URLs detection | URL (Ma et al., 2009) | 2.4m | 3.2m | 33.04% | Streaming | (Pang et al., 2018a) |
| Spam detection | Webspam (Webb et al., 2006) | 350,000 | 16.6m | 39.61% | Tabular/text | (Pang et al., 2018a) |
| Fraud detection | Credit-card-fraud (Dal Pozzolo et al., 2017) | 284,807 | 30 | 0.17% | Streaming | (Zheng et al., 2019; Pang et al., 2019b, a) |
| Vandal detection | UMD-Wikipedia (Kumar et al., 2015) | 34,210 | N/A | 50.00% | Sequence | (Zheng et al., 2019) |
| Mutant activity detection | p53 Mutants (Danziger et al., 2009) | 16,772 | 5,408 | 0.48% | Tabular | (Pang et al., 2018a) |
| Internet ads detection | AD (Bache and Lichman, 2013) | 3,279 | 1,555 | 14.00% | Tabular | (Pang et al., 2018a) |
| Disease detection | Thyroid (Bache and Lichman, 2013) | 7,200 | 21 | 7.40% | Tabular | (Pang et al., 2019b, a; Zong et al., 2018; Ruff et al., 2020) |
| Disease detection | Arrhythmia (Bache and Lichman, 2013) | 452 | 279 | 14.60% | Tabular | (Pang et al., 2015; Zong et al., 2018; Ruff et al., 2020) |
| Defect detection | MVTec AD | 5,354 | N/A | 35.26% | Image | (Bergmann et al., 2019) |
| Video surveillance | UCSD Ped 1 (Li et al., 2013) | 14,000 frames | N/A | 28.6% | Video | (Pang et al., 2020; Wu et al., 2019) |
| Video surveillance | UCSD Ped 2 (Li et al., 2013) | 4,560 frames | N/A | 35.9% | Video | (Pang et al., 2020; Wu et al., 2019) |
| Video surveillance | UMN (of Minnesota, 2020) | 7,739 frames | N/A | 15.5%–18.1% | Video | (Pang et al., 2020) |
| Video surveillance | Avenue (Lu et al., 2013) | 30,652 frames | N/A | 12.46% | Video | (Wu et al., 2019) |
| Video surveillance | ShanghaiTech Campus (Liu et al., 2018b) | 317,398 frames | N/A | 5.38% | Video | (Liu et al., 2018b) |
| Video surveillance | UCF-Crime (Sultani et al., 2018) | 1,900 videos (13.8m frames) | N/A | 13 crimes | Video | (Sultani et al., 2018) |
| System log analysis | HDFS Log (Xu et al., 2009) | 11.2m | N/A | 2.90% | Sequence | (Du et al., 2017) |
| System log analysis | OpenStack log | 1.3m | N/A | 7.00% | Sequence | (Du et al., 2017) |
Table 3. 21 publicly available real-world datasets with real anomalies.
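The conversion protocol mentioned in Section 7.2, in which a classification dataset is turned into an anomaly detection benchmark, can be sketched roughly as follows. This is an illustration under stated assumptions rather than the procedure of any cited study: the function names `to_anomaly_benchmark` and `auc_roc`, the toy one-dimensional data, and the distance-to-median scorer are all hypothetical.

```python
import random


def to_anomaly_benchmark(samples, labels, normal_class, anomaly_rate, seed=0):
    """Treat one class as normal (y=0) and downsample all other classes
    so that anomalies (y=1) make up roughly `anomaly_rate` of the result."""
    rng = random.Random(seed)
    normal = [x for x, c in zip(samples, labels) if c == normal_class]
    others = [x for x, c in zip(samples, labels) if c != normal_class]
    # n_anom / (n_norm + n_anom) = rate  =>  n_anom = rate * n_norm / (1 - rate)
    n_anom = min(len(others), round(anomaly_rate * len(normal) / (1.0 - anomaly_rate)))
    anomalies = rng.sample(others, n_anom)
    data = normal + anomalies
    y = [0] * len(normal) + [1] * len(anomalies)
    return data, y


def auc_roc(scores, y):
    """Probability that a random anomaly outscores a random normal instance
    (ties count as 0.5)."""
    pos = [s for s, t in zip(scores, y) if t == 1]
    neg = [s for s, t in zip(scores, y) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy three-class data (hypothetical): class 0 lies near 0, classes 1 and 2 far away.
samples = ([i / 100 for i in range(90)]
           + [5.0 + i / 100 for i in range(30)]
           + [-5.0 - i / 100 for i in range(30)])
labels = [0] * 90 + [1] * 30 + [2] * 30

data, y = to_anomaly_benchmark(samples, labels, normal_class=0, anomaly_rate=0.1)

# A stand-in anomaly scorer: absolute distance to the median of the data.
center = sorted(data)[len(data) // 2]
scores = [abs(x - center) for x in data]
auc = auc_roc(scores, y)
```

Because the "anomalies" here are simply instances of other classes, a high AUC on such a converted benchmark says little about performance on the real anomalies collected in Table 3, which is the concern raised above.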
8. Conclusions and Future Opportunities
In this work we categorize the current deep anomaly detection methods into three high-level categories and 11 fine-grained categories, representing 12 diverse modeling perspectives on harnessing deep learning techniques for anomaly detection. We also discuss how these methods address some notorious anomaly detection challenges to demonstrate the importance of deep anomaly detection. Through such a review, we identify some exciting opportunities as follows.
8.1. Exploring Anomaly-supervisory Signals
Informative supervisory signals are the key for deep anomaly detection to learn expressive representations of normality/abnormality or anomaly scores and to reduce false positives. While a wide range of unsupervised or self-supervised supervisory signals have been explored to learn the representations, as discussed in Section 5.1, a key issue for these formulations is that their objective functions are generic and not optimized specifically for anomaly detection. Anomaly measure-dependent feature learning in Section 5.2 helps address this issue by imposing constraints derived from traditional anomaly measures. However, these constraints can have some inherent limitations, e.g., implicit assumptions in the anomaly measures. It is critical to explore new sources of anomaly-supervisory signals that lie beyond the widely used formulations such as data reconstruction and GANs, and that make weak assumptions on the anomaly distribution. Another possibility is to develop domain-driven anomaly detection by leveraging domain knowledge (Cao et al., 2010), such as application-specific knowledge of anomalies and/or expert rules, as the supervision source.
8.2. Deep Weakly-supervised Anomaly Detection
Deep weakly-supervised anomaly detection aims at leveraging deep neural networks to learn anomaly-informed detection models with some weakly-supervised anomaly signals, e.g., partially/inexactly/inaccurately labeled anomaly data. This labeled data provides important knowledge of anomalies and can be a major driving force to lift detection recall rates (Pang et al., 2018a, 2019b, 2019a; Sultani et al., 2018; Tamersoy et al., 2014). One exciting opportunity is to utilize a small number of accurately labeled anomaly examples to enhance detection models, as such examples are often available in real-world applications, e.g., some intrusions/frauds reported by deployed detection systems or end-users and verified by human experts. However, since anomalies can be highly heterogeneous, there can be unknown/novel anomalies that lie beyond the span of the given anomaly examples. Thus, one important direction here is unknown anomaly detection, in which we aim to build detection models that generalize from the limited labeled anomalies to unknown anomalies. Some recent studies (Pang et al., 2019b; Ruff et al., 2020; Pang et al., 2019a) show that deep detection models are able to learn abnormality that lies beyond the scope of the given anomaly examples. It would be important to further understand and explore the extent of this generalizability and to develop models that further improve detection accuracy.
Detecting anomalies that belong to the same classes as the given anomaly examples can be as important as detecting novel/unknown anomalies. Thus, another important direction is to develop data-efficient anomaly detection or few-shot anomaly detection, in which we aim at learning highly expressive representations of the known anomaly classes given only limited anomaly examples (Pang et al., 2018a, 2019b; Tian et al., 2020; Pang et al., 2019a). It should be noted that the limited anomaly examples may come from different anomaly classes and thus exhibit completely different manifold/class features. This scenario is fundamentally different from general few-shot learning (Wang et al., [n.d.]), in which the limited examples are class-specific and assumed to share the same manifold/class structure. Additionally, as shown in Table 1, the network architectures are mostly not as deep as those in other machine learning tasks. This may be partially due to the limited size of the labeled training data. It is important to explore the possibility of leveraging such small labeled data to learn more powerful detection models with deeper architectures.
Also, inexact or inaccurate (e.g., coarse-grained) anomaly labels are often inexpensive to collect in some applications (Sultani et al., 2018); learning deep detection models with this weak supervision is important in these scenarios.
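To make concrete how a handful of labeled anomalies can enter the training objective, the following is a minimal sketch of a deviation-style loss in the spirit of the "Deviation" loss listed for DevNet (Pang et al., 2019a) in Table 1. The function name `deviation_loss`, the margin of 5, the 5,000 reference draws from a standard Gaussian, and the treatment of unlabeled instances as normal are assumptions made for illustration, not a reproduction of the published implementation.

```python
import random
import statistics


def deviation_loss(score, label, ref_mean, ref_std, margin=5.0):
    """Deviation-style loss for one instance.

    label == 0: unlabeled instance (assumed normal); its score is pulled
                towards the mean of the reference scores.
    label == 1: labeled anomaly; its score is pushed at least `margin`
                standard deviations above the reference mean.
    """
    dev = (score - ref_mean) / ref_std
    if label == 1:
        return max(0.0, margin - dev)
    return abs(dev)


# Reference scores drawn from a prior (assumed here: standard Gaussian).
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(5000)]
ref_mean = statistics.mean(reference)
ref_std = statistics.stdev(reference)

# Hypothetical anomaly scores produced by some scoring network for a mini-batch.
batch_scores = [0.1, -0.3, 4.8, 0.2]
batch_labels = [0, 0, 1, 0]
batch_loss = statistics.mean(
    deviation_loss(s, t, ref_mean, ref_std)
    for s, t in zip(batch_scores, batch_labels)
)
```

Only the few instances with `label == 1` contribute the margin term, so the objective uses the labeled anomalies directly while the bulk of the data is unlabeled.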
8.3. Large-scale Normality Learning
Large-scale unsupervised/self-supervised representation learning has gained tremendous success in enabling downstream learning tasks (Peters et al., 2018; Devlin et al., 2018). This is particularly important for learning tasks in which it is difficult to obtain sufficient labeled data, such as anomaly detection (see Section 2.1). The goal is to learn transferable pre-trained representation models from large-scale unlabeled data in an unsupervised/self-supervised mode and to fine-tune detection models in a semi-supervised mode. The self-supervised classification-based methods in Section 5.1.4 may provide some initial sources of supervision for the normality learning. However, precautions must be taken to ensure that (i) the unlabeled data is free of anomaly contamination and/or (ii) the representation learning methods are robust w.r.t. possible anomaly contamination. This is because most methods in Section 5 implicitly assume that the training data is clean and does not contain any noise/anomaly instances. This robustness is important in both the pre-training and the fine-tuning stages. Additionally, anomalies and datasets in different domains vary significantly, so large-scale normality learning may need to be domain/application-specific.
8.4. Deep Detection of Complex Anomalies
Most existing deep anomaly detection methods focus on point anomalies, showing substantially better performance than traditional methods. However, deep models for conditional/group anomalies have been significantly less explored. Deep learning has superior capability in capturing complex temporal/spatial dependence and in learning representations of a set of unordered data points; it is important to explore whether deep learning could also gain similar advantages over traditional approaches in detecting such complex anomalies. Novel neural network layers or objective functions may be required.
Similar to traditional methods, current deep anomaly detection mainly focuses on single data sources. Multimodal anomaly detection is a largely unexplored research area. It is difficult for traditional approaches to bridge the gap presented by multimodal data. Deep learning has demonstrated tremendous success in learning feature representations from different types of raw data for anomaly detection (Ding et al., 2019; Ionescu et al., 2019; Lu et al., 2017; Pang et al., 2018a; Sabokrou et al., 2018); it is also able to concatenate the representations from different data sources to learn unified representations (Goodfellow et al., 2016), so deep approaches present important opportunities for multimodal anomaly detection.
8.5. Interpretable and Actionable Deep Anomaly Detection
Current deep anomaly detection research mainly focuses on detection accuracy. Interpretable deep anomaly detection and actionable deep anomaly detection are essential for understanding model decisions and results, mitigating any potential bias/risk against human users, and enabling decision-making actions. In recent years, there have been some studies (Angiulli et al., 2009; Duan et al., 2015; Vinh et al., 2016; Angiulli et al., 2017; Siddiqui et al., 2019) that explore anomaly explanation by searching for a subset of features that makes a reported anomaly most abnormal. The abnormal feature selection methods (Pang et al., 2016; Azmandian et al., 2012; Pang et al., 2017) may also be utilized for anomaly explanation purposes. The anomalous feature search in these methods is independent of the anomaly detection method, and thus may be used to explain anomalies identified by any anomaly detection method, including deep models. On the other hand, this model-agnostic approach may render the explanation less useful, because it cannot provide a genuine understanding of the mechanisms underlying specific detection models, resulting in potential modeling bias/risk and weak interpretability and actionability (e.g., quantifying the impact of detected anomalies and mitigation actions). Deep models with an inherent capability to provide anomaly explanation, such as (Pang et al., 2020), are therefore important. To achieve this, methods for deep model explanation (Du et al., 2019) and actionable knowledge discovery (Cao et al., 2010) could be explored with deep anomaly detection models.
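The model-agnostic idea of searching for the feature subset in which a reported anomaly looks most abnormal can be sketched as below. The scoring function is a simplified stand-in chosen for illustration (nearest-neighbour distance in a subspace, normalized by the typical nearest-neighbour distance of the reference data), not the measure used in the cited studies; `nn_distance`, `subspace_outlyingness`, `explain`, and the toy data are hypothetical.

```python
from itertools import combinations
import math


def nn_distance(point, data, feats, skip=None):
    """Distance from `point` to its nearest neighbour in `data`,
    measured only on the features in `feats` (row `skip` is ignored)."""
    return min(
        math.sqrt(sum((point[f] - row[f]) ** 2 for f in feats))
        for i, row in enumerate(data)
        if i != skip
    )


def subspace_outlyingness(query, data, feats):
    """Nearest-neighbour distance of `query`, relative to the average
    nearest-neighbour distance among the reference data in the same subspace."""
    typical = sum(
        nn_distance(row, data, feats, skip=i) for i, row in enumerate(data)
    ) / len(data)
    return nn_distance(query, data, feats) / max(typical, 1e-12)


def explain(query, data, max_size=2):
    """Exhaustively search feature subsets up to `max_size` and return
    (score, subset) for the subset in which `query` is most abnormal."""
    candidates = (
        feats
        for size in range(1, max_size + 1)
        for feats in combinations(range(len(query)), size)
    )
    return max((subspace_outlyingness(query, data, f), f) for f in candidates)


# Toy reference data: features 0 and 1 are perfectly correlated, feature 2 is unrelated.
reference = [(i / 10, i / 10, (7 * i % 11) / 10) for i in range(11)]
# Each value of the query is ordinary on its own; only the (0, 1) combination is unusual.
query = (0.2, 0.8, 0.5)
score, features = explain(query, reference)
```

The search never consults the detector that flagged `query`, which is exactly why such explanations apply to any detector and also why they reveal nothing about a specific deep model's internal mechanism.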
8.6. Novel Applications and Settings
There are some exciting emerging research applications and problem settings to which deep detection methods could be extended. First, out-of-distribution (OOD) detection (Hendrycks and Gimpel, 2017; Lee et al., 2018; Ren et al., 2019) is a closely related area, which detects data instances that are anomalous or significantly different from the instances used in training. This is an essential technique for enabling machine learning systems to deal with instances of novel classes in open-world environments. OOD detection is also an anomaly detection task, but in OOD detection it is generally assumed that fine-grained normal class labels are available during training, and the classification accuracy on these normal classes needs to be retained while performing accurate OOD detection. Second, curiosity learning (Pathak et al., 2017; Burda et al., 2019a; Burda et al., 2019b) aims at learning a bonus reward function in reinforcement learning with sparse rewards. In particular, reinforcement learning algorithms often fail to work in an environment with very sparse rewards. Curiosity learning addresses this problem by augmenting the environment with a bonus reward in addition to the original sparse rewards from the environment. This bonus reward is typically defined based on the novelty or rarity of the states, i.e., the agent receives large bonus rewards if it discovers novel/rare states. Novel/rare states are concepts similar to anomalies. Therefore, it would be interesting to explore how deep anomaly detection could be utilized to enhance this challenging reinforcement learning problem; conversely, there can be opportunities to leverage curiosity learning techniques for anomaly detection, such as the method in (Wang et al., 2020).
Third, most shallow and deep models for anomaly detection assume that the abnormality of data instances is independent and identically distributed (IID), while the abnormality in real applications may exhibit some non-IID characteristics, e.g., the abnormality of different instances/features is interdependent and/or heterogeneous (Pang, 2019). For example, the abnormality of multiple synchronized disease symptoms is mutually reinforced in early detection of diseases. This requires non-IID anomaly detection (Pang, 2019) that is dedicated to learning such non-IID abnormality. This task is crucial in complex scenarios, e.g., where anomalies have only subtle deviations and remain hidden in the data space unless these non-IID abnormality characteristics are considered. Lastly, other interesting applications include detection of adversarial examples (Grosse et al., 2017; Paudice et al., 2018), anti-spoofing in biometric systems (Pérez-Cabo et al., 2019; Fatemifar et al., 2019), and early detection of rare catastrophic events (e.g., financial crisis (Cao and Cao, 2015) and other black swan events (Aven, 2016)).
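The curiosity bonus described in Section 8.6, where rarely visited states earn a larger extra reward, can be illustrated with a simple count-based bonus. This is an assumption chosen for brevity: the cited curiosity methods derive the bonus from prediction errors of learned networks rather than from visit counts, and the names `CountNoveltyBonus` and `augmented_reward` are hypothetical.

```python
from collections import Counter
import math


class CountNoveltyBonus:
    """Bonus that decays as scale / sqrt(number of visits to the state)."""

    def __init__(self, scale=1.0):
        self.scale = scale
        self.visits = Counter()

    def __call__(self, state):
        self.visits[state] += 1
        return self.scale / math.sqrt(self.visits[state])


def augmented_reward(extrinsic, state, bonus):
    """Original (possibly sparse) reward plus the novelty bonus."""
    return extrinsic + bonus(state)


bonus = CountNoveltyBonus(scale=0.5)
trajectory = ["s0", "s0", "s0", "s0", "s1"]  # hypothetical discrete states
rewards = [augmented_reward(0.0, s, bonus) for s in trajectory]
```

Here the environment returns zero reward everywhere, yet the agent still receives a signal: repeated visits to `s0` pay less each time, while the first visit to the rare state `s1` pays the full bonus again. Replacing the visit count with an anomaly score of the state is one way the two research areas could be connected.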
References
 Abati et al. (2019) Davide Abati, Angelo Porrello, Simone Calderara, and Rita Cucchiara. 2019. Latent space autoregression for novelty detection. In CVPR. 481–490.
 Aggarwal (2017) Charu C Aggarwal. 2017. Outlier analysis. Springer.
 Akcay et al. (2018) Samet Akcay, Amir Atapour-Abarghouei, and Toby P Breckon. 2018. GANomaly: Semi-supervised anomaly detection via adversarial training. In ACCV. Springer, 622–637.
 Akoglu et al. (2015) Leman Akoglu, Hanghang Tong, and Danai Koutra. 2015. Graph based anomaly detection and description: a survey. Data Mining and Knowledge Discovery 29, 3 (2015), 626–688.
 Aljalbout et al. (2018) Elie Aljalbout, Vladimir Golkov, Yawar Siddiqui, Maximilian Strobel, and Daniel Cremers. 2018. Clustering with deep learning: Taxonomy and new methods. arXiv preprint arXiv:1801.07648 (2018).
 Andrews et al. (2016) J Andrews, Thomas Tanay, Edward J Morton, and Lewis D Griffin. 2016. Transfer representation-learning for anomaly detection. In PMLR.
 Angiulli et al. (2017) Fabrizio Angiulli, Fabio Fassetti, Giuseppe Manco, and Luigi Palopoli. 2017. Outlying property detection with numerical attributes. Data Mining and Knowledge Discovery 31, 1 (2017), 134–163.
 Angiulli et al. (2009) Fabrizio Angiulli, Fabio Fassetti, and Luigi Palopoli. 2009. Detecting outlying properties of exceptional objects. ACM Transactions on Database Systems 34, 1 (2009), 1–62.
 Angiulli and Pizzuti (2002) Fabrizio Angiulli and Clara Pizzuti. 2002. Fast outlier detection in high dimensional spaces. In PKDD. Springer, 15–27.
 Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein Generative Adversarial Networks. In ICML. 214–223.
 Aven (2016) Terje Aven. 2016. Risk assessment and risk management: Review of recent advances on their foundation. European Journal of Operational Research 253, 1 (2016), 1–13.
 Azmandian et al. (2012) Fatemeh Azmandian, Ayse Yilmazer, Jennifer G Dy, Javed A Aslam, and David R Kaeli. 2012. GPU-accelerated feature selection for outlier detection using the local kernel density ratio. In ICDM. IEEE, 51–60.
 Bache and Lichman (2013) Kevin Bache and Moshe Lichman. 2013. UCI machine learning repository. URL http://archive.ics.uci.edu/ml 901 (2013).
 Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 8 (2013), 1798–1828.
 Bergmann et al. (2019) Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. 2019. MVTec AD – A Comprehensive Real-World Dataset for Unsupervised Anomaly Detection. In CVPR. 9592–9600.
 Boukerche et al. (2020) Azzedine Boukerche, Lining Zheng, and Omar Alfandi. 2020. Outlier Detection: Methods, Models and Classifications. Comput. Surveys (2020).
 Breunig et al. (2000) Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. 2000. LOF: Identifying density-based local outliers. ACM SIGMOD Record 29, 2 (2000), 93–104.
 Burda et al. (2019a) Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A Efros. 2019a. Large-scale study of curiosity-driven learning. In ICLR.
 Burda et al. (2019b) Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. 2019b. Exploration by random network distillation. In ICLR.
 Campos et al. (2016) Guilherme O Campos, Arthur Zimek, Jörg Sander, Ricardo JGB Campello, Barbora Micenková, Erich Schubert, Ira Assent, and Michael E Houle. 2016. On the evaluation of unsupervised outlier detection: Measures, datasets, and an empirical study. Data Mining and Knowledge Discovery 30, 4 (2016), 891–927.
 Candès et al. (2011) Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. 2011. Robust principal component analysis? J. ACM 58, 3 (2011), 1–37.
 Cao (2015) Longbing Cao. 2015. Coupling learning of complex interactions. Information Processing & Management 51, 2 (2015), 167–186.
 Cao et al. (2010) Longbing Cao, Philip S Yu, Chengqi Zhang, and Yanchang Zhao. 2010. Domain Driven Data Mining. Springer.
 Cao and Cao (2015) Wei Cao and Longbing Cao. 2015. Financial Crisis Forecasting via Coupled Market State Analysis. IEEE Intelligent Systems 30, 2 (2015), 18–25.
 Caron et al. (2018) Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. 2018. Deep clustering for unsupervised learning of visual features. In ECCV. 132–149.
 Chalapathy and Chawla (2019) Raghavendra Chalapathy and Sanjay Chawla. 2019. Deep learning for anomaly detection: A survey. arXiv preprint arXiv:1901.03407 (2019).
 Chalapathy et al. (2018) Raghavendra Chalapathy, Aditya Krishna Menon, and Sanjay Chawla. 2018. Anomaly detection using one-class neural networks. arXiv preprint arXiv:1802.06360 (2018).
 Chandola et al. (2009) Varun Chandola, Arindam Banerjee, and Vipin Kumar. 2009. Anomaly detection: A survey. Comput. Surveys 41, 3 (2009), 15.
 Chen et al. (2017) Jinghui Chen, Saket Sathe, Charu Aggarwal, and Deepak Turaga. 2017. Outlier detection with autoencoder ensembles. In SDM. SIAM, 90–98.
 Chen et al. (2016) Ting Chen, Lu-An Tang, Yizhou Sun, Zhengzhang Chen, and Kai Zhang. 2016. Entity embedding-based anomaly detection for heterogeneous categorical events. In IJCAI. 1396–1403.
 Cortes and Vapnik (1995) Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning 20, 3 (1995), 273–297.
 Creswell et al. (2018) Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A Bharath. 2018. Generative adversarial networks: An overview. IEEE Signal Processing Magazine 35, 1 (2018), 53–65.
 Dai et al. (2017) Zihang Dai, Zhilin Yang, Fan Yang, William W Cohen, and Russ R Salakhutdinov. 2017. Good semi-supervised learning that requires a bad GAN. In NeurIPS. 6510–6520.
 Dal Pozzolo et al. (2017) Andrea Dal Pozzolo, Giacomo Boracchi, Olivier Caelen, Cesare Alippi, and Gianluca Bontempi. 2017. Credit card fraud detection: a realistic modeling and a novel learning strategy. IEEE Transactions on Neural Networks and Learning Systems 29, 8 (2017), 3784–3797.
 Danziger et al. (2009) Samuel A Danziger, Roberta Baronio, Lydia Ho, Linda Hall, Kirsty Salmon, G Wesley Hatfield, Peter Kaiser, and Richard H Lathrop. 2009. Predicting positive p53 cancer rescue regions using Most Informative Positive (MIP) active learning. PLoS Computational Biology 5, 9 (2009).
 Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
 Dilokthanakul et al. (2017) Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo, Matthew CH Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. 2017. Deep unsupervised clustering with gaussian mixture variational autoencoders. In ICLR.
 Ding et al. (2019) Kaize Ding, Jundong Li, Rohit Bhanushali, and Huan Liu. 2019. Deep anomaly detection on attributed networks. In SDM. SIAM, 594–602.
 Doersch (2016) Carl Doersch. 2016. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908 (2016).
 Donahue et al. (2017) Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. 2017. Adversarial feature learning. In ICLR.
 Du et al. (2017) Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. 2017. Deeplog: Anomaly detection and diagnosis from system logs through deep learning. In CCS. 1285–1298.
 Du et al. (2019) Mengnan Du, Ninghao Liu, and Xia Hu. 2019. Techniques for interpretable machine learning. Commun. ACM 63, 1 (2019), 68–77.
 Duan et al. (2015) Lei Duan, Guanting Tang, Jian Pei, James Bailey, Akiko Campbell, and Changjie Tang. 2015. Mining outlying aspects on numeric data. Data Mining and Knowledge Discovery 29, 5 (2015), 1116–1151.
 Emmott et al. (2013) Andrew F Emmott, Shubhomoy Das, Thomas Dietterich, Alan Fern, and Weng-Keen Wong. 2013. Systematic construction of anomaly detection benchmarks from real data. In KDD Workshop. 16–21.
 Erfani et al. (2016) Sarah M Erfani, Sutharshan Rajasegarar, Shanika Karunasekera, and Christopher Leckie. 2016. High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning. Pattern Recognition 58 (2016), 121–134.
 Fan et al. (2018) Shaohua Fan, Chuan Shi, and Xiao Wang. 2018. Abnormal event detection via heterogeneous information network embedding. In CIKM. 1483–1486.
 Fatemifar et al. (2019) Soroush Fatemifar, Shervin Rahimzadeh Arashloo, Muhammad Awais, and Josef Kittler. 2019. Spoofing attack detection by anomaly detection. In ICASSP. IEEE, 8464–8468.
 Ghasedi Dizaji et al. (2017) Kamran Ghasedi Dizaji, Amirhossein Herandi, Cheng Deng, Weidong Cai, and Heng Huang. 2017. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In ICCV. 5736–5745.
 Golan and El-Yaniv (2018) Izhak Golan and Ran El-Yaniv. 2018. Deep anomaly detection using geometric transformations. In NeurIPS. 9758–9769.
 Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep learning. MIT press.
 Gregor et al. (2014) Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. 2014. Deep AutoRegressive Networks. In ICML. 1242–1250.
 Grosse et al. (2017) Kathrin Grosse, Praveen Manoharan, Nicolas Papernot, Michael Backes, and Patrick McDaniel. 2017. On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280 (2017).
 Grubbs (1969) Frank E Grubbs. 1969. Procedures for detecting outlying observations in samples. Technometrics 11, 1 (1969), 1–21.
 Gupta et al. (2013) Manish Gupta, Jing Gao, Charu C Aggarwal, and Jiawei Han. 2013. Outlier detection for temporal data: A survey. IEEE Transactions on Knowledge and Data Engineering 26, 9 (2013), 2250–2267.
 Gutmann and Hyvärinen (2010) Michael Gutmann and Aapo Hyvärinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS. 297–304.
 Hadsell et al. (2006) R. Hadsell, S. Chopra, and Y. LeCun. 2006. Dimensionality Reduction by Learning an Invariant Mapping. In CVPR, Vol. 2. 1735–1742.
 Hasan et al. (2016) Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. 2016. Learning temporal regularity in video sequences. In CVPR. 733–742.
 Hawkins et al. (2002) Simon Hawkins, Hongxing He, Graham Williams, and Rohan Baxter. 2002. Outlier detection using replicator neural networks. In DaWaK.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. 770–778.
 He et al. (2003) Zengyou He, Xiaofei Xu, and Shengchun Deng. 2003. Discovering cluster-based local outliers. Pattern Recognition Letters 24, 9–10 (2003), 1641–1650.
 Hendrycks and Gimpel (2017) Dan Hendrycks and Kevin Gimpel. 2017. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In ICLR.
 Hinton and Salakhutdinov (2006) Geoffrey E Hinton and Ruslan R Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science 313, 5786 (2006), 504–507.
 Hodge and Austin (2004) Victoria Hodge and Jim Austin. 2004. A survey of outlier detection methodologies. Artificial Intelligence Review 22, 2 (2004), 85–126.
 Hsieh et al. (2018) Jun-Ting Hsieh, Bingbin Liu, De-An Huang, Li F Fei-Fei, and Juan Carlos Niebles. 2018. Learning to decompose and disentangle representations for video prediction. In NeurIPS. 517–526.
 Huang et al. (2003) Yian Huang, Wei Fan, Wenke Lee, and Philip S Yu. 2003. Cross-feature analysis for detecting ad-hoc routing anomalies. In ICDCS. IEEE, 478–487.
 Ionescu et al. (2019) Radu Tudor Ionescu, Fahad Shahbaz Khan, Mariana-Iuliana Georgescu, and Ling Shao. 2019. Object-centric autoencoders and dummy anomalies for abnormal event detection in video. In CVPR. 7842–7851.
 Jiang et al. (2001) Mon-Fong Jiang, Shian-Shyong Tseng, and Chih-Ming Su. 2001. Two-phase clustering process for outliers detection. Pattern Recognition Letters 22, 6–7 (2001), 691–700.
 Jiang et al. (2006) Sheng-Yi Jiang, Xiaoyu Song, Hui Wang, Jian-Jun Han, and Qing-Hua Li. 2006. A clustering-based method for unsupervised intrusion detections. Pattern Recognition Letters 27, 7 (2006), 802–810.
 Jiang et al. (2014) Xinwei Jiang, Junbin Gao, Xia Hong, and Zhihua Cai. 2014. Gaussian processes autoencoder for dimensionality reduction. In PAKDD. Springer, 62–73.
 Joachims et al. (2017) Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiased learning-to-rank with biased feedback. In WSDM. 781–789.
 Keller et al. (2012) Fabian Keller, Emmanuel Müller, and Klemens Böhm. 2012. HiCS: High contrast subspaces for density-based outlier ranking. In ICDE. IEEE, 1037–1048.
 Kieu et al. (2019) Tung Kieu, Bin Yang, Chenjuan Guo, and Christian S Jensen. 2019. Outlier detection for time series with recurrent autoencoder ensembles. In IJCAI.
 Knorr and Ng (1999) Edwin M Knorr and Raymond T Ng. 1999. Finding intensional knowledge of distance-based outliers. In VLDB, Vol. 99. 211–222.
 Knorr et al. (2000) Edwin M Knorr, Raymond T Ng, and Vladimir Tucakov. 2000. Distance-based outliers: algorithms and applications. The VLDB Journal 8, 3–4 (2000), 237–253.
 Kriegel et al. (2011) Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek. 2011. Interpreting and unifying outlier scores. In SDM. SIAM, 13–24.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In NeurIPS. 1097–1105.
 Kumar et al. (2015) Srijan Kumar, Francesca Spezzano, and VS Subrahmanian. 2015. Vews: A wikipedia vandal early warning system. In KDD. 607–616.
 Lazarevic and Kumar (2005) Aleksandar Lazarevic and Vipin Kumar. 2005. Feature bagging for outlier detection. In KDD. ACM, 157–166.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
 Lee et al. (2018) Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. 2018. Training confidence-calibrated classifiers for detecting out-of-distribution samples. In ICLR.
 Li et al. (2006) Ping Li, Trevor J Hastie, and Kenneth W Church. 2006. Very sparse random projections. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. 287–296.
 Li et al. (2013) Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. 2013. Anomaly detection and localization in crowded scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 1 (2013), 18–32.
 Liao et al. (2018b) Binbing Liao, Jingqing Zhang, Chao Wu, Douglas McIlwraith, Tong Chen, Shengwen Yang, Yike Guo, and Fei Wu. 2018b. Deep sequence learning with auxiliary information for traffic prediction. In KDD. 537–546.
 Liao et al. (2018a) Weixian Liao, Yifan Guo, Xuhui Chen, and Pan Li. 2018a. A unified unsupervised gaussian mixture variational autoencoder for high dimensional outlier detection. In IEEE Big Data. IEEE, 1208–1217.
 Liu et al. (2012a) Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. 2012a. Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data 6, 1 (2012), 3.
 Liu et al. (2012b) Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. 2012b. Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data 6, 1, Article 3 (2012), 3:1–3:39 pages.
 Liu et al. (2009) Tie-Yan Liu et al. 2009. Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval 3, 3 (2009), 225–331.
 Liu et al. (2018b) Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. 2018b. Future frame prediction for anomaly detection–a new baseline. In CVPR. 6536–6545.
 Liu et al. (2018c) Xialei Liu, Joost van de Weijer, and Andrew D Bagdanov. 2018c. Leveraging unlabeled data for crowd counting by learning to rank. In CVPR. 7661–7669.
 Liu et al. (2018a) Yusha Liu, Chun-Liang Li, and Barnabás Póczos. 2018a. Classifier Two Sample Test for Video Anomaly Detections. In BMVC. 71.
 Liu et al. (2019) Yezheng Liu, Zhe Li, Chong Zhou, Yuanchun Jiang, Jianshan Sun, Meng Wang, and Xiangnan He. 2019. Generative adversarial active learning for unsupervised outlier detection. IEEE Transactions on Knowledge and Data Engineering (2019).
 Lu et al. (2013) Cewu Lu, Jianping Shi, and Jiaya Jia. 2013. Abnormal event detection at 150 fps in matlab. In ICCV. 2720–2727.
 Lu et al. (2017) Weining Lu, Yu Cheng, Cao Xiao, Shiyu Chang, Shuai Huang, Bin Liang, and Thomas Huang. 2017. Unsupervised sequential outlier detection with deep architectures. IEEE Transactions on Image Processing 26, 9 (2017), 4321–4330.
 Luo et al. (2017) Weixin Luo, Wen Liu, and Shenghua Gao. 2017. Remembering history with convolutional lstm for anomaly detection. In 2017 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 439–444.
 Ma et al. (2009) Justin Ma, Lawrence K Saul, Stefan Savage, and Geoffrey M Voelker. 2009. Identifying suspicious URLs: An application of large-scale online learning. In ICML. ACM, 681–688.
 Mahadevan et al. (2010) Vijay Mahadevan, Weixin Li, Viral Bhalodia, and Nuno Vasconcelos. 2010. Anomaly detection in crowded scenes. In CVPR. IEEE, 1975–1981.
 Makhzani and Frey (2014) Alireza Makhzani and Brendan Frey. 2014. K-sparse autoencoders. In ICLR.
 Malhotra et al. (2016) Pankaj Malhotra, Anusha Ramakrishnan, Gaurangi Anand, Lovekesh Vig, Puneet Agarwal, and Gautam Shroff. 2016. LSTM-based encoder-decoder for multi-sensor anomaly detection. arXiv preprint arXiv:1607.00148 (2016).
 Marchi et al. (2015) Erik Marchi, Fabio Vesperini, Felix Weninger, Florian Eyben, Stefano Squartini, and Björn Schuller. 2015. Nonlinear prediction with LSTM recurrent neural networks for acoustic novelty detection. In 2015 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–7.
 Mathieu et al. (2016) Michael Mathieu, Camille Couprie, and Yann LeCun. 2016. Deep multi-scale video prediction beyond mean square error. In ICLR.
 Metz et al. (2017) Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. 2017. Unrolled generative adversarial networks. In ICLR.
 Moustafa and Slay (2015) Nour Moustafa and Jill Slay. 2015. UNSW-NB15: a comprehensive data set for network intrusion detection systems. In Military Communications and Information Systems Conference, 2015. 1–6.
 Moya et al. (1993) Mary M Moya, Mark W Koch, and Larry D Hostetler. 1993. One-class classifier networks for target recognition applications. Technical Report. NASA STI/Recon Technical Report N.
 Ng and Russell (2000) Andrew Y Ng and Stuart J Russell. 2000. Algorithms for Inverse Reinforcement Learning. In ICML. Morgan Kaufmann Publishers Inc., 663–670.
 Ngo et al. (2019) Cuong Phuc Ngo, Amadeus Aristo Winarto, Connie Kou Khor Li, Sojeong Park, Farhan Akram, and Hwee Kuan Lee. 2019. Fence GAN: towards better anomaly detection. arXiv preprint arXiv:1904.01209 (2019).
 Nguyen and Vien (2018) Minh-Nghia Nguyen and Ngo Anh Vien. 2018. Scalable and interpretable one-class SVMs with deep learning and random Fourier features. In ECML-PKDD. Springer, 157–172.
 Noto et al. (2012) Keith Noto, Carla Brodley, and Donna Slonim. 2012. FRaC: a feature-modeling approach for semi-supervised and unsupervised anomaly detection. Data Mining and Knowledge Discovery 25, 1 (2012), 109–133.
 of Minnesota (2020) University of Minnesota. 2020. UMN Unusual Crowd Activity data set. http://mha.cs.umn.edu/Movies/CrowdActivityAll.avi. Accessed: 2020-05-30.
 Oh and Iyengar (2019) Minhwan Oh and Garud Iyengar. 2019. Sequential Anomaly Detection using Inverse Reinforcement Learning. In KDD. 1480–1490.
 Pang (2019) Guansong Pang. 2019. Non-IID outlier detection with coupled outlier factors. Ph.D. Dissertation.
 Pang et al. (2018b) Guansong Pang, Longbing Cao, Ling Chen, Defu Lian, and Huan Liu. 2018b. Sparse modeling-based sequential ensemble learning for effective outlier detection in high-dimensional numeric data. In AAAI. 3892–3899.
 Pang et al. (2016) Guansong Pang, Longbing Cao, Ling Chen, and Huan Liu. 2016. Unsupervised feature selection for outlier detection by modelling hierarchical value-feature couplings. In ICDM. IEEE, 410–419.
 Pang et al. (2017) Guansong Pang, Longbing Cao, Ling Chen, and Huan Liu. 2017. Learning homophily couplings from non-IID data for joint feature selection and noise-resilient outlier detection. In IJCAI. 2585–2591.
 Pang et al. (2018a) Guansong Pang, Longbing Cao, Ling Chen, and Huan Liu. 2018a. Learning Representations of Ultrahigh-dimensional Data for Random Distance-based Outlier Detection. In KDD. 2041–2050.
 Pang et al. (2019b) Guansong Pang, Chunhua Shen, Huidong Jin, and Anton van den Hengel. 2019b. Deep Weakly-supervised Anomaly Detection. arXiv preprint arXiv:1910.13601 (2019).
 Pang et al. (2019a) Guansong Pang, Chunhua Shen, and Anton van den Hengel. 2019a. Deep Anomaly Detection with Deviation Networks. In KDD. 353–362.
 Pang et al. (2015) Guansong Pang, Kai Ming Ting, and David Albrecht. 2015. LeSiNN: Detecting anomalies by identifying least similar nearest neighbours. In ICDM Workshop. IEEE, 623–630.
 Pang et al. (2020) Guansong Pang, Cheng Yan, Chunhua Shen, Anton van den Hengel, and Xiao Bai. 2020. Self-trained Deep Ordinal Regression for End-to-End Video Anomaly Detection. In CVPR. 12173–12182.
 Pathak et al. (2017) Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. 2017. Curiosity-driven Exploration by Self-supervised Prediction. In ICML. 2778–2787.
 Paudice et al. (2018) Andrea Paudice, Luis Muñoz-González, Andras Gyorgy, and Emil C Lupu. 2018. Detection of adversarial training examples in poisoning attacks through anomaly detection. arXiv preprint arXiv:1802.03041 (2018).
 Perera et al. (2019) Pramuditha Perera, Ramesh Nallapati, and Bing Xiang. 2019. OCGAN: One-class novelty detection using GANs with constrained latent representations. In CVPR. 2898–2906.
 Pérez-Cabo et al. (2019) Daniel Pérez-Cabo, David Jiménez-Cabello, Artur Costa-Pazo, and Roberto J López-Sastre. 2019. Deep anomaly detection for generalized face anti-spoofing. In CVPR Workshops.
 Peters et al. (2018) Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL-HLT. 2227–2237.
 Pevnỳ (2016) Tomáš Pevnỳ. 2016. Loda: Lightweight online detector of anomalies. Machine Learning 102, 2 (2016), 275–304.
 Rahimi and Recht (2008) Ali Rahimi and Benjamin Recht. 2008. Random features for large-scale kernel machines. In NeurIPS. 1177–1184.
 Ramaswamy et al. (2000a) Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. 2000a. Efficient algorithms for mining outliers from large data sets. In SIGMOD. 427–438.
 Ramaswamy et al. (2000b) Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. 2000b. Efficient algorithms for mining outliers from large data sets. ACM SIGMOD Record 29, 2 (2000), 427–438.
 Ren et al. (2019) Jie Ren, Peter J Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark Depristo, Joshua Dillon, and Balaji Lakshminarayanan. 2019. Likelihood ratios for out-of-distribution detection. In NeurIPS. 14680–14691.
 Rifai et al. (2011) Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. 2011. Contractive autoencoders: explicit invariance during feature extraction. In ICML. 833–840.
 Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI. Springer, 234–241.
 Rosasco et al. (2004) Lorenzo Rosasco, Ernesto De Vito, Andrea Caponnetto, Michele Piana, and Alessandro Verri. 2004. Are loss functions all the same? Neural Computation 16, 5 (2004), 1063–1076.
 Roth (2005) Volker Roth. 2005. Outlier detection with one-class kernel Fisher discriminants. In NeurIPS. 1169–1176.
 Ruff et al. (2018) Lukas Ruff, Nico Görnitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Robert Vandermeulen, Alexander Binder, Emmanuel Müller, and Marius Kloft. 2018. Deep one-class classification. In ICML. 4390–4399.
 Ruff et al. (2020) Lukas Ruff, Robert A Vandermeulen, Nico Görnitz, Alexander Binder, Emmanuel Müller, Klaus-Robert Müller, and Marius Kloft. 2020. Deep semi-supervised anomaly detection. In ICLR.
 Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. Imagenet large scale visual recognition challenge. International journal of computer vision 115, 3 (2015), 211–252.
 Sabokrou et al. (2018) Mohammad Sabokrou, Mohammad Khalooei, Mahmood Fathy, and Ehsan Adeli. 2018. Adversarially learned one-class classifier for novelty detection. In CVPR. 3379–3388.
 Salimans et al. (2016) Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training gans. In NeurIPS. 2234–2242.
 Schlegl et al. (2019) Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Georg Langs, and Ursula Schmidt-Erfurth. 2019. f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks. Medical Image Analysis 54 (2019), 30–44.
 Schlegl et al. (2017) Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. 2017. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In IPMI. Springer, Cham, 146–157.
 Schölkopf et al. (2001) Bernhard Schölkopf, John C Platt, John Shawe-Taylor, Alex J Smola, and Robert C Williamson. 2001. Estimating the support of a high-dimensional distribution. Neural Computation 13, 7 (2001), 1443–1471.
 Schölkopf et al. (1997) Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. 1997. Kernel principal component analysis. In ICANN. Springer, 583–588.
 Schubert et al. (2017) Erich Schubert, Jörg Sander, Martin Ester, Hans Peter Kriegel, and Xiaowei Xu. 2017. DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Transactions on Database Systems 42, 3 (2017), 1–21.
 Siddiqui et al. (2019) Md Amran Siddiqui, Alan Fern, Thomas G Dietterich, and Weng-Keen Wong. 2019. Sequential feature explanations for anomaly detection. ACM Transactions on Knowledge Discovery from Data 13, 1 (2019), 1–22.
 Simonyan and Zisserman (2015) Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In ICLR.
 Sugiyama and Borgwardt (2013) Mahito Sugiyama and Karsten Borgwardt. 2013. Rapid distance-based outlier detection via sampling. In NeurIPS. 467–475.
 Sultani et al. (2018) Waqas Sultani, Chen Chen, and Mubarak Shah. 2018. Real-world anomaly detection in surveillance videos. In CVPR. 6479–6488.
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In NeurIPS. 3104–3112.
 Tamersoy et al. (2014) Acar Tamersoy, Kevin Roundy, and Duen Horng Chau. 2014. Guilt by association: Large scale malware detection by mining file-relation graphs. In KDD. 1524–1533.
 Tax and Duin (2004) David MJ Tax and Robert PW Duin. 2004. Support vector data description. Machine Learning 54, 1 (2004), 45–66.
 Tenenboim-Chekina et al. (2013) Lena Tenenboim-Chekina, Lior Rokach, and Bracha Shapira. 2013. Ensemble of feature chains for anomaly detection. In MCS. Springer, 295–306.
 Theis et al. (2017) Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár. 2017. Lossy image compression with compressive autoencoders. In ICLR.
 Tian et al. (2014) Fei Tian, Bin Gao, Qing Cui, Enhong Chen, and Tie-Yan Liu. 2014. Learning deep representations for graph clustering. In AAAI. 1293–1299.
 Tian et al. (2020) Yu Tian, Gabriel Maicas, Leonardo Zorron Cheng Tao Pu, Rajvinder Singh, Johan W. Verjans, and Gustavo Carneiro. 2020. Few-Shot Anomaly Detection for Polyp Frames from Colonoscopy. In MICCAI.
 Tudor Ionescu et al. (2017) Radu Tudor Ionescu, Sorina Smeureanu, Bogdan Alexe, and Marius Popescu. 2017. Unmasking the abnormal events in video. In ICCV. 2895–2903.

 Vincent et al. (2010) Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11, Dec (2010), 3371–3408.
 Vinh et al. (2016) Nguyen Xuan Vinh, Jeffrey Chan, Simone Romano, James Bailey, Christopher Leckie, Kotagiri Ramamohanarao, and Jian Pei. 2016. Discovering outlying aspects in large datasets. Data Mining and Knowledge Discovery 30, 6 (2016), 1520–1555.
 Wang et al. (2020) Hu Wang, Guansong Pang, Chunhua Shen, and Congbo Ma. 2020. Unsupervised Representation Learning by Predicting Random Distances. In IJCAI.
 Wang and Yeung (2016) Hao Wang and Dit-Yan Yeung. 2016. Towards Bayesian deep learning: A framework and some existing methods. IEEE Transactions on Knowledge and Data Engineering 28, 12 (2016), 3395–3408.
 Wang et al. (2019) Siqi Wang, Yijie Zeng, Xinwang Liu, En Zhu, Jianping Yin, Chuanfu Xu, and Marius Kloft. 2019. Effective End-to-end Unsupervised Outlier Detection via Inlier Priority of Discriminative Network. In NeurIPS. 5960–5973.
 Wang et al. ([n.d.]) Yaqing Wang, Quanming Yao, James T Kwok, and Lionel M Ni. [n.d.]. Generalizing from a few examples: A survey on few-shot learning. Comput. Surveys ([n. d.]).
 Webb et al. (2006) Steve Webb, James Caverlee, and Calton Pu. 2006. Introducing the Webb Spam Corpus: Using Email Spam to Identify Web Spam Automatically. In CEAS.
 Wu et al. (2019) Peng Wu, Jing Liu, and Fang Shen. 2019. A Deep One-Class Neural Network for Anomalous Event Detection in Complex Scenes. IEEE Transactions on Neural Networks and Learning Systems (2019).

 Xie et al. (2016) Junyuan Xie, Ross Girshick, and Ali Farhadi. 2016. Unsupervised deep embedding for clustering analysis. In ICML. 478–487.
 Xu et al. (2015) Dan Xu, Elisa Ricci, Yan Yan, Jingkuan Song, and Nicu Sebe. 2015. Learning Deep Representations of Appearance and Motion for Anomalous Event Detection. In British Machine Vision Conference.
 Xu et al. (2009) Wei Xu, Ling Huang, Armando Fox, David Patterson, and Michael Jordan. 2009. Online system problem detection by mining patterns of console logs. In ICDM. IEEE, 588–597.
 Yang et al. (2016) Jianwei Yang, Devi Parikh, and Dhruv Batra. 2016. Joint unsupervised learning of deep representations and image clusters. In CVPR. 5147–5156.
 Yang et al. (2019) Xu Yang, Cheng Deng, Feng Zheng, Junchi Yan, and Wei Liu. 2019. Deep spectral clustering using dual autoencoder network. In CVPR. 4066–4075.
 Ye et al. (2019) Muchao Ye, Xiaojiang Peng, Weihao Gan, Wei Wu, and Yu Qiao. 2019. Anopcn: Video anomaly detection via deep predictive coding network. In ACM MM. 1805–1813.
 Yu et al. (2018) Wenchao Yu, Wei Cheng, Charu C Aggarwal, Kai Zhang, Haifeng Chen, and Wei Wang. 2018. Netwalk: A flexible deep embedding approach for anomaly detection in dynamic networks. In KDD. 2672–2681.
 Zaheer et al. (2020) Muhammad Zaigham Zaheer, Jinha Lee, Marcella Astrid, and Seung-Ik Lee. 2020. Old is Gold: Redefining the Adversarially Learned One-Class Classifier Training Paradigm. In CVPR. 14183–14193.
 Zenati et al. (2018a) Houssam Zenati, Chuan Sheng Foo, Bruno Lecouat, Gaurav Manek, and Vijay Ramaseshan Chandrasekhar. 2018a. Efficient GAN-based anomaly detection. arXiv preprint arXiv:1802.06222 (2018).
 Zenati et al. (2018b) Houssam Zenati, Manon Romain, Chuan-Sheng Foo, Bruno Lecouat, and