awesome-anomaly-analysis
Awesome Archives for Anomaly Detection
Anomaly detection, a.k.a. outlier detection, has been a lasting yet active research area in various research communities for several decades. There are still some unique problem complexities and challenges that require advanced approaches. In recent years, deep learning enabled anomaly detection, i.e., deep anomaly detection, has emerged as a critical direction. This paper reviews the research of deep anomaly detection with a comprehensive taxonomy of detection methods, covering advancements in three high-level categories and 11 fine-grained categories of the methods. We review their key intuitions, objective functions, underlying assumptions, advantages and disadvantages, and discuss how they address the aforementioned challenges. We further discuss a set of possible future opportunities and new perspectives on addressing the challenges.
Anomaly detection, a.k.a. outlier detection, refers to the process of detecting data instances that significantly deviate from the majority of data instances. Anomaly detection has been an active research area for several decades, with early exploration dating back as far as the 1960s (Grubbs, 1969). Due to the increasing demand and broader applications in domains such as risk management, compliance, security, financial surveillance, health and medical risk, and AI safety, anomaly detection plays increasingly important roles, highlighted in various communities including data mining, machine learning, computer vision and statistics. In recent years, deep learning has shown tremendous capabilities in learning expressive representations of complex data such as high-dimensional data, temporal data, spatial data and graph data, pushing the boundaries of different learning tasks. Deep learning for anomaly detection, deep anomaly detection for short, aims at learning feature representations or anomaly scores via neural networks for the sake of anomaly detection. A large number of deep anomaly detection methods have been introduced, demonstrating significantly better performance than conventional anomaly detection on addressing challenging detection problems in a variety of real-world applications. This work aims to provide a comprehensive review of this area. We first discuss the problem nature and major challenges of anomaly detection, then systematically review the current deep anomaly detection methods and their capabilities in addressing these challenges, and finally present a number of future opportunities in this area.
Because anomaly detection is a popular area, a number of studies (Hodge and Austin, 2004; Chandola et al., 2009; Aggarwal, 2017; Zimek et al., 2012; Akoglu et al., 2015; Gupta et al., 2013; Boukerche et al., 2020) have been dedicated to the categorization and review of anomaly detection techniques. However, they all focus on conventional anomaly detection methods only. One work closely related to ours is (Chalapathy and Chawla, 2019). It presents a good summary of a number of real-world applications of deep anomaly detection, but only provides very high-level outlines of selected categories of the techniques, from which it is highly difficult, if not impossible, to gain a sense of the approaches taken by the current methods and the intuition behind them. By contrast, to answer why we need deep anomaly detection, this review delineates the formulation of current deep detection methods to gain key insights about their underlying intuitions, inherent capabilities and weaknesses in addressing some largely unsolved challenges in anomaly detection. This forms a deep understanding of the problem nature and the state of the art, and brings about genuine open opportunities.
In summary, this work makes the following five major contributions:
Problem nature and challenges. We discuss some unique problem complexities underlying anomaly detection and the resulting largely unsolved challenges.
Categorization and formulation. We formulate the current deep anomaly detection methods into three principled frameworks: deep learning for generic feature extraction, learning representations of normality, and end-to-end anomaly score learning. A hierarchical taxonomy is presented to categorize the methods based on 11 different modeling perspectives.
Comprehensive literature review. We review a large number of relevant studies in leading conferences and journals of several relevant communities, including machine learning, data mining, computer vision and artificial intelligence, to present a comprehensive literature review of the research progress. To provide an in-depth introduction, we delineate the basic assumptions, objective functions, key intuitions and their capabilities in addressing some of the aforementioned challenges by all categories of the methods.
Future opportunities. We further discuss a set of possible future opportunities and their implications for addressing relevant challenges.
Source codes and datasets. We solicit a collection of publicly accessible source codes of nearly all categories of methods and a large number of real-world datasets with real anomalies to offer some empirical comparison benchmarks.
Owing to its unique nature, anomaly detection presents problem complexities distinct from those of the majority of analytical and learning problems and tasks. This section summarizes such intrinsic complexities and unsolved detection challenges in complex anomaly data.
Unlike those problems and tasks on majority, regular or evident patterns, anomaly detection addresses minority, unpredictable/uncertain and rare events, leading to some unique complexities below that render general deep learning techniques ineffective.
Unknownness. Anomalies are associated with many unknowns, e.g., instances with unknown abrupt behaviors, data structures, and distributions. They remain unknown until they actually occur, such as novel terrorist attacks, frauds and network intrusions.
Heterogeneous anomaly classes. Anomalies are irregular, and thus, one class of anomalies may demonstrate completely different abnormal characteristics from another class of anomalies. For example, in video surveillance, the abnormal events robbery, traffic accidents and burglary are visually highly different.
Rarity and class imbalance. Anomalies are typically rare data instances, contrasting to normal instances that often account for an overwhelming proportion of the data. Therefore, it is difficult, if not impossible, to collect a large amount of labeled abnormal instances. This results in the unavailability of large-scale labeled data in most applications. The class imbalance is also due to the fact that misclassification of anomalies is normally much more costly than that of normal instances.
Diverse types of anomaly. Three completely different types of anomaly have been explored (Chandola et al., 2009). Point anomalies are individual instances that are anomalous w.r.t. the majority of other individual instances, e.g., the abnormal health indicators of a patient. Conditional anomalies, a.k.a. contextual anomalies, also refer to individual anomalous instances but in a specific context, i.e., data instances are anomalous in the specific context and normal otherwise. The contexts can be highly different in real-world applications, e.g., a sudden temperature drop/increase in a particular temporal context, or rapid credit card transactions in unusual spatial contexts. Group anomalies, a.k.a. collective anomalies, are a subset of data instances that is anomalous as a whole w.r.t. the other data instances; the individual members of the collective anomaly may not be anomalies, e.g., exceptionally dense subgraphs formed by fake accounts in a social network are anomalies as a collection, but the individual nodes in those subgraphs can be as normal as real accounts.
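The difference between a point anomaly and a conditional anomaly can be illustrated with a small numerical sketch. The temperatures below are made-up values, and the z-score is used only as a convenient stand-in for an anomaly measure; neither comes from the reviewed studies.

```python
import numpy as np

def zscores(values):
    """Absolute z-score of each value relative to the group it is scored in."""
    values = np.asarray(values, dtype=float)
    return np.abs(values - values.mean()) / values.std()

# Hypothetical daily temperatures (deg C): 10 summer days followed by 10 winter days.
summer = np.array([29.0, 30.0, 31.0, 30.0, 29.0, 31.0, 30.0, 29.0, 31.0, 30.0])
winter = np.array([1.0, 0.0, 2.0, 1.0, 0.0, 2.0, 1.0, 0.0, 2.0, 15.0])  # last day is unusually warm
temps = np.concatenate([summer, winter])

# Scoring without context: 15 deg C sits between the two seasons and looks ordinary.
point_score = zscores(temps)[-1]

# Scoring within the winter context: the same reading stands out clearly.
contextual_score = zscores(winter)[-1]

print(f"global z-score: {point_score:.2f}, winter-only z-score: {contextual_score:.2f}")
```

With these values the global z-score is below 0.1 while the winter-only z-score is about 2.9, so the reading is flagged only when the temporal context is taken into account.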
The above complex problem nature leads to a number of detection challenges to traditional anomaly detection methods and widely-used general deep learning methods. Some challenges, such as scalability w.r.t. data size, have been well addressed in recent years, while the following are largely unsolved, to which deep anomaly detection can play some essential roles.
CH1: Low anomaly detection recall rate. Since anomalies are highly rare and heterogeneous, it is difficult to identify all of the anomalies. Many normal instances are wrongly reported as anomalies while true yet sophisticated anomalies are missed. Although a plethora of anomaly detection methods have been introduced over the years, the current state-of-the-art methods, especially unsupervised methods (e.g., (Breunig et al., 2000; Liu et al., 2012a)), still often incur high false positives on real-world datasets (Campos et al., 2016; Pang et al., 2019a). How to reduce false positives and enhance detection recall rates is one of the most important and yet difficult challenges, particularly given the significant expense of failing to spot anomalies.
CH2: Anomaly detection in high-dimensional and/or non-independent data. Anomalies often exhibit evident abnormal characteristics in a low-dimensional space yet become hidden and unnoticeable in a high-dimensional space. High-dimensional anomaly detection has been a long-standing problem (Zimek et al., 2012). Performing anomaly detection in a reduced lower-dimensional space spanned by a small subset of original features or newly constructed features is a straightforward solution, e.g., in subspace-based (Lazarevic and Kumar, 2005; Liu et al., 2012b; Keller et al., 2012; Pevnỳ, 2016) and feature selection-based methods (Pang et al., 2017; Azmandian et al., 2012; Pang et al., 2017, 2018b). However, identifying intricate (e.g., high-order, nonlinear and heterogeneous) feature interactions and couplings (Cao, 2015) may be essential in high-dimensional data yet remains a major challenge for anomaly detection. Further, how to guarantee that the new feature space preserves the proper information for specific detection methods is critical to accurate downstream anomaly detection, but it is challenging due to the aforementioned unknowns and heterogeneities of anomalies. Also, it is challenging to detect anomalies from instances that may be dependent on each other, such as through temporal, spatial, graph-based and other interdependency relationships (Cao, 2015; Aggarwal, 2017; Akoglu et al., 2015; Gupta et al., 2013).
CH3: Data-efficient learning of normality/abnormality. Due to the difficulty and cost of collecting large-scale labeled anomaly data, fully supervised anomaly detection is often impractical as it assumes the availability of labeled training data with both normal and anomaly classes. In the last decade, major research efforts have been focused on unsupervised anomaly detection that does not require any labeled training data. However, unsupervised methods do not have any prior knowledge of true anomalies. They rely heavily on their assumptions about the distribution of anomalies and fail to work on datasets where those assumptions are violated. On the other hand, it is often not difficult to collect labeled normal data and some labeled anomaly data. In practice, it is often suggested to leverage such readily accessible labeled data as much as possible (Aggarwal, 2017). Thus, utilizing those labeled data to learn expressive representations of normality/abnormality is crucial for accurate anomaly detection.
Semi-supervised anomaly detection, which assumes that there exists a set of labeled training data (see Footnote 1), is a research direction dedicated to this problem. Another research line is weakly-supervised anomaly detection, which assumes we have some labels for anomaly classes yet the class labels are partial/incomplete (i.e., they do not span the entire set of anomaly classes), inexact (i.e., coarse-grained labels), or inaccurate (i.e., some given labels can be incorrect). Two major challenges are how to learn expressive normality/abnormality representations with a small amount of labeled anomaly data and how to learn detection models that generalize to novel anomalies not covered by the given labeled anomaly data.
Footnote 1: Some studies refer to methods trained with purely normal training data as unsupervised approaches. However, this setting is different from the general sense of an unsupervised setting. To avoid unnecessary confusion, following (Chandola et al., 2009; Aggarwal, 2017), these methods are referred to as semi-supervised methods hereafter.
CH4: Noise-resilient anomaly detection. Many weakly/semi-supervised anomaly detection methods assume the given labeled training data is clean, which makes them highly vulnerable to noisy instances that are mistakenly given the opposite class label. In such cases, we may use unsupervised methods instead, but this fails to utilize the genuine labeled data. Additionally, there often exists large-scale anomaly-contaminated unlabeled data. Noise-resilient models can further leverage those unlabeled data for more accurate detection. The main challenge is that the amount of noise can differ significantly across datasets and noisy instances may be irregularly distributed in the data space.
CH5: Detection of complex anomalies. Most existing methods are designed for point anomalies and cannot be used for conditional anomalies and group anomalies, since these exhibit completely different behaviors from point anomalies. One main challenge here is to incorporate the concept of conditional/group anomalies into anomaly measures/models. Also, current methods mainly focus on detecting anomalies from single data sources, while many applications require the detection of anomalies with multiple heterogeneous data sources, e.g., multidimensional data, graph, image, text and audio data. One main challenge is that some anomalies can be detected only when considering two or more data sources.
CH6: Anomaly explanation. In many critical domains there may be some major risks if anomaly detection models are directly used as black-box models. For example, the rare data instances reported as anomalies may lead to possible algorithmic bias against the minority groups presented in the data, such as under-represented groups in fraud detection and crime detection systems. An effective approach to mitigate this type of risk is to have anomaly explanation algorithms that provide straightforward clues about why a specific data instance is identified as an anomaly. Providing such explanations can be as important as detection accuracy in some applications. However, most existing anomaly detection studies focus on devising accurate detection models only, ignoring the capability of explaining the identified anomalies. Deriving anomaly explanations from specific detection methods is still a largely unsolved problem, especially for complex models. Developing inherently interpretable anomaly detection models is also crucial, but it remains a main challenge to balance the model's interpretability and effectiveness well.
Deep neural networks leverage complex compositions of linear/non-linear functions that can be represented by a computational graph to learn expressive representations (Goodfellow et al., 2016). Two basic building blocks of deep learning are activation functions and layers. Activation functions determine the output of computational graph nodes (i.e., neurons in neural networks) given some inputs. They can be linear or non-linear functions. Some popular activation functions include linear, sigmoid, tanh, ReLU (Rectified Linear Unit) and its variants. A layer in neural networks refers to a set of neurons stacked in some form. Commonly-used layers include fully connected, convolutional & pooling, and recurrent layers. These layers can be leveraged to build different popular neural networks. For example, multilayer perceptron (MLP) networks are composed of fully connected layers, convolutional neural networks (CNN) are characterized by varying groups of convolutional & pooling layers, and recurrent neural networks (RNN), e.g., vanilla RNN, gated recurrent units (GRU) and long short-term memory (LSTM), are built upon recurrent layers. See (Goodfellow et al., 2016) for a detailed introduction to these neural networks.
Given a dataset $\mathcal{X} = \{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N\}$ with $\mathbf{x}_i \in \mathbb{R}^{D}$, let $\mathcal{Z} \subseteq \mathbb{R}^{K}$ ($K \ll D$) be a representation space, then deep anomaly detection aims at learning a feature representation mapping function $\phi(\cdot): \mathcal{X} \mapsto \mathcal{Z}$ or an anomaly score learning function $\tau(\cdot): \mathcal{X} \mapsto \mathbb{R}$ in a way that anomalies can be easily differentiated from the normal data instances in the space yielded by the $\phi$ or $\tau$ function, where both $\phi$ and $\tau$ are neural network-enabled mapping functions with $H$ hidden layers and their weight matrices $\Theta = \{\mathbf{M}^{1}, \mathbf{M}^{2}, \cdots, \mathbf{M}^{H}\}$. In the case of learning the feature mapping $\phi(\cdot)$, an additional step is required to calculate the anomaly score of each data instance in the new representation space, while $\tau(\cdot)$ can directly infer the anomaly scores from raw data inputs.
To have a thorough understanding of the area, we introduce a hierarchical taxonomy to classify existing deep anomaly detection methods into three main categories and 11 fine-grained categories from the modeling perspective. An overview of the taxonomy of the methods, together with the challenges they address, is shown in Fig. 1. Specifically, deep anomaly detection consists of three conceptual paradigms: Deep Learning for Feature Extraction, Learning Feature Representations of Normality, and End-to-end Anomaly Score Learning.
The procedure of each of these three frameworks is presented in Fig. 2. As shown in Fig. 2(a), deep learning and anomaly detection are fully separated in the first main category (Section 4), so deep learning techniques are used as some independent feature extractors only. The two modules are dependent on each other in some form in the second main category (Section 5) presented in Fig. 2(b), with an objective of learning expressive representations of normality. This category of methods can be further divided into two subcategories based on whether traditional anomaly measures are incorporated into their objective functions. These two subcategories encompass seven fine-grained categories of methods, with each category taking a different approach to formulate its objective function. The two modules are fully unified in the third main category (Section 6) presented in Fig. 2(c), in which the methods are dedicated to learning anomaly scores via neural networks in an end-to-end fashion. These methods are further grouped into four categories based on the formulation of neural network-enabled anomaly scoring. In the following three sections we review the methods in each of these three categories in detail and discuss how they address some of the aforementioned challenges.
This category of studies represents the most basic application of deep learning techniques to anomaly detection. It aims at leveraging deep learning to extract low-dimensional feature representations from high-dimensional and/or non-linearly separable data for downstream anomaly detection. The feature extraction and the anomaly scoring are fully disjoint and independent of each other. Thus, the deep learning components serve purely as a dimensionality reduction step. Formally, the approach can be represented as
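$$\mathbf{z} = \phi(\mathbf{x}; \Theta), \tag{1}$$
where $\phi: \mathcal{X} \mapsto \mathcal{Z}$ is a deep neural network-based feature mapping function, with $\mathcal{X} \subseteq \mathbb{R}^{D}$, $\mathcal{Z} \subseteq \mathbb{R}^{K}$ and normally $D \gg K$. An anomaly scoring method $f$ that has no connection to the feature mapping $\phi$ is then applied onto the new space to calculate anomaly scores.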
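The decoupled pipeline can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a method from the reviewed studies: a frozen random non-linear projection stands in for a pre-trained deep feature extractor, a k-th nearest neighbour distance stands in for an off-the-shelf anomaly scorer, and the data, the dimensions and the helper names `make_feature_map` and `knn_distance_scores` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_feature_map(d_in, d_out, rng):
    """Return a frozen non-linear map from R^D to R^K.

    Assumption: a fixed random projection is only a stand-in for a
    pre-trained network; its weights are never updated by the scorer.
    """
    W = rng.normal(scale=1.0 / np.sqrt(d_in), size=(d_in, d_out))
    return lambda X: np.tanh(X @ W)

def knn_distance_scores(Z, k=5):
    """Score each row by the distance to its k-th nearest neighbour in Z."""
    dists = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    dists.sort(axis=1)      # column 0 holds the zero distance to itself
    return dists[:, k]

# Synthetic data: 200 normal instances and 5 scattered, high-variance anomalies.
D, K = 50, 8
X_normal = rng.normal(size=(200, D))
X_anomalous = rng.normal(scale=5.0, size=(5, D))
X = np.vstack([X_normal, X_anomalous])

phi = make_feature_map(D, K, rng)   # step 1: feature extraction only
Z = phi(X)
scores = knn_distance_scores(Z)     # step 2: independent anomaly scoring

print("mean score, normal:   ", scores[:200].mean())
print("mean score, anomalous:", scores[200:].mean())
```

Because the scorer never influences the feature map, swapping in a different detector requires no retraining of the extractor, which is also why the resulting scores can be suboptimal.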
Compared to the dimension reduction methods that are popular in anomaly detection, such as principal component analysis (PCA) (Schölkopf et al., 1997; Zou et al., 2006; Candès et al., 2011) and random projection (Li et al., 2006; Pevnỳ, 2016; Pang et al., 2018a), deep learning techniques have been demonstrating substantially better capability in extracting semantic-rich features and non-linear feature relations (Bengio et al., 2013; Goodfellow et al., 2016).
Assumptions. The feature representations extracted by deep learning models preserve the discriminative information that helps separate anomalies from normal instances.
One research line is to directly use popular, effective pre-trained deep learning models, such as AlexNet (Krizhevsky et al., 2012), VGG (Simonyan and Zisserman, 2015) and ResNet (He et al., 2016), to extract low-dimensional features. This line is explored in anomaly detection in complex high-dimensional data such as image data and video data. One interesting work of this line is the unmasking framework for online anomaly detection (Tudor Ionescu et al., 2017). The key idea of the framework is to iteratively train a binary classifier to separate one set of video frames from its subsequent video frames in a sliding window, with the most discriminant features removed in each iteration step. This is analogous to an unmasking process. The framework assumes the first set of video frames to be normal and evaluates its separability from its subsequent video frames. Thus, the training classification accuracy is expected to be high if the subsequent video frames are abnormal, and low otherwise. The unmasking is an anomaly scoring process, with the change of the training accuracy used to define the anomaly scores. Clearly the power of the unmasking framework relies heavily on the quality of the features used to represent the video frames. The VGG model pre-trained on the ILSVRC benchmark (Russakovsky et al., 2015) is shown to be effective in extracting expressive appearance features for this purpose (Tudor Ionescu et al., 2017). In (Liu et al., 2018a), the unmasking framework is formulated as a two-sample test task to understand its theoretical foundation. They also show that using features extracted from a dynamically updated sampling pool of video frames improves the performance of the framework. Additionally, similar to other tasks like classification, the feature representations extracted from the deep models pre-trained on a source dataset can be transferred to fine-tune an anomaly detector on a target dataset. As shown in (Andrews et al., 2016), one-class support vector machines (SVM) can be first initialized with the features extracted from the VGG models pre-trained on the ILSVRC benchmark and then fine-tuned to improve anomaly classification on the MNIST data (LeCun et al., 1998).
Another research line in this category is to explicitly train a deep feature extraction model, rather than use a pre-trained model, for the downstream anomaly scoring (Xu et al., 2015; Erfani et al., 2016; Ionescu et al., 2019; Yu et al., 2018). Particularly, in (Xu et al., 2015), three separate autoencoder networks are trained to learn low-dimensional features for respective appearance, motion, and appearance-motion joint representations for video anomaly detection. Three one-class SVMs are then trained independently, one on each of these learned feature representations, and combined as an ensemble to perform anomaly scoring. Similar to (Xu et al., 2015), a linear one-class SVM is used to enable anomaly detection on low-dimensional representations of high-dimensional tabular data yielded by deep belief networks (DBNs) (Erfani et al., 2016). Instead of one-class SVM, unsupervised classification approaches are used in (Ionescu et al., 2019) to enable anomaly scoring in the projected space. Specifically, they first cluster the low-dimensional features of video frames yielded by convolutional autoencoders and then treat the cluster labels as pseudo class labels and perform one-vs-the-rest classification to calculate the anomaly scores of frames. Similar approaches can also be found in graph anomaly detection (Yu et al., 2018), in which unsupervised clustering-based anomaly measures are used in the latent representation space to calculate the abnormality of graph vertices or edges. To learn expressive representations of graph vertices, the vertex representations are optimized by minimizing autoencoder-based reconstruction loss and pairwise distances of neighboring graph vertices in the representation space, with one-hot encoding of graph vertices as inputs.
Advantages. The advantages of this group of methods are as follows. (i) A large number of state-of-the-art (pre-trained) deep models and off-the-shelf anomaly detection methods are readily available. (ii) Deep feature extraction offers more powerful dimensionality reduction than popular linear methods. (iii) It is easy to implement such methods given the public availability of the deep models and detection methods.
Disadvantages. Their disadvantages are as follows. (i) The fully disjoint feature extraction and anomaly scoring often lead to suboptimal anomaly scores. (ii) Pre-trained deep models are typically limited to specific types of data.
Challenges Targeted. This category of methods projects high-dimensional/non-independent data onto a substantially lower-dimensional space, enabling existing anomaly detection methods to work on a simpler data space. The lower-dimensional space often helps reveal hidden anomalies and reduces false positives (CH2). However, it should be noted that these methods may not preserve sufficient information in the projected space specifically for anomaly detection, as the data projection is fully decoupled from anomaly detection. In addition, this approach allows us to leverage multiple types of features and learn semantic-rich detection models (e.g., various predefined image/video features in (Xu et al., 2015; Tudor Ionescu et al., 2017; Ionescu et al., 2019)), which also helps reduce false positives (CH1).
This section reviews the models from the perspective of normality learning. The deep anomaly detection methods in this category couple feature learning with anomaly scoring in some ways, which are different from the methods in the last section that fully decouple these two modules. These methods generally fall into two groups: generic feature learning and anomaly measure-dependent feature learning. Below we discuss these two types of approaches in detail.
This category of methods learns the representations of data instances by optimizing a generic feature learning objective function that is not primarily designed for anomaly detection, but the learned representations can still empower the anomaly detection since they are forced to capture some key underlying data regularities. Formally, this framework can be represented as
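$$\{\Theta^{*}, \mathbf{W}^{*}\} = \arg\min_{\Theta,\, \mathbf{W}} \sum_{\mathbf{x} \in \mathcal{X}} \ell\Big(\psi\big(\phi(\mathbf{x}; \Theta); \mathbf{W}\big)\Big), \tag{2}$$
$$s_{\mathbf{x}} = f(\mathbf{x}, \phi_{\Theta^{*}}, \psi_{\mathbf{W}^{*}}), \tag{3}$$
where $\phi$ maps the original data onto the representation space $\mathcal{Z}$, $\psi$ parameterized by $\mathbf{W}$ is a surrogate learning task that operates on the $\mathcal{Z}$ space and is dedicated to enforcing the learning of underlying data regularities, $\ell$ is a loss function relative to the underlying modeling approach, and $f$ is an anomaly scoring function that utilizes $\phi$ and $\psi$ with the trained parameters $\Theta^{*}$ and $\mathbf{W}^{*}$ to calculate the anomaly score $s_{\mathbf{x}}$.
This approach includes methods that are driven by several perspectives, including data reconstruction, generative modeling, predictability modeling and self-supervised classification.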
This type of approach aims to learn some low-dimensional feature representation space on which the given data instances can be well reconstructed. This is a widely-used technique for data compression or dimension reduction (Hinton and Salakhutdinov, 2006; Jiang et al., 2014; Theis et al., 2017). The heuristic for using this technique in anomaly detection is that the learned feature representations are forced to capture important regularities of the data in order to minimize reconstruction errors; anomalies are difficult to reconstruct from the resulting representations and thus have large reconstruction errors.
Assumptions. Normal data instances can be better reconstructed from the compressed feature space than anomalies.
Autoencoder (AE) networks are the commonly-used techniques in this category. An AE is composed of an encoding network and a decoding network. The encoder maps the original data onto a low-dimensional feature space, while the decoder attempts to recover the data from the projected low-dimensional space. The parameters of these two networks are learned with a reconstruction loss function. A bottleneck network architecture is often used to obtain representations of lower dimensionality than the original data, which forces the model to retain the information that is important in reconstructing the data instances. To minimize the overall reconstruction error, the retained information is required to be as relevant as possible to the dominant instances, e.g., the normal instances. As a result, data instances such as anomalies that deviate from the majority of the data are poorly reconstructed. The data reconstruction error can therefore be used directly as an anomaly score. The basic formulation of this approach is given as follows.
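$$\mathbf{z} = \phi_{e}(\mathbf{x}; \Theta_{e}), \tag{4}$$
$$\hat{\mathbf{x}} = \phi_{d}(\mathbf{z}; \Theta_{d}), \tag{5}$$
$$\{\Theta_{e}^{*}, \Theta_{d}^{*}\} = \arg\min_{\Theta_{e},\, \Theta_{d}} \sum_{\mathbf{x} \in \mathcal{X}} \Big\| \mathbf{x} - \phi_{d}\big(\phi_{e}(\mathbf{x}; \Theta_{e}); \Theta_{d}\big) \Big\|^{2}, \tag{6}$$
$$s_{\mathbf{x}} = \Big\| \mathbf{x} - \phi_{d}\big(\phi_{e}(\mathbf{x}; \Theta_{e}^{*}); \Theta_{d}^{*}\big) \Big\|^{2}, \tag{7}$$
where $\phi_{e}$ is the encoding network with the parameters $\Theta_{e}$ and $\phi_{d}$ is the decoding network with the parameters $\Theta_{d}$. The encoder and the decoder can share the same weight parameters to reduce parameters and regularize the learning. $s_{\mathbf{x}}$ is a reconstruction error-based anomaly score of $\mathbf{x}$.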
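Reconstruction-based scoring can be sketched with a deliberately minimal autoencoder. The example below is an illustration under stated assumptions rather than any published model: it uses a linear encoder and decoder with tied weights instead of a deep network, plain gradient descent, synthetic data lying near a low-dimensional subspace, and training on normal instances only. All sizes, the learning rate and the helper names are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 10, 2  # input and bottleneck dimensionality (illustrative values)

# Synthetic "normal" data lying near a K-dimensional subspace of R^D.
basis, _ = np.linalg.qr(rng.normal(size=(D, K)))  # D x K, orthonormal columns

def sample_normal(n):
    return rng.normal(size=(n, K)) @ basis.T + 0.05 * rng.normal(size=(n, D))

X_train = sample_normal(200)
X_test_normal = sample_normal(20)
X_test_anomalous = rng.normal(size=(5, D))  # not confined to the subspace

# Tied-weight linear autoencoder: encoder z = x W, decoder x_hat = z W^T.
W = 0.1 * rng.normal(size=(D, K))

def reconstruct(X, W):
    return (X @ W) @ W.T

# Minimise the mean squared reconstruction error by gradient descent.
for _ in range(500):
    R = reconstruct(X_train, W) - X_train  # residuals, n x D
    grad = (2.0 / len(X_train)) * (X_train.T @ R @ W + R.T @ X_train @ W)
    W -= 0.05 * grad

def anomaly_score(X, W):
    """Squared reconstruction error of each instance."""
    return np.sum((X - reconstruct(X, W)) ** 2, axis=1)

normal_scores = anomaly_score(X_test_normal, W)
anomalous_scores = anomaly_score(X_test_anomalous, W)
print("largest normal score:    ", normal_scores.max())
print("smallest anomalous score:", anomalous_scores.min())
```

The bottleneck can only represent the directions that dominate the training data, so instances that leave that subspace are reconstructed poorly and receive large scores. A deep AE replaces the two linear maps with non-linear networks but is scored in the same way.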
Several types of regularized autoencoders have been introduced to learn richer and more expressive feature representations (Makhzani and Frey, 2014; Vincent et al., 2010; Rifai et al., 2011; Doersch, 2016). Particularly, sparse AE is trained in a way that encourages sparsity in the activation units of the hidden layer, e.g., by keeping the top-$K$ most active units (Makhzani and Frey, 2014). Denoising AE (Vincent et al., 2010) aims at learning representations that are robust to small variations by learning to reconstruct data from some predefined corrupted data instances rather than the original data. Contractive AE (Rifai et al., 2011) takes a step further to learn feature representations that are robust to small variations of the instances around their neighbors, which is achieved by adding a penalty term based on the Frobenius norm of the Jacobian matrix of the encoder's activations. Variational AE (Doersch, 2016) instead introduces regularization into the representation space by encoding data instances using a prior distribution over the latent space, preventing overfitting and ensuring some good properties of the learned space for enabling generation of meaningful data instances.
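Two of these regularizers reduce to simple array operations, sketched below. The function names `k_sparse` and `corrupt` are hypothetical, and masking noise is only one of several possible corruption schemes for a denoising AE; the sketch shows the operations themselves, not a full training loop.

```python
import numpy as np

def k_sparse(H, k):
    """Keep the k largest activations in each row of H and zero out the rest."""
    H = np.asarray(H, dtype=float)
    top_idx = np.argpartition(H, -k, axis=1)[:, -k:]  # column indices of the top-k units
    mask = np.zeros(H.shape, dtype=bool)
    np.put_along_axis(mask, top_idx, True, axis=1)
    return np.where(mask, H, 0.0)

def corrupt(X, drop_prob, rng):
    """Masking noise: independently zero each input with probability drop_prob."""
    keep = rng.random(X.shape) >= drop_prob
    return X * keep

# Hidden activations of two instances; only the two most active units survive.
H = np.array([[0.1, 0.9, 0.3, 0.7],
              [0.5, 0.2, 0.8, 0.4]])
H_sparse = k_sparse(H, k=2)

# A denoising AE would encode X_noisy but be trained to reconstruct the clean X.
rng = np.random.default_rng(0)
X = np.ones((3, 6))
X_noisy = corrupt(X, drop_prob=0.3, rng=rng)
```

In a sparse AE the `k_sparse` step would sit between the encoder and the decoder, whereas in a denoising AE only the input is modified and the reconstruction target stays clean.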
AEs are easy-to-implement and have straightforward intuition in detecting anomalies. As a result, they have been widely explored in the literature. Replicator neural network (Hawkins et al., 2002) is the first piece of work in exploring the idea of data reconstruction to detect anomalies, with experiments focused on static multidimensional/tabular data. The Replicator network is built upon a feed-forward multi-layer perceptron with three hidden layers. It uses parameterized
hyperbolic tangent activation functions to obtain different activation levels for different input values, which helps discretize the intermediate representations into some predefined bins. As a result, the hidden layers naturally cluster the data instances into a number of groups, enabling the detection of clustered anomalies. After this work there have been a number of studies dedicated to further enhance the performance of AEs in anomaly detection. For instance, RandNet
(Chen et al., 2017) further enhances the basic AEs by learning an ensemble of AEs. In RandNet, a set of independent AEs are trained, with each AE having some randomly selected constant dropout connections. An adaptive sampling strategy is used by exponentially increasing the sample size of the mini-batches. RandNet is focused on tabular data. The idea of autoencoder ensembles is extended to time series data in (Kieu et al., 2019). Motivated by robust principal component analysis (RPCA), RDA (Zhou and Paffenroth, 2017) attempts to improve the robustness of AEs by iteratively decomposing the original data into two subsets, normal instance set and anomaly set. This is achieved by adding a sparsity penalty or grouped penalty into its RPCA-alike objective function to regularize the coefficients of the anomaly set.AEs are also widely leveraged to detect anomalies in data other than tabular data, such as sequence data (Lu et al., 2017), graph data (Ding et al., 2019) and image/video data (Xu et al., 2015). In general, there are two types of adaptions of AEs to those complex data. The most straightforward way is to follow the same procedure as the conventional use of AEs with the exception that a particular network architecture tailored for a specific type of data is required to learn effective low-dimensional feature representations, such as CNN-AE (Hasan et al., 2016; Zhang et al., 2019), LSTM-AE (Malhotra et al., 2016), Conv-LSTM-AE (Luo et al., 2017) and GCN (graph convolutional network)-AE (Ding et al., 2019). This type of AEs embeds the encoder-decoder scheme into the full procedure of these methods. Another type of AE-based approaches is to first use AEs to learn low-dimensional representations of the complex data and then learn to predict these learned representations. The learning of AEs and representation prediction is often two separate steps. 
These approaches are different from the first type of approaches in that the prediction of representations is wrapped around the low-dimensional representations yielded by AEs. For example, in (Lu et al., 2017), a denoising AE is combined with RNNs to learn normal patterns of multivariate sequence data, in which a denoising AE with two hidden layers is first used to learn representations of multidimensional data inputs in each time step and an RNN with a simple single hidden layer is then trained to predict the representations yielded by the denoising AE. A similar approach is also used for detecting acoustic anomalies (Marchi et al., 2015), in which a more complex RNN, bidirectional LSTMs, is used.
Advantages. The advantages of data reconstruction-based methods are as follows. (i) The idea of AEs is straightforward and generic to different types of data. (ii) Different types of powerful AE variants can be leveraged to perform anomaly detection.
Disadvantages. Their disadvantages are as follows. (i) The learned feature representations can be biased by infrequent regularities and the presence of outliers or anomalies in the training data. (ii) The objective function of the data reconstruction is designed for dimension reduction or data compression, rather than anomaly detection. As a result, the resulting representations are a generic summarization of underlying regularities, which are not optimized for detecting irregularities.
Challenges Targeted. Different types of neural network layers and architectures can be used under the AE framework, allowing us to detect anomalies in high-dimensional data, as well as non-independent data such as attributed graph data (Ding et al., 2019) and multivariate sequence data (Marchi et al., 2015; Lu et al., 2017) (CH2). These methods may reduce false positives over traditional methods built upon handcrafted features if the learned representations are more expressive (CH1). AEs are generally vulnerable to noise present in the training data, since they can be trained to memorize that noise, leading to severe overfitting and small reconstruction errors for anomalies. The idea of RPCA may be incorporated into AEs to train more robust detection models (Zhou and Paffenroth, 2017) (CH4).
GAN-based anomaly detection quickly emerged as one of the popular deep anomaly detection approaches after its early use in (Schlegl et al., 2017). This approach generally aims to learn a latent feature space of the generative network so that the latent space captures the normality underlying the given data well. Some form of residual between the real instance and the generated instance is then defined as the anomaly score.
Assumption. Normal data instances can be better generated than anomalies from the latent feature space of the generative network in GANs.
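One of the early methods is AnoGAN (Schlegl et al., 2017). The key intuition is that, given any data instance $\mathbf{x}$, it aims to search for an instance $\mathbf{z}$ in the learned latent feature space of the generative network $G$ so that the corresponding generated instance $G(\mathbf{z})$ and $\mathbf{x}$ are as similar as possible. Since the latent space is enforced to capture the underlying distribution of training data, anomalies are expected to be less likely to have highly similar generated counterparts than normal instances. Specifically, a GAN is first trained with the following conventional objective:
$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{\mathbf{x} \sim p_{\mathcal{X}}}\big[\log D(\mathbf{x})\big] + \mathbb{E}_{\mathbf{z} \sim p_{\mathcal{Z}}}\big[\log\big(1 - D(G(\mathbf{z}))\big)\big] \tag{8}$$
where $G$ and $D$ are respectively the generator and discriminator networks parameterized by $\Theta_G$ and $\Theta_D$ (the parameters are omitted for brevity), and $V$ is the value function of the two-player minimax game. After that, for each $\mathbf{x}$, to find its best $\mathbf{z}$, two loss functions, namely residual loss and discrimination loss, are used to guide the search. The residual loss is defined as
$$\ell_R(\mathbf{x}, \mathbf{z}_\gamma) = \|\mathbf{x} - G(\mathbf{z}_\gamma)\|_1 \tag{9}$$
while the discrimination loss is defined based on the feature matching technique (Salimans et al., 2016):
$$\ell_{fm}(\mathbf{x}, \mathbf{z}_\gamma) = \|h(\mathbf{x}) - h(G(\mathbf{z}_\gamma))\|_1 \tag{10}$$
where $\gamma$ is the index of the search iteration step and $h$ is a feature mapping from an intermediate layer of the discriminator. The search starts with a randomly sampled $\mathbf{z}$, followed by updating $\mathbf{z}$ based on the gradients derived from the overall loss $(1-\alpha)\ell_R(\mathbf{x}, \mathbf{z}_\gamma) + \alpha\ell_{fm}(\mathbf{x}, \mathbf{z}_\gamma)$, where $\alpha$ is a hyperparameter. Throughout this search process, the parameters of the trained GAN are fixed; the loss is only used to update the coefficients of $\mathbf{z}$ for the next iteration. The anomaly score is accordingly defined upon the similarity between $\mathbf{x}$ and $\mathbf{z}$ obtained at the last step $\gamma^*$:
$$s_{\mathbf{x}} = (1-\alpha)\ell_R(\mathbf{x}, \mathbf{z}_{\gamma^*}) + \alpha\ell_{fm}(\mathbf{x}, \mathbf{z}_{\gamma^*}) \tag{11}$$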
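The latent search can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration rather than the AnoGAN implementation: a hypothetical fixed linear function `generate` stands in for a trained generator, the discrimination loss is dropped (as if its weight were zero), and a squared residual replaces the L1 residual of Eq. (9) so that plain gradient descent applies.

```python
def generate(z):
    """Hypothetical frozen generator: maps a 1-D latent z onto the line (z, 2z)."""
    return (z, 2.0 * z)


def squared_residual(x, z):
    """Squared distance between x and G(z); a smooth stand-in for Eq. (9)."""
    g = generate(z)
    return (x[0] - g[0]) ** 2 + (x[1] - g[1]) ** 2


def search_latent(x, z0=0.0, lr=0.05, steps=200):
    """Gradient descent on z only; the generator itself is never updated."""
    z = z0
    for _ in range(steps):
        g = generate(z)
        # Derivative of (x0 - z)^2 + (x1 - 2z)^2 with respect to z.
        grad = -2.0 * (x[0] - g[0]) - 4.0 * (x[1] - g[1])
        z -= lr * grad
    return z


def anomaly_score(x):
    """Residual at the final search step, in the spirit of Eq. (11)."""
    return squared_residual(x, search_latent(x))


normal_point = (1.0, 2.0)   # lies on the toy generator's manifold
odd_point = (1.0, -2.0)     # has no close generated counterpart
print(anomaly_score(normal_point) < anomaly_score(odd_point))
```

The point on the generator's manifold can be reproduced almost exactly, while the other point keeps a large residual no matter which latent value the search settles on. The per-instance loop in `search_latent` is also where the inefficiency of this scheme comes from.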
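One main issue with AnoGAN is the computational inefficiency in the iterative search of $\mathbf{z}$. One effective way to address this issue is to add an extra network that learns the mapping from data instances onto the latent space, i.e., an inverse of the generator, resulting in methods like EBGAN (Zenati et al., 2018a) and fast AnoGAN (Schlegl et al., 2019). These two methods share the same spirit. Here we focus on EBGAN, which is built upon the bi-directional GAN (BiGAN) (Donahue et al., 2017). Particularly, in addition to the generator $G$ and discriminator $D$, BiGAN has an encoder $E$ to map $\mathbf{x}$ to $\mathbf{z}$ in the latent space, and simultaneously learns the parameters of $G$, $D$ and $E$. Instead of discriminating $\mathbf{x}$ and $G(\mathbf{z})$, BiGAN aims to discriminate the pair of instances $(\mathbf{x}, E(\mathbf{x}))$ from the pair $(G(\mathbf{z}), \mathbf{z})$:
$$\min_{G,E} \max_{D} \mathbb{E}_{\mathbf{x} \sim p_{\mathcal{X}}}\big[\mathbb{E}_{\mathbf{z} \sim p_E(\cdot|\mathbf{x})} \log D(\mathbf{x}, \mathbf{z})\big] + \mathbb{E}_{\mathbf{z} \sim p_{\mathcal{Z}}}\big[\mathbb{E}_{\mathbf{x} \sim p_G(\cdot|\mathbf{z})} \log\big(1 - D(\mathbf{x}, \mathbf{z})\big)\big] \tag{12}$$
After the training, inspired by Eq. (11) in AnoGAN, EBGAN defines the anomaly score as:
$$s_{\mathbf{x}} = (1-\alpha)\ell_G(\mathbf{x}) + \alpha\ell_D(\mathbf{x}) \tag{13}$$
where $\ell_G(\mathbf{x}) = \|\mathbf{x} - G(E(\mathbf{x}))\|_1$ and $\ell_D(\mathbf{x}) = \|h(\mathbf{x}, E(\mathbf{x})) - h(G(E(\mathbf{x})), E(\mathbf{x}))\|_1$. This eliminates the need to iteratively search $\mathbf{z}$ in AnoGAN. EBGAN is extended to a method called ALAD (Zenati et al., 2018b) by adding two more discriminators, with one discriminator trying to discriminate the pair $(\mathbf{x}, \mathbf{x})$ from $(\mathbf{x}, G(E(\mathbf{x})))$ and another one trying to discriminate the pair $(\mathbf{z}, \mathbf{z})$ from $(\mathbf{z}, E(G(\mathbf{z})))$.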
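GANomaly (Akcay et al., 2018) further improves the generator over the previous work by changing the generator network to an encoder-decoder-encoder network and adding two extra loss functions. The generator can be conceptually represented as $\mathbf{x} \xrightarrow{G_E} \mathbf{z} \xrightarrow{G_D} \hat{\mathbf{x}} \xrightarrow{E} \hat{\mathbf{z}}$, in which $G$ is a composition of the encoder $G_E$ and the decoder $G_D$. In addition to the commonly used feature matching loss:
$$\ell_{fm} = \mathbb{E}_{\mathbf{x} \sim p_{\mathcal{X}}} \|h(\mathbf{x}) - h(G(\mathbf{x}))\|_2 \tag{14}$$
the generator includes a contextual loss and an encoding loss to generate more realistic instances:
$$\ell_{con} = \mathbb{E}_{\mathbf{x} \sim p_{\mathcal{X}}} \|\mathbf{x} - G(\mathbf{x})\|_1 \tag{15}$$
$$\ell_{enc} = \mathbb{E}_{\mathbf{x} \sim p_{\mathcal{X}}} \|G_E(\mathbf{x}) - E(G(\mathbf{x}))\|_2 \tag{16}$$
The contextual loss in Eq. (15) enforces the generator to consider the contextual information of the input $\mathbf{x}$ when generating $\hat{\mathbf{x}}$. The encoding loss in Eq. (16) helps the generator to learn how to encode the features of the generated instances from the training data. The overall loss of the generator is then defined as
$$\ell = \alpha\ell_{fm} + \beta\ell_{con} + \gamma\ell_{enc} \tag{17}$$
where $\alpha$, $\beta$ and $\gamma$ are the hyperparameters that determine the weight of each individual loss. Since the training data contains mainly normal instances, the encoders $G_E$ and $E$ are optimized towards the encoding of normal instances, and thus, the anomaly score can be defined as
$$s_{\mathbf{x}} = \|G_E(\mathbf{x}) - E(G(\mathbf{x}))\|_1 \tag{18}$$
in which $s_{\mathbf{x}}$ is expected to be large if $\mathbf{x}$ is an anomaly.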
There have been a number of other GANs introduced over the years such as Wasserstein GAN (Arjovsky et al., 2017) and Cycle GAN (Zhu et al., 2017). They may be used to further enhance the anomaly detection performance of the above methods, such as replacing the standard GAN with Wasserstein GAN (Schlegl et al., 2019). Another relevant research line is to adversarially learn end-to-end one-class classification models, which is categorized into the end-to-end anomaly score learning framework and discussed in Section 6.4.
Advantages. The advantages of these methods are as follows. (i) GANs have demonstrated superior capability in generating realistic instances, especially on image data, empowering the detection of abnormal instances that are poorly reconstructed from the latent space. (ii) A large number of existing GAN-based models and theories (Creswell et al., 2018) may be adapted for anomaly detection.
Disadvantages. Their disadvantages are as follows. (i) The training of GANs can suffer from multiple problems, such as failure to converge and mode collapse (Metz et al., 2017), which makes it difficult to train GAN-based anomaly detection models. (ii) The generator network can be misled and generate data instances outside the manifold of normal instances, especially when the true distribution of the given dataset is complex and/or the training data contains unexpected outliers. (iii) The GAN-based anomaly scores can be suboptimal since they are built upon the generator network, whose objective is designed for data synthesis rather than anomaly detection.
Challenges Targeted. Similar to AEs, GAN-based anomaly detection is able to detect high-dimensional anomalies by examining the reconstruction from the learned low-dimensional latent space (CH2). When the latent space preserves important anomaly discrimination information, it helps improve detection accuracy over that in the original data space (CH1).
Predictability modeling-based anomaly detection methods learn feature representations by predicting the current data instances using the representations of the previous instances within a temporal window as the context. In this section, data instances refer to individual elements in a sequence, e.g., video frames in a video sequence. This technique is widely used for sequence representation learning and prediction (Sutskever et al., 2014; Mathieu et al., 2016; Hsieh et al., 2018; Liao et al., 2018b). To achieve accurate predictions, the feature representations are enforced to capture the temporal/sequential and recurrent dependence within a given sequence length. Normal instances typically adhere well to such dependencies and can be well predicted, whereas anomalies often violate those dependencies and are unpredictable. Therefore, the prediction errors, e.g., measured by mean squared errors or likelihood values, can be used to define the anomaly scores.
Assumption. Normal instances are more predictable than anomalies given some temporally dependent contexts.
This research line is popular in video anomaly detection (Liu et al., 2018b; Ye et al., 2019; Abati et al., 2019). Video sequences involve complex high-dimensional spatial-temporal features. Different constraints over appearance and motion features are needed in the prediction objective function to ensure a faithful prediction of video frames. This deep anomaly detection approach is initially explored in (Liu et al., 2018b). Formally, given a video sequence with $t$ consecutive frames $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_t$, the learning task is to use all these frames to generate a future frame $\hat{\mathbf{x}}_{t+1}$ so that $\hat{\mathbf{x}}_{t+1}$ is as close as possible to the ground truth $\mathbf{x}_{t+1}$. Its general objective function can be formulated as
$$\alpha\ell_{pred}\big(\hat{\mathbf{x}}_{t+1}, \mathbf{x}_{t+1}\big) + \beta\ell_{adv}\big(\hat{\mathbf{x}}_{t+1}\big) \tag{19}$$
where $\hat{\mathbf{x}}_{t+1} = \psi\big(\phi(\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_t; \Theta); \mathbf{W}\big)$, $\ell_{pred}$ is the frame prediction loss measured by mean squared errors, and $\ell_{adv}$ is an adversarial loss. The popular network architecture named U-Net (Ronneberger et al., 2015) is used to instantiate the $\psi$ function for the frame generation. $\ell_{pred}$ is composed of a set of three separate losses that respectively enforce the closeness between $\hat{\mathbf{x}}_{t+1}$ and $\mathbf{x}_{t+1}$ in three key image feature descriptors: intensity, gradient and optical flow. $\ell_{adv}$ is due to the use of adversarial training to enhance the image generation. After training, for a given video frame $\mathbf{x}$, a normalized Peak Signal-to-Noise Ratio (Mathieu et al., 2016) based on the prediction difference is used to define the anomaly score. Under the same framework, an additional autoencoder-based reconstruction network is added in (Ye et al., 2019) to further refine the predicted frame quality, which helps to enlarge the anomaly score difference between normal and abnormal frames.
Another research line in this direction is based on the autoregressive models
(Gregor et al., 2014) that assume each element in a sequence is linearly dependent on the previous elements. The autoregressive models are leveraged in (Abati et al., 2019) to estimate the density of training samples in a latent space, which helps avoid the assumption of a specific family of distributions. Specifically, given $\mathbf{x}$ and its latent space representation $\mathbf{z} = \phi_e(\mathbf{x}; \Theta_e)$, the autoregressive model factorizes $p(\mathbf{z})$ as
$$p(\mathbf{z}) = \prod_{j=1}^{K} p(z_j \mid z_{1:j-1}) \tag{20}$$
where $z_{1:j-1} = \{z_1, z_2, \cdots, z_{j-1}\}$, $p(z_j \mid z_{1:j-1})$ represents the probability mass function of $z_j$ conditioned on all the previous instances $z_{1:j-1}$, and $K$ is the dimensionality size of the latent space. The objective in (Abati et al., 2019) is to jointly learn an autoencoder and a density estimation network $h(\mathbf{z}; \mathbf{W})$ equipped with autoregressive network layers. The overall loss can be represented as
$$L = \mathbb{E}_{\mathbf{x}}\Big[\big\|\mathbf{x} - \phi_d\big(\phi_e(\mathbf{x}; \Theta_e); \Theta_d\big)\big\|^2 - \lambda \log h(\mathbf{z}; \mathbf{W})\Big] \tag{21}$$
where the first term is a reconstruction error measured by MSE while the second term is an autoregressive loss measured by the log-likelihood of the representation under an estimated conditional probability density prior. Minimizing this loss enables the learning of the features that are common and easily predictable. At the evaluation stage, the reconstruction error and the log-likelihood are combined to define the anomaly score.
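Returning to the frame-prediction approach of (Liu et al., 2018b) described earlier in this section, the PSNR-based score can be sketched as follows. The min-max normalization over one video, the peak value of 1.0, the small floor on the mean squared error, and all function names are assumptions made for illustration; frames are represented as flat lists of pixel intensities rather than images.

```python
import math


def psnr(frame, predicted, max_value=1.0):
    """Peak Signal-to-Noise Ratio between a frame and its prediction."""
    mse = sum((a - b) ** 2 for a, b in zip(frame, predicted)) / len(frame)
    mse = max(mse, 1e-12)  # floor avoids division by zero for perfect predictions
    return 10.0 * math.log10(max_value ** 2 / mse)


def normalized_scores(frames, predictions):
    """Min-max normalize PSNR over one video (an assumed scheme).

    Returns values in [0, 1]; low values mark poorly predicted frames,
    which are the candidates for anomalies.
    """
    values = [psnr(f, p) for f, p in zip(frames, predictions)]
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [1.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Three toy two-pixel frames; the last prediction is far off.
frames = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
predictions = [[0.5, 0.6], [0.5, 0.51], [0.9, 0.1]]
scores = normalized_scores(frames, predictions)
print(scores)
```

A frame whose prediction error is large gets a low PSNR and hence a low normalized value, so thresholding these values separates well predicted frames from poorly predicted ones.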
Advantages. The advantages of this category of methods are as follows. (i) A number of sequence learning techniques can be adapted and incorporated into this approach. (ii) This approach enables the learning of different types of temporal and spatial dependencies.
Disadvantages. Their disadvantages are as follows. (i) This approach is limited to anomaly detection in sequence data. (ii) The sequential predictions can be computationally expensive. (iii) The learned representations may be suboptimal for anomaly detection, as the underlying objective is designed for sequential prediction rather than anomaly detection.
Challenges Targeted. This approach is particularly designed to learn expressive temporally-dependent low-dimensional representations, which helps address the false positives of anomaly detection in high-dimensional and/or temporal datasets (CH1 & CH2). The prediction here is conditioned on some elapsed temporal instances, so this category of methods is able to detect temporal context-based conditional anomalies (CH5).
This approach learns representations of normality by building self-supervised classification models and identifies instances that are inconsistent with the classification models as anomalies. This approach is rooted in the cross-feature analysis or feature model-based anomaly detection (Huang et al., 2003; Noto et al., 2012; Tenenboim-Chekina et al., 2013). These studies evaluate the normality of data instances by their consistency/agreement with a set of predictive (classification/regression) models, with each model learning to predict one feature based on the rest of the features. The consistency of a given test instance can be measured by the average number of correct predictions or average prediction probability (Huang et al., 2003), the log loss-based surprisal (Noto et al., 2012), or the majority voting of binary decisions (Tenenboim-Chekina et al., 2013) given the classification/regression models across all features. Unlike these studies that focus on tabular data and build the feature models using the original data, deep consistency-based anomaly detection focuses on image data and builds the predictive models by using feature transformation-based augmented data. To effectively discriminate the transformed instances, the classification models are enforced to learn features that are highly important to describe the underlying patterns of the instances presented in the training data. Therefore, normal instances generally have stronger agreements with the classification models.
Assumptions. Normal instances are more consistent with augmented self-supervised predictive models than anomalies.
This approach is initially explored in (Golan and El-Yaniv, 2018). To build the predictive models, different compositions of geometric transformation operations, including horizontal flipping, translations, and rotations, are first applied to a set of normal training images. A deep multi-class classification model is trained on the augmented data, treating data instances with the same transformation operation as coming from the same class, i.e., a synthetic class. At the evaluation stage, test instances are augmented with each of the transformation compositions, and their normality score is defined by an aggregation of all softmax classification scores for the transformed versions of a given test instance. Its loss function is defined as
$$L_{cons} = \mathrm{CE}\big(\eta(\mathbf{z}_{T_j}; \mathbf{W}), \mathbf{y}_{T_j}\big) \tag{22}$$
where $\mathbf{z}_{T_j} = \phi\big(T_j(\mathbf{x}); \Theta\big)$ is a low-dimensional feature representation of instance $\mathbf{x}$ augmented by the transformation operation type $T_j$, $\eta$ is a multi-class classifier parameterized with $\mathbf{W}$, $\mathbf{y}_{T_j}$ is a one-hot encoding of the synthetic class assigned to instances that are augmented using the transformation operation $T_j$, and $\mathrm{CE}$ is a standard cross-entropy loss function.
By minimizing Eq. (22), we obtain the representations that are optimized for the classifier $\eta$. We can then apply the feature learner $\phi(\cdot; \Theta^*)$ and the classifier $\eta(\cdot; \mathbf{W}^*)$ to obtain a classification score for each test instance augmented with a transformation operation $T_j$. The classification scores of each test instance w.r.t. different $T_j$ are then aggregated to compute the anomaly score. To achieve that, the classification scores conditioned on each $T_j$ are assumed to follow a Dirichlet distribution in (Golan and El-Yaniv, 2018) to estimate the consistency of the test instance with the classification model $\eta$. In fact, as shown in (Golan and El-Yaniv, 2018), a simple average of the classification scores associated with different $T_j$ works similarly well as the Dirichlet-based anomaly score.
A semi-supervised setting, i.e., training data contains normal instances only, is assumed in (Golan and El-Yaniv, 2018). A similar idea is explored in the unsupervised setting in (Wang et al., 2019), with the transformation sets containing four transformation operations, i.e., rotation, flipping, shifting and patch re-arranging. Two key insights revealed in (Wang et al., 2019) are that (i) the gradient magnitude induced by normal instances is normally substantially larger than that induced by outliers during the training of such self-supervised multi-class classification models; and (ii) the network updating direction is also biased towards normal instances. As a result of these two properties, normal instances often have stronger agreement with the classification model than anomalies. Three strategies of using the classification scores to define the anomaly scores are evaluated, including average prediction probability, maximum prediction probability, and negative entropy across all prediction probabilities (Wang et al., 2019). Their results show that the negative entropy-based anomaly scores generally perform better than the other two strategies.
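The three scoring strategies can be sketched on top of the classifier's softmax outputs. This is one plausible reading rather than the exact definitions of (Wang et al., 2019): `outputs[j]` is assumed to be the softmax vector predicted for the copy of a test instance transformed by operation `j`, each function returns a normality score (higher means more normal), and all names are hypothetical.

```python
import math


def average_probability(outputs):
    """Mean probability assigned to the class of the applied transformation."""
    return sum(p[j] for j, p in enumerate(outputs)) / len(outputs)


def maximum_probability(outputs):
    """Mean of the largest class probability over the transformed copies."""
    return sum(max(p) for p in outputs) / len(outputs)


def negative_entropy(outputs):
    """Mean negative Shannon entropy; values near 0 mean confident predictions."""
    total = 0.0
    for p in outputs:
        total += sum(q * math.log(q) for q in p if q > 0.0)
    return total / len(outputs)


# Toy softmax outputs for three transformations (rows) and three classes.
confident = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
confused = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]

for name, outputs in (("confident", confident), ("confused", confused)):
    print(name, average_probability(outputs),
          maximum_probability(outputs), negative_entropy(outputs))
```

An instance whose transformed copies are all classified confidently and correctly scores higher under each strategy than one for which the classifier is unsure; an anomaly score is obtained by negating any of the three.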
Advantages. The advantages of deep consistency-based methods are as follows. (i) They work well in both the unsupervised and semi-supervised settings. (ii) Anomaly scoring is grounded by some intrinsic properties of gradient magnitude and its updating.
Disadvantages. Their disadvantages are as follows. (i) The feature transformation operations are often data-dependent. The above transformation operations are applicable to image data only. Different transformation operations need to be explored to adapt this approach to other types of data. (ii) Although the classification model is trained in an end-to-end manner, the consistency-based anomaly scores are derived from the classification scores rather than being an integrated part of the optimization, and thus they may be suboptimal.
Challenges Targeted. The expressive low-dimensional representations of normality this approach learns help detect anomalies better than in the original high-dimensional space (CH1 & CH2). Due to some intrinsic differences between anomalies and normal instances presented in the self-supervised classifiers, this approach is also able to work in an unsupervised setting (Wang et al., 2019), demonstrating good robustness to anomaly contamination in the training data (CH4).
Anomaly measure-dependent feature learning aims at learning feature representations that are specifically optimized for one particular existing anomaly measure. Formally, the framework for this group of methods can be represented as
$$\{\Theta^*, \mathbf{W}^*\} = \arg\min_{\Theta, \mathbf{W}} \sum_{\mathbf{x} \in \mathcal{X}} \ell\Big(f\big(\phi(\mathbf{x}; \Theta); \mathbf{W}\big)\Big) \tag{23}$$
$$s_{\mathbf{x}} = f\big(\phi(\mathbf{x}; \Theta^*); \mathbf{W}^*\big) \tag{24}$$
where $f$ is an existing anomaly scoring measure operating on the representation space. Note that whether $f$ involves trainable parameters $\mathbf{W}$ or not depends on the anomaly measure used. Different from the generic feature learning approach as in Eqs. (2-3) that calculates anomaly scores based on some heuristics after obtaining the learned representations, this research line directly incorporates an existing anomaly measure $f$ into the feature learning objective function to optimize the feature representations specifically for $f$. Below we review representation learning specifically designed for three types of popular anomaly measures, including distance-based measure, one-class classification measure and clustering-based measure.
Deep distance-based anomaly detection aims to learn feature representations that are specifically optimized for a specific type of distance-based anomaly measures. Distance-based methods are straightforward and easy to implement. There have been a number of effective distance-based anomaly measures introduced, e.g., DB outliers (Knorr and Ng, 1999; Knorr et al., 2000), $k$-nearest neighbor distance (Ramaswamy et al., 2000a, b), average $k$-nearest neighbor distance (Angiulli and Pizzuti, 2002), relative distance (Zhang et al., 2009), and random nearest neighbor distance (Sugiyama and Borgwardt, 2013; Pang et al., 2015). One major limitation of these traditional distance-based anomaly measures is that they fail to work effectively in high-dimensional data due to the curse of dimensionality. Since deep distance-based anomaly detection techniques project data onto a low-dimensional space before applying the distance measures, they can overcome this limitation.
Assumption. Anomalies are distributed far from their closest neighbors while normal instances are located in dense neighborhoods.
Deep distance-based anomaly detection is first explored in (Pang et al., 2018a), in which the random nearest neighbor distance-based anomaly measure (Sugiyama and Borgwardt, 2013; Pang et al., 2015) is leveraged to drive the learning of low-dimensional representations out of ultrahigh-dimensional data. Particularly, the key idea is that the representations are optimized so that the nearest neighbor distance of pseudo-labeled anomalies in random subsamples is substantially larger than that of pseudo-labeled normal instances. The pseudo labels are generated by some off-the-shelf anomaly detection methods. Let $\mathcal{S} \subset \mathcal{X}$ be a subset of data instances randomly sampled from the dataset $\mathcal{X}$, and let $\mathcal{A}$ and $\mathcal{N}$ respectively be the pseudo-labeled anomaly and normal instance sets, with $\mathcal{X} = \mathcal{A} \cup \mathcal{N}$ and $\emptyset = \mathcal{A} \cap \mathcal{N}$. Its loss function is built upon the hinge loss function (Rosasco et al., 2004) and can be represented as
$$L_{query} = \frac{1}{|\mathcal{X}|} \sum_{\mathbf{x} \in \mathcal{A},\, \mathbf{x}' \in \mathcal{N}} \max\big\{0,\; m + f(\mathbf{x}', \mathcal{S}; \Theta) - f(\mathbf{x}, \mathcal{S}; \Theta)\big\} \tag{25}$$
where $m$ is a predefined constant for the margin between two distances yielded by $f(\mathbf{x}, \mathcal{S}; \Theta)$, which is a random nearest neighbor distance function operated in the representation space:
$$f(\mathbf{x}, \mathcal{S}; \Theta) = \min_{\mathbf{x}' \in \mathcal{S}} \big\|\phi(\mathbf{x}; \Theta) - \phi(\mathbf{x}'; \Theta)\big\|_2 \tag{26}$$
$L_{query}$ is a hinge loss function augmented by the random nearest neighbor distance-based anomaly measure defined in Eq. (26). Minimizing the loss in Eq. (25) enforces that the random nearest neighbor distance of anomalies is at least $m$ greater than that of normal instances in the $\phi$-based representation space. At the evaluation stage, the random distance in Eq. (26) is used directly to obtain the anomaly score for each test instance. Following this approach, we might also derive similar representation learning tailored for other distance-based measures by replacing Eq. (26) with the other measures, such as the $k$-nearest neighbor distance (Ramaswamy et al., 2000b) or the average $k$-nearest neighbor distance (Angiulli and Pizzuti, 2002). However, these measures are significantly more computationally costly than the random nearest neighbor distances. Thus, one major challenge for such adaptations would be the prohibitively high computational cost.
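The interaction between the distance measure in Eq. (26) and the hinge loss in Eq. (25) can be made concrete with a small sketch. It rests on several assumptions made for illustration: the feature map is the identity (no network is trained), a fixed list stands in for the random subsample, the pseudo labels are given by hand, and the function names are hypothetical.

```python
import math


def rand_nn_distance(x, subsample):
    """Eq. (26) with an identity feature map: distance to the nearest subsample member."""
    return min(math.dist(x, s) for s in subsample)


def hinge_query_loss(anomalies, normals, subsample, margin=1.0):
    """Eq. (25): penalize pairs where the anomaly is not `margin` farther
    from the subsample than the normal instance."""
    total = 0.0
    for a in anomalies:
        for n in normals:
            gap = rand_nn_distance(n, subsample) - rand_nn_distance(a, subsample)
            total += max(0.0, margin + gap)
    # Normalize by the size of the whole dataset, |A| + |N|.
    return total / (len(anomalies) + len(normals))


normals = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
anomalies = [(5.0, 5.0)]
subsample = [(0.0, 0.0), (0.1, 0.1)]  # stands in for a random draw from the data

print(rand_nn_distance(anomalies[0], subsample))
print(hinge_query_loss(anomalies, normals, subsample, margin=1.0))
print(hinge_query_loss(anomalies, normals, subsample, margin=10.0))
```

With a small margin the loss vanishes because the pseudo-labeled anomaly is already much farther from the subsample than any normal instance; a larger margin makes the loss positive, which is the signal that would drive the representation network to push anomalies farther away.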
In contrast to (Pang et al., 2018a), which requires querying the nearest neighbor distances in random data subsets, a simpler idea explored in (Wang et al., 2020) and inspired by (Burda et al., 2019b) uses the distance between optimized representations and randomly projected representations of the same instances to guide the representation learning. The objective of the method is as follows
$$\Theta^* = \arg\min_{\Theta} \sum_{\mathbf{x} \in \mathcal{X}} f\big(\phi(\mathbf{x}; \Theta), \phi'(\mathbf{x})\big) \tag{27}$$
where $\phi'$ is a random mapping function that is instantiated by the neural network used in $\phi$ with fixed random weights, and $f$ is a measure of distance between the two representations of the same data instance. As discussed in (Burda et al., 2019b), solving Eq. (27) is equivalent to performing knowledge distillation from a random neural network and helps learn the frequency of different underlying patterns in the data. However, Eq. (27) ignores the relative proximity between data instances and is sensitive to the anomalies present in the data. To address these two issues, an additional loss function that aims to predict the distance between random instance pairs is added to Eq. (27), and a boosting process is used during training to iteratively filter potential anomalies and retrain the representation learning model. At the evaluation stage, $f\big(\phi(\mathbf{x}; \Theta^*), \phi'(\mathbf{x})\big)$ is used to compute the anomaly scores.
Advantages. The advantages of this category of methods are as follows. (i) The distance-based anomalies are straightforward and well defined with rich theoretical supports in the literature. Thus, deep distance-based anomaly detection methods can be well grounded due to the strong foundation built in previous relevant work. (ii) They work in low-dimensional representation spaces and can effectively deal with high-dimensional data that traditional distance-based anomaly measures fail to handle. (iii) They are able to learn representations specifically tailored for themselves.
Disadvantages. Their disadvantages are as follows. (i) The extensive computation involved in most distance-based anomaly measures may be an obstacle to incorporating them into the representation learning process. (ii) Their capabilities may be limited by the inherent weaknesses of the distance-based anomaly measures.
Challenges Targeted. This approach is able to learn low-dimensional representations tailored for existing distance-based anomaly measures, addressing the notorious curse of dimensionality in distance-based detection (Zimek et al., 2012) (CH1 & CH2). As shown in (Pang et al., 2018a), an adapted triplet loss can be devised to utilize a few labeled anomaly examples to learn more effective normality representations (CH3). Benefiting from pseudo anomaly labeling, the methods (Pang et al., 2018a; Wang et al., 2020) are also robust to potential anomaly contamination and work effectively in the fully unsupervised setting (CH4).
This category of methods aims to learn feature representations that are customized to subsequent one-class classification-based anomaly detection measures. One-class classification refers to the problem of learning a description of a set of data instances to detect whether new instances conform to the training data or not. It is one of the most popular approaches for anomaly detection (Moya et al., 1993; Schölkopf et al., 2001; Tax and Duin, 2004; Roth, 2005). Most one-class classification models are inspired by Support Vector Machines (SVM) (Cortes and Vapnik, 1995), such as the two widely-used one-class models: one-class SVM (or $\nu$-SVC) (Schölkopf et al., 2001) and Support Vector Data Description (SVDD) (Tax and Duin, 2004). One main research line here is to learn representations that are specifically optimized for these traditional one-class classification models such as one-class SVM and SVDD. This is the focus of this section. Another line is to learn an end-to-end adversarial one-class classification model, which will be discussed in Section 6.4.
Assumption. All normal instances come from a single (abstract) class and can be summarized by a compact model, to which anomalies do not conform.
There are a number of studies dedicated to combining one-class SVM with neural networks (Wu et al., 2019; Nguyen and Vien, 2018; Chalapathy et al., 2018). Conventional one-class SVM learns a hyperplane that maximizes the margin between training data instances and the origin. The key idea of deep one-class SVM is to learn the one-class hyperplane from the neural network-enabled low-dimensional representation space rather than the original input space. Let $\mathbf{z} = \phi(\mathbf{x}; \Theta)$, then a generic formulation of the key ideas in (Wu et al., 2019; Nguyen and Vien, 2018; Chalapathy et al., 2018) can be represented as
$$\min_{r, \Theta, \mathbf{w}} \frac{1}{2}\|\Theta\|^2 + \frac{1}{\nu N} \sum_{i=1}^{N} \max\big\{0,\; r - \mathbf{w}^{\top}\mathbf{z}_i\big\} - r \tag{28}$$
where $r$ is the margin parameter, $\Theta$ are the parameters of a representation network, and $\mathbf{w}^{\top}\mathbf{z}$ (i.e., $\mathbf{w}^{\top}\phi(\mathbf{x}; \Theta)$) replaces the original dot product $\langle \mathbf{w}, \Phi(\mathbf{x}) \rangle$ that satisfies $k(\mathbf{x}_i, \mathbf{x}_j) = \langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j) \rangle$. Here $\Phi$ is a RKHS (Reproducing Kernel Hilbert Space) associated mapping and $k(\cdot, \cdot)$ is a kernel function; $\nu$ is a hyperparameter that can be seen as an upper bound of the fraction of the anomalies in the training data. Any instances that have $r - \mathbf{w}^{\top}\mathbf{z}_i > 0$ can be reported as anomalies.
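For fixed representations, the hinge-based objective in Eq. (28) and the resulting decision rule can be evaluated directly. The sketch below is only an illustration under stated assumptions: the representations are given as tuples instead of being produced by a network, the regularization term is passed in as a precomputed number, instances whose projection falls below the margin are the ones flagged, and the function names are hypothetical.

```python
def dot(u, v):
    """Plain dot product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))


def oc_svm_objective(reps, w, r, nu, theta_sq_norm=0.0):
    """Value of the Eq. (28) objective for fixed representations z_i."""
    hinge = sum(max(0.0, r - dot(w, z)) for z in reps)
    return 0.5 * theta_sq_norm + hinge / (nu * len(reps)) - r


def flag_anomalies(reps, w, r):
    """Report instances on the wrong side of the margin: r - w.z > 0."""
    return [r - dot(w, z) > 0.0 for z in reps]


reps = [(1.0, 1.0), (2.0, 1.0), (1.5, 2.0), (-1.0, -0.5)]
w = (1.0, 1.0)   # hypothetical hyperplane normal
r = 1.0          # hypothetical margin

print(oc_svm_objective(reps, w, r, nu=0.25))
print(flag_anomalies(reps, w, r))
```

Only the last representation violates the margin, so it alone contributes to the hinge term and is the only instance flagged. In a deep one-class SVM, the same quantity would be minimized jointly over the margin, the hyperplane and the network parameters.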
This formulation brings two main benefits: (i) it can leverage (pretrained) deep neural networks to learn more expressive features for downstream anomaly detection, and (ii) it also helps remove the computationally expensive pairwise distance computation in the kernel function. As shown in (Wu et al., 2019; Nguyen and Vien, 2018), the reconstruction loss in AEs can be added into Eq. (28) to enhance the expressiveness of the learned representations $\mathbf{z}$. As shown in (Rahimi and Recht, 2008), many kernel functions can be approximated with random Fourier features. Motivated by this, before $\mathbf{w}^{\top}\mathbf{z}$, a further mapping $h$ may be applied to $\mathbf{z}$ to generate random Fourier features, resulting in $\mathbf{w}^{\top}h(\mathbf{z})$, which may help achieve a better one-class SVM model.
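Another research line studies deep learning models for SVDD (Ruff et al., 2018, 2020). SVDD aims to learn a minimum hypersphere characterized by a center $\mathbf{c}$ and a radius $r$ so that the sphere contains all training data instances, i.e.,
$$\min_{r, \mathbf{c}, \xi} \; r^2 + \frac{1}{\nu N} \sum_{i=1}^{N} \xi_i \tag{29}$$
$$\text{s.t.} \;\; \|\Phi(\mathbf{x}_i) - \mathbf{c}\|^2 \le r^2 + \xi_i, \;\; \xi_i \ge 0, \;\; \forall i \tag{30}$$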
Similar to deep one-class SVM, Deep SVDD (Ruff et al., 2018) also aims to leverage neural networks to map data instances into the sphere of minimum volume, and then employs the hinge loss function to guarantee the margin between the sphere center and the projected instances. The feature learning and the SVDD objective can then be jointly trained by minimizing the following loss:
$$\min_{r, \Theta} \; r^2 + \frac{1}{\nu N} \sum_{i=1}^{N} \max\big\{0,\; \|\phi(\mathbf{x}_i; \Theta) - \mathbf{c}\|^2 - r^2\big\} + \frac{\lambda}{2}\|\Theta\|^2 \tag{31}$$
This assumes the training data contains a small proportion of anomaly contamination in the unsupervised setting. In the semi-supervised setting, the loss function can be simplified as
$$\min_{\Theta} \; \frac{1}{N} \sum_{i=1}^{N} \|\phi(\mathbf{x}_i; \Theta) - \mathbf{c}\|^2 + \frac{\lambda}{2}\|\Theta\|^2 \tag{32}$$
which directly minimizes the mean distance between the representations of training data instances and the center $\mathbf{c}$. Note that including $\mathbf{c}$ as trainable parameters in Eq. (32) can lead to trivial solutions. It is shown in (Ruff et al., 2018) that $\mathbf{c}$ can be fixed as the mean of the feature representations yielded by performing a single initial forward pass. Deep SVDD can also be further extended to address another semi-supervised setting where a small number of both labeled normal instances and anomalies are available (Ruff et al., 2020). The key idea is to minimize the distance of labeled normal instances to the center while at the same time maximizing the distance of known anomalies to the center. This can be achieved by adding a term proportional to $\sum_{j=1}^{m} \big(\|\phi(\mathbf{x}'_j; \Theta) - \mathbf{c}\|^2\big)^{y_j}$ into Eq. (32), where $\mathbf{x}'_j$ is one of $m$ labeled instances, with $y_j = +1$ when it is a normal instance and $y_j = -1$ otherwise.
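The simplified objective in Eq. (32) and the scoring it implies can be sketched without any network. The assumptions here are made for illustration only: the tuples in `train_reps` play the role of representations from an initial forward pass, the weight regularizer is omitted, the squared distance to the fixed center is used as the anomaly score, and the function names are hypothetical.

```python
def mean_center(reps):
    """Fix the center as the mean of the initial representations."""
    dim = len(reps[0])
    return tuple(sum(z[d] for z in reps) / len(reps) for d in range(dim))


def svdd_score(z, center):
    """Squared distance to the center; larger means more anomalous."""
    return sum((a - b) ** 2 for a, b in zip(z, center))


def one_class_loss(reps, center):
    """Data term of Eq. (32): mean squared distance to the fixed center."""
    return sum(svdd_score(z, center) for z in reps) / len(reps)


train_reps = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
center = mean_center(train_reps)

print(center)
print(one_class_loss(train_reps, center))
print(svdd_score((1.0, 1.0), center), svdd_score((4.0, 5.0), center))
```

Training would shrink `one_class_loss` by updating the network that produces the representations while the center stays fixed, which is what prevents the trivial solution of moving the center and the representations together.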
Advantages. The advantages of this category of methods are as follows. (i) The one-class classification-based anomalies are well studied in the literature, which provides a strong foundation for deep one-class classification-based methods. (ii) The representation learning and one-class classification models can be unified to learn tailored and more optimal representations. (iii) They free the users from manually choosing suitable kernel functions in traditional one-class models.
Disadvantages. Their disadvantages are as follows. (i) The one-class models may work ineffectively in datasets with complex distributions within the normal class. (ii) The detection performance is dependent on the one-class classification-based anomaly measures.
Challenges Targeted. This category of methods enhances detection accuracy by learning lower-dimensional representation space optimized for one-class classification models (CH1 & CH2). A small number of labeled normal and abnormal data can be leveraged by these methods (Ruff et al., 2020) to learn more effective one-class description models, which can not only detect known anomalies but also novel classes of anomaly (CH3).
Deep clustering-based anomaly detection aims at learning representations so that anomalies clearly deviate from the clusters in the newly learned representation space. The tasks of clustering and anomaly detection are naturally tied to each other, so there have been a large number of studies dedicated to using clustering results to define anomalies, e.g., cluster size (Jiang et al., 2001), distance to cluster centers (He et al., 2003), distance between cluster centers (Jiang et al., 2006), and cluster membership (Schubert et al., 2017). Gaussian mixture model-based anomaly detection (Mahadevan et al., 2010; Emmott et al., 2013) is also included in this category due to some of its intrinsic relations to clustering, e.g., the likelihood fit in the Gaussian mixture model (GMM) corresponds to an aggregation of the distances of data instances to the centers of the Gaussian clusters/components (Aggarwal, 2017).
Assumptions. Normal instances have stronger adherence to clusters than anomalies.
Deep clustering, which aims to learn feature representations tailored for a specific clustering algorithm, is the most critical component of this anomaly detection method. A number of studies have explored this problem in recent years (Tian et al., 2014; Xie et al., 2016; Yang et al., 2016; Dilokthanakul et al., 2017; Ghasedi Dizaji et al., 2017; Caron et al., 2018; Yang et al., 2019). The main motivation is that the performance of clustering methods is highly dependent on the input data. Learning feature representations specifically tailored for a clustering algorithm can help guarantee its performance on different datasets (Aljalbout et al., 2018). In general, there are two key intuitions here: (i) good representations enable better clustering and good clustering results can provide effective supervisory signals to representation learning; and (ii) representations that are optimized for one clustering algorithm are not necessarily useful for other clustering algorithms due to the difference of the underlying assumptions made by the clustering algorithms.
The deep clustering methods typically consist of two modules: performing clustering in the forward pass and learning representations using the cluster assignment as pseudo class labels in the backward pass. Their loss function is often the most critical part, which can be generally formulated as
$$\alpha\,\ell_{clu}\big(f(\phi(\mathbf{x};\Theta);\mathbf{W}),\, y_{\mathbf{x}}\big) + \beta\,\ell_{aux}(\mathcal{X}) \tag{33}$$
where $\ell_{clu}$ is a clustering loss function, within which $\phi$ is the feature learner parameterized by $\Theta$, $f$ is a clustering assignment function parameterized by $\mathbf{W}$ and $y_{\mathbf{x}}$ represents pseudo class labels yielded by the clustering; $\ell_{aux}$ is a non-clustering loss function used to enforce additional constraints on the learned representations; and $\alpha$ and $\beta$ are two hyperparameters to control the importance of the two losses. $\ell_{clu}$ can be instantiated with a $k$-means loss (Xie et al., 2016; Caron et al., 2018), a spectral clustering loss (Tian et al., 2014; Yang et al., 2019), an agglomerative clustering loss (Yang et al., 2016), or a GMM loss (Dilokthanakul et al., 2017), enabling the representation learning for the specific targeted clustering algorithm. $\ell_{aux}$ is often instantiated with an autoencoder-based reconstruction loss (Ghasedi Dizaji et al., 2017; Yang et al., 2019) to learn robust and/or local structure preserved representations, or to prevent collapsing clusters.
After the deep clustering, the cluster assignments in the resulting function $f$ can then be utilized to compute anomaly scores based on (Jiang et al., 2001; He et al., 2003; Jiang et al., 2006; Schubert et al., 2017). However, it should be noted that the deep clustering may be biased by anomalies if the training data is anomaly-contaminated. Therefore, the above methods can be applied to the semi-supervised setting where the training data is composed of normal instances only. In the unsupervised setting, some extra constraints are required in $\ell_{clu}$ and/or $\ell_{aux}$ to eliminate the impact of potential anomalies.
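To make the last step concrete, the sketch below shows one simple way cluster information could be turned into anomaly scores once representations and cluster centers have been learned: the distance of each representation to its nearest cluster center. This follows the general idea of distance-to-center measures only; the function name, the toy centers and the toy representations are illustrative assumptions, not the exact procedure of any cited method.

```python
import math


def nearest_center_score(z, centers):
    """Anomaly score = Euclidean distance from z to its nearest cluster center."""
    return min(math.dist(z, c) for c in centers)


# Hypothetical cluster centers and learned representations (2-D for readability).
centers = [[0.0, 0.0], [10.0, 10.0]]
reps = [[0.5, 0.0], [10.0, 9.0], [5.0, 5.0]]

scores = [nearest_center_score(z, centers) for z in reps]
most_anomalous = max(range(len(reps)), key=lambda i: scores[i])
```

Here the third representation lies between the two clusters and therefore receives the largest score, which matches the assumption that anomalies adhere to clusters more weakly than normal instances.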
The aforementioned deep clustering methods are focused on learning optimal clustering results. Although their clustering results are applicable to anomaly detection, the learned representations may not capture the abnormality of anomalies well. It is important to utilize clustering techniques to learn representations so that anomalies have clearly weaker adherence to clusters than normal instances. Some promising results for this type of approach are shown in (Zong et al., 2018; Liao et al., 2018a), in which they aim to learn representations for a GMM-based model with the representations optimized to highlight anomalies. The general formulation of these two studies is similar to Eq. (33), with its clustering loss and non-clustering loss respectively specified as a GMM loss and an autoencoder-based reconstruction loss, but to learn deviated representations of anomalies, they concatenate some handcrafted features based on the reconstruction errors from the autoencoders with the learned features of the autoencoder to optimize the combined features together. Since the reconstruction error-based handcrafted features capture the data normality, the resulting representations are better suited to anomaly detection than those yielded by other deep clustering methods.
Advantages. The advantages of deep clustering-based methods are as follows. (i) A number of deep clustering methods and theories can be utilized to support the effectiveness and theoretical foundation of anomaly detection. (ii) Compared to traditional clustering-based methods, deep clustering-based methods learn specifically optimized representations that help spot the anomalies more easily than on the original data, especially when dealing with intricate data sets.
Disadvantages. Their disadvantages are as follows. (i) The performance of anomaly detection is heavily dependent on the clustering results. (ii) The clustering process may be biased by contaminated anomalies in the training data, which in turn leads to less effective representations.
Challenges Targeted. The clustering-based anomaly measures are applied to newly learned low-dimensional representations of data inputs; when the new representation space preserves sufficient discrimination information, the deep methods can achieve better detection accuracy than that in the original data space (CH1 & CH2). Some clustering algorithms are sensitive to outliers, so the deep clustering and the subsequent anomaly detection can be largely misled when the given data is contaminated by anomalies, but as shown in (Zong et al., 2018), deep clustering using handcrafted features from the reconstruction errors of deep autoencoders may help learn more robust detection models w.r.t. anomaly contamination (CH4).
This research line aims at learning scalar anomaly scores in an end-to-end fashion. Compared to anomaly measure-dependent feature learning, the anomaly scoring in this type of approach is not dependent on existing anomaly measures; it has a neural network that directly learns the anomaly scores. Novel loss functions are often required to drive the anomaly scoring network. Formally, this category of methods aims at learning an end-to-end anomaly score learning network: $\tau(\cdot;\Theta): \mathcal{X} \mapsto \mathbb{R}$. The underlying framework can be represented as
$$\Theta^{*} = \arg\min_{\Theta} \sum_{\mathbf{x}\in\mathcal{X}} \ell\big(\tau(\mathbf{x};\Theta)\big) \tag{34}$$
$$s_{\mathbf{x}} = \tau(\mathbf{x};\Theta^{*}) \tag{35}$$
Below we review four main approaches to fulfill the goal in Eqs. (34-35): ranking models, prior-driven models, softmax models and end-to-end one-class classification models. The key to this framework is to incorporate order or discriminative information into the anomaly scoring network. These four approaches represent four different perspectives to design this network.
This group of methods aims to directly learn a ranking model, such that data instances can be sorted based on an observable ordinal variable associated with the absolute/relative ordering relation of the abnormality. The anomaly scoring neural network is driven by the observable ordinal variable.
Assumptions. There exists an observable ordinal variable that captures some data abnormality.
One research line of this approach is to devise ordinal regression-based loss functions to drive the anomaly scoring neural network (Pang et al., 2019b, 2020). In (Pang et al., 2020), a self-trained deep ordinal regression model is introduced to directly optimize the anomaly scores for unsupervised video anomaly detection. Particularly, it assumes an observable ordinal variable $\mathbf{y} = \{c_1, c_2\}$ with $c_1 > c_2$. Let $\tau(\mathbf{x};\Theta) = \eta(\phi(\mathbf{x};\Theta_t);\Theta_s)$, let $\mathcal{A}$ and $\mathcal{N}$ respectively be the pseudo anomaly and normal instance sets, and let $\mathcal{G} = \mathcal{A} \cup \mathcal{N}$; then the objective function is formulated as
$$\arg\min_{\Theta} \sum_{\mathbf{x}\in\mathcal{G}} \ell\big(\tau(\mathbf{x};\Theta),\, y_{\mathbf{x}}\big) \tag{36}$$
where $\ell(\cdot,\cdot)$ is a MSE/MAE-based loss function, $y_{\mathbf{x}} = c_1, \forall \mathbf{x} \in \mathcal{A}$ and $y_{\mathbf{x}} = c_2, \forall \mathbf{x} \in \mathcal{N}$. Here $y$ takes two scalar ordinal values only, so it is a two-class ordinal regression.
The end-to-end anomaly scoring network takes $\mathcal{A}$ and $\mathcal{N}$ as inputs and learns to optimize the anomaly scores such that the data inputs that behave similarly to those in $\mathcal{A}$ ($\mathcal{N}$) receive large (small) scores as close to $c_1$ ($c_2$) as possible, resulting in larger anomaly scores assigned to anomalous frames than normal frames.
Due to its superior capability of capturing appearance features of image data, ResNet-50 (He et al., 2016) is used to specify the feature network $\phi$, followed by the anomaly scoring network $\eta$ built with a fully connected two-layer neural network. $\eta$ consists of a hidden layer with 100 units and an output layer with a single linear unit. Similar to (Pang et al., 2018a), $\mathcal{A}$ and $\mathcal{N}$ are initialized by some existing anomaly measures. The anomaly scoring model is then iteratively updated and enhanced in a self-training manner. The MAE-based loss function is employed in Eq. (36) to reduce the negative effects brought by false pseudo labels in $\mathcal{A}$ and $\mathcal{N}$.
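The following sketch illustrates, under simplifying assumptions, the two-class ordinal regression loss and why an absolute-error loss may be less dominated by a false pseudo label than a squared-error loss. The ordinal values `C1` and `C2`, the toy scores and the helper names are hypothetical; the feature and scoring networks are replaced by hard-coded scores.

```python
def ordinal_loss(scores, targets, kind="mae"):
    """Mean absolute (MAE) or squared (MSE) error between scores and ordinal targets."""
    errors = [s - t for s, t in zip(scores, targets)]
    if kind == "mae":
        return sum(abs(e) for e in errors) / len(errors)
    if kind == "mse":
        return sum(e * e for e in errors) / len(errors)
    raise ValueError("kind must be 'mae' or 'mse'")


def error_share(scores, targets, index, power):
    """Fraction of the total |error| ** power contributed by one instance."""
    terms = [abs(s - t) ** power for s, t in zip(scores, targets)]
    return terms[index] / sum(terms)


C1, C2 = 1.0, 0.0  # hypothetical ordinal values with C1 > C2

pseudo_anomaly_scores = [0.9, 0.8]
# The last pseudo-normal instance scores high: plausibly a false pseudo label.
pseudo_normal_scores = [0.1, 0.2, 0.95]

scores = pseudo_anomaly_scores + pseudo_normal_scores
targets = [C1] * len(pseudo_anomaly_scores) + [C2] * len(pseudo_normal_scores)

mae = ordinal_loss(scores, targets, "mae")
mse = ordinal_loss(scores, targets, "mse")
mae_share = error_share(scores, targets, index=4, power=1)
mse_share = error_share(scores, targets, index=4, power=2)
```

In this toy batch the suspicious instance accounts for a smaller fraction of the absolute-error loss than of the squared-error loss, which is one way to read the robustness argument made for the MAE-based loss.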
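Different from (Pang et al., 2020) that addresses an unsupervised setting, a weakly-supervised setting is assumed in (Pang et al., 2019b; Sultani et al., 2018). A very small number of labeled anomalies, together with large-scale unlabeled data, is assumed to be available during training in (Pang et al., 2019b). To leverage the known anomalies, the anomaly detection problem is formulated as a pairwise relation prediction task. Specifically, a two-stream ordinal regression network is devised to learn the relation of randomly sampled pairs of data instances, i.e., to discriminate whether the instance pair contains two labeled anomalies, one labeled anomaly, or just unlabeled data instances. Let $\mathcal{A}$ be the small labeled anomaly set, $\mathcal{U}$ be the large unlabeled dataset and $\mathcal{X} = \mathcal{A} \cup \mathcal{U}$; $\mathcal{P} = \big\{\{\mathbf{x}_i, \mathbf{x}_j, y_{\mathbf{x}_i\mathbf{x}_j}\} \mid \mathbf{x}_i, \mathbf{x}_j \in \mathcal{X} \text{ and } y_{\mathbf{x}_i\mathbf{x}_j} \in \mathbb{N}\big\}$ is first generated. Here $\mathcal{P}$ is a set of random instance pairs with synthetic ordinal class labels, where $\mathbf{y} = \{y_{\mathbf{x}_{a_i}\mathbf{x}_{a_j}}, y_{\mathbf{x}_{a_i}\mathbf{x}_{u_i}}, y_{\mathbf{x}_{u_i}\mathbf{x}_{u_j}}\}$ is an ordinal variable. The synthetic label $y_{\mathbf{x}_{a_i}\mathbf{x}_{u_i}}$ means an ordinal value for any instance pairs with the instances $\mathbf{x}_{a_i}$ and $\mathbf{x}_{u_i}$ respectively sampled from $\mathcal{A}$ and $\mathcal{U}$. $y_{\mathbf{x}_{a_i}\mathbf{x}_{a_j}} > y_{\mathbf{x}_{a_i}\mathbf{x}_{u_i}} > y_{\mathbf{x}_{u_i}\mathbf{x}_{u_j}}$ is predefined such that the pairwise prediction task is equivalent to anomaly score learning. The method can then be formally framed as
$$\Theta^{*} = \arg\min_{\Theta} \frac{1}{|\mathcal{P}|} \sum_{\{\mathbf{x}_i, \mathbf{x}_j, y_{\mathbf{x}_i\mathbf{x}_j}\} \in \mathcal{P}} \big| y_{\mathbf{x}_i\mathbf{x}_j} - \tau\big((\mathbf{x}_i, \mathbf{x}_j);\Theta\big) \big| \tag{37}$$
which is trainable in an end-to-end fashion. By minimizing Eq. (37), the model is optimized to learn larger anomaly scores for the input pairs that contain two anomalies than the pairs with one anomaly or none. At the evaluation stage, each test instance is paired with instances from $\mathcal{A}$ or $\mathcal{U}$ to obtain the anomaly scores.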
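A minimal sketch of the pair construction and pairwise loss described above is given below. The ordinal label values, the uniform choice among pair types, the absolute-error loss, the averaging used at evaluation time and the toy threshold-based pair scorer are all assumptions made for illustration; a real model would learn the pair scorer with a two-stream network.

```python
import random

# Hypothetical ordinal labels satisfying y_aa > y_au > y_uu.
Y_AA, Y_AU, Y_UU = 8.0, 4.0, 0.0


def sample_pairs(anomalies, unlabeled, n_pairs, seed=0):
    """Randomly sample instance pairs and attach synthetic ordinal labels."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        kind = rng.choice(["aa", "au", "uu"])
        if kind == "aa":
            pairs.append((rng.choice(anomalies), rng.choice(anomalies), Y_AA))
        elif kind == "au":
            pairs.append((rng.choice(anomalies), rng.choice(unlabeled), Y_AU))
        else:
            pairs.append((rng.choice(unlabeled), rng.choice(unlabeled), Y_UU))
    return pairs


def pairwise_loss(pairs, pair_scorer):
    """Mean absolute error between ordinal labels and predicted pair scores."""
    return sum(abs(y - pair_scorer(xi, xj)) for xi, xj, y in pairs) / len(pairs)


def anomaly_score(x, anomalies, unlabeled, pair_scorer):
    """Average pair score of a test instance paired with labeled and unlabeled data."""
    pairs = [(a, x) for a in anomalies] + [(x, u) for u in unlabeled]
    return sum(pair_scorer(a, b) for a, b in pairs) / len(pairs)


# Toy 1-D data: labeled anomalies are large values, unlabeled data is near zero.
anomalies = [9.0, 10.0]
unlabeled = [0.1, 0.2, 0.0, 0.3]


def toy_scorer(xi, xj):
    """Stand-in for a trained network: counts how many inputs look anomalous."""
    return 4.0 * ((xi > 5) + (xj > 5))


pairs = sample_pairs(anomalies, unlabeled, n_pairs=30)
loss = pairwise_loss(pairs, toy_scorer)
score_suspicious = anomaly_score(9.5, anomalies, unlabeled, toy_scorer)
score_ordinary = anomaly_score(0.1, anomalies, unlabeled, toy_scorer)
```

With a scorer that matches the ordinal labels exactly, the loss is zero and a suspicious test instance receives a higher averaged score than an ordinary one.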
The weakly-supervised setting in (Sultani et al., 2018) addresses frame-level video anomaly detection but only a set of video-level class labels is available during training, i.e., a video is normal or contains abnormal frames somewhere, but we do not know which specific frames are anomalies. A multiple instance learning (MIL)-based ranking model is introduced in (Sultani et al., 2018) to harness the high-level class labels to directly learn the anomaly score for each video segment (i.e., a small number of consecutive video frames). Its key objective is to guarantee that the maximum anomaly score for the segments in a video that contains anomalies somewhere is greater than that in a normal video. To achieve this, each video is treated as a bag of instances in MIL, the videos that contain anomalies are treated as positive bags, and the normal videos are treated as negative bags. Each video segment is an instance in the bag. The ordering information of the anomaly scores is enforced as a relative pairwise ranking order via the hinge loss function. The overall objective function is defined as
$$\arg\min_{\Theta} \sum_{\mathcal{B}_p, \mathcal{B}_n \in \mathcal{X}} \max\Big\{0,\, 1 - \max_{\mathbf{x}\in\mathcal{B}_p} \tau(\mathbf{x};\Theta) + \max_{\mathbf{x}\in\mathcal{B}_n} \tau(\mathbf{x};\Theta)\Big\} + \lambda_1 \sum_{i=1}^{|\mathcal{B}_p|-1} \big(\tau(\mathbf{x}_i;\Theta) - \tau(\mathbf{x}_{i+1};\Theta)\big)^{2} + \lambda_2 \sum_{\mathbf{x}\in\mathcal{B}_p} \tau(\mathbf{x};\Theta) \tag{38}$$
where $\mathbf{x}$ is a video segment, $\mathcal{B}$ contains a bag of video segments, and $\mathcal{B}_p$ and $\mathcal{B}_n$ respectively represent positive and negative bags. The first term is to guarantee the relative anomaly score order, i.e., the anomaly score of the most abnormal video segment in the positive instance bag is greater than that in the negative instance bag. The second and the last terms are two extra optimization constraints, in which the former one enforces score smoothness between consecutive video segments while the latter one enforces anomaly sparsity, i.e., each video contains only a few abnormal segments.
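The three terms of the MIL ranking objective can be sketched for a single pair of bags as follows. The segment scores are hard-coded stand-ins for the output of a scoring network, and the function name and the default weights `lambda_smooth` and `lambda_sparse` are hypothetical.

```python
def mil_ranking_loss(pos_scores, neg_scores, lambda_smooth=1.0, lambda_sparse=1.0):
    """Hinge ranking loss on bag maxima plus smoothness and sparsity terms.

    pos_scores: anomaly scores of consecutive segments in a positive bag.
    neg_scores: anomaly scores of segments in a negative (normal) bag.
    """
    # Ranking term: the top-scoring positive segment should beat the
    # top-scoring negative segment by a margin of 1.
    hinge = max(0.0, 1.0 - max(pos_scores) + max(neg_scores))
    # Smoothness term: scores of consecutive segments should change gradually.
    smooth = sum((a - b) ** 2 for a, b in zip(pos_scores, pos_scores[1:]))
    # Sparsity term: only a few segments in a video should score high.
    sparse = sum(pos_scores)
    return hinge + lambda_smooth * smooth + lambda_sparse * sparse


pos_bag = [0.1, 0.9, 0.2]   # video labeled as containing an anomaly somewhere
neg_bag = [0.2, 0.3]        # normal video

ranking_only = mil_ranking_loss(pos_bag, neg_bag, lambda_smooth=0.0, lambda_sparse=0.0)
full_loss = mil_ranking_loss(pos_bag, neg_bag)
```

Only the bag-level maxima enter the ranking term, which is what lets video-level labels supervise segment-level scores.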
Advantages. The advantages of deep ranking model-based methods are as follows. (i) The anomaly scores can be optimized directly with adapted loss functions. (ii) They are generally free from the definitions of anomalies by imposing a weak assumption of the ordinal order between anomaly and normal instances. (iii) This approach may build upon well-established ranking techniques and theories from areas like learning to rank (Liu et al., 2009; Joachims et al., 2017; Liu et al., 2018c).
Disadvantages. Their disadvantages are as follows. (i) At least some form of labeled anomalies are required in these methods, which may not be applicable to applications where such labeled anomalies are not available. The method in (Pang et al., 2020) is fully unsupervised and obtains some promising performance but there is still a large gap compared to semi-supervised methods. (ii) Since the models are exclusively fitted to detect the few labeled anomalies, they may not be able to generalize to unseen anomalies that exhibit different abnormal features to the labeled anomalies.
Challenges Targeted: Using weak supervision such as pseudo labels or noisy class labels provides some important knowledge of suspicious anomalies, enabling the learning of a more expressive low-dimensional representation space and better detection accuracy (CH1, CH2). The MIL scheme (Sultani et al., 2018) and the pairwise relation prediction (Pang et al., 2019b) provide an easy way to incorporate coarse-grained/limited anomaly labels into detection model learning (CH3). More importantly, the end-to-end anomaly score learning offers straightforward anomaly explanation by backpropagating the activation weights or the gradient of anomaly scores to locate the features that are responsible for large anomaly scores (Pang et al., 2020) (CH6). In addition, the methods in (Pang et al., 2019b, 2020) also work well in data with anomaly contamination or noisy labels (CH4).
This approach uses a prior distribution to encode and drive the anomaly score learning. Since the anomaly scores are learned in an end-to-end manner, the prior may be imposed on either the internal module or the learning output (i.e., anomaly scores) of the score learning function $\tau$.
Assumptions. The imposed prior captures the underlying (ab)normality of the dataset.
The incorporation of the prior into the internal anomaly scoring function is exemplified by a recent study on the Bayesian inverse reinforcement learning-based sequential anomaly detection (Oh and Iyengar, 2019). The key intuition of the idea is that given an agent that takes a set of sequential data as input, the agent’s normal behavior can be understood by its latent reward function, and thus a test sequence is identified as an anomaly if the agent assigns a low reward to the sequence. Inverse reinforcement learning (IRL) approaches (Ng and Russell, 2000) are used to infer the reward function. To learn the reward function more efficiently, a sample-based IRL approach is used. Specifically, the IRL problem is formulated as the following posterior optimization problem
$$\max_{\Theta} \mathbb{E}_{\mathbf{s}\sim\mathcal{S}} \big[\log p(\mathbf{s}|\Theta) + \log p(\Theta)\big] \tag{39}$$
where $p(\mathbf{s}|\Theta) = \frac{1}{Z} \exp\big(\sum_{(o,a)\in\mathbf{s}} \tau_{\Theta}(o,a)\big)$, $\tau_{\Theta}(o,a)$ is a latent reward function parameterized by $\Theta$, $(o,a)$ is a pair of state and action in the sequence $\mathbf{s}$, $Z$ represents the partition function which is the integral of $\exp\big(\sum_{(o,a)\in\mathbf{s}} \tau_{\Theta}(o,a)\big)$ over all the sequences consistent with the underlying Markov decision process dynamics, $p(\Theta)$ is a prior distribution of $\Theta$, and $\mathcal{S}$ is a set of observed sequences. Since the inverse of the reward yielded by $\tau_{\Theta}$ is used as the anomaly score, maximizing Eq. (39) is equivalent to directly learning the anomaly scores.
At the training stage, a Gaussian prior distribution over the weight parameters of the reward function learning network is assumed, i.e., $\Theta \sim \mathcal{N}(0, \sigma^{2})$. The partition function $Z$ is estimated using a set of sequences generated by a sample-generating policy $\pi$,
(40) |
The policy $\pi$ is also represented as a neural network. $\tau_{\Theta}$ and $\pi$ are alternately optimized, i.e., to optimize the reward function $\tau_{\Theta}$ with a fixed policy $\pi$ and to optimize $\pi$ with the updated reward function $\tau_{\Theta}$. Note that $\tau_{\Theta}$ is instantiated with a bootstrap neural network with multiple output heads in (Oh and Iyengar, 2019); Eq. (39) presents a simplified $\tau_{\Theta}$ for brevity.
The idea of enforcing a prior on the anomaly scores is explored in (Pang et al., 2019a). Motivated by the extensive empirical results in (Kriegel et al., 2011) that show that the anomaly scores in a variety of real-world data sets fit a Gaussian distribution very well, the work uses a Gaussian prior to encode the anomaly scores and enable the direct optimization of the scores. That is, it is assumed that the anomaly scores of normal instances are clustered together while those of anomalies deviate far away from this cluster. The prior is leveraged to define the following loss function, called deviation loss, which is built upon the well-known contrastive loss (Hadsell et al., 2006).
$$dev(\mathbf{x}) = \frac{\tau(\mathbf{x};\Theta) - \mu_b}{\sigma_b} \tag{41}$$
$$\ell = (1 - y_{\mathbf{x}})\,|dev(\mathbf{x})| + y_{\mathbf{x}} \max\big(0,\, m - dev(\mathbf{x})\big) \tag{42}$$
where $\mu_b$ and $\sigma_b$ are respectively the estimated mean and standard deviation of the Gaussian prior $\mathcal{N}(\mu, \sigma)$, $y_{\mathbf{x}} = 1$ if $\mathbf{x}$ is an anomaly and $y_{\mathbf{x}} = 0$ if $\mathbf{x}$ is a normal object, and $m$ is equivalent to a Z-Score confidence interval parameter. $\mu_b$ and $\sigma_b$ are estimated using a set of values $\{r_1, r_2, \cdots, r_l\}$ drawn from $\mathcal{N}(\mu, \sigma)$ for each batch of instances to learn robust representations of normality and abnormality.
The detection model is driven by the deviation loss to push the anomaly scores of normal instances as close as possible to $\mu_b$ while guaranteeing at least $m$ standard deviations between $\mu_b$ and the anomaly scores of anomalies. When $\mathbf{x}$ is an anomaly and it has a negative $dev(\mathbf{x})$, the loss would be particularly large, resulting in large positive deviations for all anomalies. As a result, the deviation loss is equivalent to enforcing a statistically significant deviation of the anomaly score of the anomalies from that of normal instances in the upper tail. In addition to the end-to-end anomaly score learning, this Gaussian prior-driven loss also results in well interpretable anomaly scores, i.e., given any anomaly score $\tau(\mathbf{x})$, we can use the Z-score confidence interval $\mu \pm z_p \sigma$ to explain the abnormality of the instance $\mathbf{x}$. This is an important and very practical property that existing methods do not have.
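The deviation loss can be sketched in a few lines. The code below is an illustrative approximation only: the function names, the number of reference draws, the standard normal prior, the default margin of 5 and the averaging over the batch are assumptions, and the anomaly scores are given directly rather than produced by a network.

```python
import random
import statistics


def estimate_prior_stats(n_ref=5000, mu=0.0, sigma=1.0, seed=0):
    """Estimate the reference mean and std from draws of the Gaussian prior."""
    rng = random.Random(seed)
    ref = [rng.gauss(mu, sigma) for _ in range(n_ref)]
    return statistics.fmean(ref), statistics.stdev(ref)


def deviation_loss(scores, labels, mu_b, sigma_b, margin=5.0):
    """Mean deviation loss over a batch.

    labels: 1 for a labeled anomaly, 0 for an instance treated as normal.
    Normal scores are pulled toward mu_b; anomaly scores are pushed to at
    least `margin` standard deviations above mu_b.
    """
    total = 0.0
    for s, y in zip(scores, labels):
        dev = (s - mu_b) / sigma_b
        total += (1 - y) * abs(dev) + y * max(0.0, margin - dev)
    return total / len(scores)


mu_b, sigma_b = estimate_prior_stats()

# With exact prior statistics (0, 1) the arithmetic is easy to follow.
satisfied = deviation_loss([0.0, 6.0], [0, 1], 0.0, 1.0)    # both terms are zero
violated = deviation_loss([1.0, 2.0], [0, 1], 0.0, 1.0)     # (1 + 3) / 2
negative_dev = deviation_loss([-1.0], [1], 0.0, 1.0)        # 5 - (-1)
```

An anomaly whose score falls below the reference mean incurs a loss larger than the margin itself, which is the behavior described above for negative deviations.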
Advantages. The advantages of prior-driven models are as follows. (i) The anomaly scores can be directly optimized w.r.t. a given prior. (ii) It provides a flexible framework for incorporating different prior distributions into the anomaly score learning. Different Bayesian deep learning techniques (Wang and Yeung, 2016) may be adapted for anomaly detection. (iii) The prior can also result in more interpretable anomaly scores than the other methods.
Disadvantages. Their disadvantages are as follows. (i) It is difficult, if not impossible, to design a universally effective prior for different anomaly detection application scenarios. (ii) The models may work less effectively if the prior does not fit the underlying distribution well.
Challenges Targeted: The prior empowers the models to learn informed low-dimensional representations of different complex data such as high-dimensional data and sequential data (CH1 & CH2). By imposing a prior over anomaly scores, the deviation network method (Pang et al., 2019a) shows promising performance in leveraging a limited amount of labeled anomaly data to enhance the representations of normality and abnormality, substantially boosting the detection recall (CH1 & CH3). The detection models here are driven by a prior distribution w.r.t. anomaly scoring function and work well in data with anomaly contamination in the training data (CH4).
This type of approach aims at learning anomaly scores by maximizing the likelihood of events in the training data. Since anomaly and normal instances respectively correspond to rare and frequent patterns, from the probabilistic perspective, normal instances are presumed to be high-probability events whereas anomalies are prone to be low-probability events. Therefore, the negative of the event likelihood can be naturally defined as anomaly score.
Assumptions. Anomalies and normal instances are respectively low- and high-probability events.
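The idea of learning anomaly scores by directly modeling the event likelihood is introduced in (Chen et al., 2016). Particularly, the problem is framed as
$$\Theta^{*} = \arg\max_{\Theta} \sum_{\mathbf{x}\in\mathcal{X}} \log p(\mathbf{x};\Theta) \tag{43}$$
where $p(\mathbf{x};\Theta)$ is the probability of the instance $\mathbf{x}$ (i.e., an event in the event space) with the parameters $\Theta$ to be learned. To ease the optimization, $p(\mathbf{x};\Theta)$ is modeled with a softmax function:
$$p(\mathbf{x};\Theta) = \frac{\exp\big(\tau(\mathbf{x};\Theta)\big)}{\sum_{\tilde{\mathbf{x}}\in\mathcal{X}} \exp\big(\tau(\tilde{\mathbf{x}};\Theta)\big)} \tag{44}$$
where $\tau(\mathbf{x};\Theta)$ is an anomaly scoring function designed to capture pairwise feature interactions:
$$\tau(\mathbf{x};\Theta) = \sum_{i,j\in\{1,2,\cdots,K\}} w_{ij}\, \mathbf{z}_i^{\top} \mathbf{z}_j \tag{45}$$
where $\mathbf{z}_i$ is a low-dimensional embedding of the $i$th feature value of $\mathbf{x}$ in the representation space $\mathcal{Z}$, $w_{ij}$ is the weight added to the interaction and is a trainable parameter. Since $\sum_{\tilde{\mathbf{x}}\in\mathcal{X}} \exp\big(\tau(\tilde{\mathbf{x}};\Theta)\big)$ is a normalization term, learning the likelihood function $p$ is equivalent to directly optimizing the anomaly scoring function $\tau$. Since the computation of this explicit normalization term is prohibitively costly, the well-established noise-contrastive estimation (NCE) (Gutmann and Hyvärinen, 2010) is used in (Chen et al., 2016) to learn the following approximated likelihood
$$\log p(d=1|\mathbf{x};\Theta) + \sum_{j=1}^{k} \log p(d=0|\mathbf{x}'_j;\Theta) \tag{46}$$
where $p(d=1|\mathbf{x};\Theta) = \frac{\exp(\tau(\mathbf{x};\Theta))}{\exp(\tau(\mathbf{x};\Theta)) + kQ(\mathbf{x})}$ and $p(d=0|\mathbf{x}';\Theta) = \frac{kQ(\mathbf{x}')}{\exp(\tau(\mathbf{x}';\Theta)) + kQ(\mathbf{x}')}$; for each instance $\mathbf{x}$, $k$ noise samples $\mathbf{x}'_{1,\cdots,k} \sim Q$ are generated from some synthetic known ‘noise’ distribution $Q$. In (Chen et al., 2016), a context-dependent method is used to generate the negative samples by univariate extrapolation of the observed instance $\mathbf{x}$.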
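To illustrate how pairwise interactions and the softmax likelihood fit together, the sketch below scores three toy events by the weighted inner products of their feature-value embeddings and normalizes the scores over a tiny, fully enumerated event space. The embeddings, the weight matrix, the restriction to distinct pairs with i < j, and the use of the negative log-likelihood as the anomaly score are assumptions for illustration; with a realistic event space the normalization would be approximated (e.g., by NCE) rather than enumerated.

```python
import math


def interaction_score(embeddings, weights):
    """Weighted sum of inner products over distinct pairs of feature embeddings."""
    k = len(embeddings)
    score = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            dot = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            score += weights[i][j] * dot
    return score


def softmax_probabilities(scores):
    """Numerically stable softmax over the scores of all events."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Each event has two categorical feature values, each with a 2-D embedding.
events = [
    [[1.0, 0.0], [1.0, 0.0]],    # feature values that agree
    [[1.0, 0.0], [0.0, 1.0]],    # unrelated feature values
    [[1.0, 0.0], [-1.0, 0.0]],   # feature values that conflict
]
weights = [[0.0, 1.0], [0.0, 0.0]]  # hypothetical interaction weights w_ij

scores = [interaction_score(e, weights) for e in events]
probs = softmax_probabilities(scores)
anomaly_scores = [-math.log(p) for p in probs]
```

The event with conflicting feature values receives the lowest likelihood and hence the largest anomaly score.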
The method is primarily designed to detect anomalies in categorical data (Chen et al., 2016). Motivated by this application, a similar objective function is adapted to detect abnormal events in heterogeneous attributed bipartite graphs (Fan et al., 2018). The problem in (Fan et al., 2018) is to detect anomalous paths that span both partitions of the bipartite graph. Therefore, $\mathbf{x}$ in Eq. (45) is a graph path containing a set of heterogeneous graph nodes, with $\mathbf{z}_i$ and $\mathbf{z}_j$ being the representations of every pair of the nodes in the path. To map attributed nodes into the representation space $\mathcal{Z}$, multilayer perceptron networks and autoencoders are respectively applied to the node features and the graph topology.
Advantages. The advantages of softmax model-based methods are as follows. (i) Different types of interactions can be incorporated into the anomaly score learning process. (ii) The anomaly scores are faithfully optimized w.r.t. the specific abnormal interactions we aim to capture.
Disadvantages. Their disadvantages are as follows. (i) The computation of the interactions can be very costly when the number of features/elements in each data instance is large, i.e., we have $O(D^{n})$ time complexity per instance for $n$-th order interactions of $D$ features/elements. (ii) The anomaly score learning is heavily dependent on the quality of the generation of negative samples.
Challenges Targeted: The formulation in this category of methods provides a promising way to learn low-dimensional representations of datasets with heterogeneous data sources (CH2 & CH5). The learned representations often capture more normality/abnormality information from different data sources and thus enable better detection than traditional methods (CH1).
This category of approaches aims to train a one-class classifier that learns to discriminate whether a given instance is normal or not in an end-to-end manner. Different from the methods in Section 5.2.2, this group of methods does not rely on any existing one-class classification measures such as one-class SVM or SVDD; it directly learns a one-class discrimination model. This approach emerges mainly due to the marriage of GANs and the concept of one-class classification, i.e., adversarially learned one-class classification. The key idea is to learn a one-class discriminator of the normal instances so that it discriminates those instances well from some adversarially generated pseudo anomalies. Note that this approach is also very different from the GAN-based methods in Section 5.1.2 due to two key differences. First, the GAN-based methods aim to learn a generative distribution to maximally approximate the real data distribution, achieving a generative model that well captures the normality of the training normal instances; while the methods in this section aim to optimize a discriminative model to separate training normal instances from adversarially generated fringe instances. Second, the GAN-based methods define the anomaly scores based on the residual between the real instances and the corresponding generated instances, whereas the methods here directly use the discriminative model to classify anomalies, i.e., the discriminator acts as $\tau$ in Eq. (34). This section is separated from Sections 5.1.2 and 5.2.2 to highlight the above differences.
Assumptions. Two basic assumptions of this approach are as follows. (i) Data instances that approximate anomalies can be effectively synthesized. (ii) All normal instances can be summarized by a discriminative one-class model.
The idea of adversarially learned one-class (ALOCC) classification is first studied in (Sabokrou et al., 2018). The key idea is to train two deep networks, with one network trained as the one-class model to separate normal instances from anomalies while the other network is trained to enhance the normal instances and generate distorted outliers. The two networks are instantiated and optimized through the GANs approach. The one-class model is built upon the discriminator network and the generator network is based on a denoising AE (Vincent et al., 2010). The objective of the AE-empowered GAN is defined as
$$\min_{AE} \max_{D} V(D, AE) = \mathbb{E}_{\mathbf{x}\sim p_{\mathcal{X}}} \big[\log D(\mathbf{x})\big] + \mathbb{E}_{\hat{\mathbf{x}}\sim p_{\hat{\mathcal{X}}}} \big[\log\big(1 - D(AE(\hat{\mathbf{x}}))\big)\big] \tag{47}$$
where $p_{\hat{\mathcal{X}}}$ denotes a data distribution of $\mathcal{X}$ corrupted by a Gaussian noise, i.e., $\hat{\mathbf{x}} = \mathbf{x} + \mathbf{n}$ with $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \sigma^{2}\mathbf{I})$. This objective is jointly optimized with the following data reconstruction error in $AE$.
$$\ell_{ae} = \lVert \mathbf{x} - AE(\hat{\mathbf{x}}) \rVert^{2} \tag{48}$$
The intuition in Eq. (47) is that $AE$ can well reconstruct (and even enhance) normal instances, but it can be confused by input outliers and consequently generates distorted outliers. Through the minimax optimization, the discriminator $D$ learns to better discriminate normal instances from the outliers than using the original data instances. Thus, $D(AE(\hat{\mathbf{x}}))$ can be directly used to detect anomalies. In (Sabokrou et al., 2018) the outliers are randomly drawn from some classes other than the classes where the normal instances come from.
However, reference outliers beyond the given training data, as used in (Sabokrou et al., 2018), may be unavailable in many domains. Instead of taking random outliers from other datasets, we can generate fringe data instances based on the given training data and use them as negative reference instances to enable the training of the one-class discriminator. This idea is explored in (Zheng et al., 2019; Ngo et al., 2019). One-class adversarial networks (OCAN) is introduced in (Zheng et al., 2019) to leverage the idea of bad GANs (Dai et al., 2017) to generate fringe instances based on the distribution of the normal training data. Unlike conventional generators in GANs, the generator network in bad GANs is trained to generate data instances that are complementary, rather than matching, to the training data. The objective of the complement generator is as follows
$$\min_{G} -\mathcal{H}(p_{\mathcal{Z}}) + \mathbb{E}_{\hat{\mathbf{x}}\sim p_{\mathcal{Z}}} \log p_{\mathcal{X}}(\hat{\mathbf{x}})\, \mathbb{I}\big[p_{\mathcal{X}}(\hat{\mathbf{x}}) > \epsilon\big] + \big\lVert \mathbb{E}_{\hat{\mathbf{x}}\sim p_{\mathcal{Z}}} h(\hat{\mathbf{x}}) - \mathbb{E}_{\mathbf{x}\sim p_{\mathcal{X}}} h(\mathbf{x}) \big\rVert^{2} \tag{49}$$
where $\mathcal{H}(\cdot)$ is the entropy, $\mathbb{I}[\cdot]$ is an indicator function, $\epsilon$ is a threshold hyperparameter, and $h$ is a feature mapping derived from an intermediate layer of the discriminator. The first two terms are devised to generate low-density instances in the original feature space. However, it is computationally infeasible to obtain the probability distribution of the training data. Instead the density estimation $p_{\mathcal{X}}(\hat{\mathbf{x}})$ is approximated by the discriminator of a regular GAN. The last term is the widely-used feature matching loss that helps better generate data instances within the original data space. The objective of the discriminator in OCAN is enhanced with an extra conditional entropy term to enable the detection with high confidence:
$$\max_{D} \mathbb{E}_{\mathbf{x}\sim p_{\mathcal{X}}} \big[\log D(\mathbf{x})\big] + \mathbb{E}_{\hat{\mathbf{x}}\sim p_{\mathcal{Z}}} \big[\log\big(1 - D(\hat{\mathbf{x}})\big)\big] + \mathbb{E}_{\mathbf{x}\sim p_{\mathcal{X}}} \big[D(\mathbf{x}) \log D(\mathbf{x})\big] \tag{50}$$
In (Ngo et al., 2019), Fence GAN is introduced with the objective to generate data instances tightly lying at the boundary of the distribution of the training data. This is achieved by introducing two loss functions into the generator that enforce the generated instances to be evenly distributed along a sphere boundary of the training data. Formally, the objective of the generator is defined as
$$\min_{G} \mathbb{E}_{\mathbf{z}\sim p_{\mathcal{Z}}} \Big[\log \big|\alpha - D(G(\mathbf{z}))\big|\Big] + \beta\, \frac{1}{\mathbb{E}_{\mathbf{z}\sim p_{\mathcal{Z}}} \lVert G(\mathbf{z}) - \boldsymbol{\mu} \rVert_{2}} \tag{51}$$
where $\alpha \in (0, 1)$ is a hyperparameter used as a discrimination reference score for the generator to generate the boundary instances, $\beta$ weights the second term, and $\boldsymbol{\mu}$ is the center of the generated data instances. The first term is called encirclement loss that enforces the generated instances to have the same discrimination score, ideally resulting in instances tightly enclosing the training data. The second term is called dispersion loss that enforces the generated instances to evenly cover the whole boundary.
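The two generator terms described above can be sketched as follows, with discriminator outputs and generated points given as plain lists. The function names, the default values of `alpha` and `beta`, and the small constants guarding the logarithm and the division are assumptions for illustration and are not taken from (Ngo et al., 2019).

```python
import math


def encirclement_loss(disc_scores, alpha, eps=1e-8):
    """Mean log|alpha - D(G(z))|; smallest when all scores sit at alpha."""
    return sum(math.log(abs(alpha - d) + eps) for d in disc_scores) / len(disc_scores)


def dispersion_loss(generated):
    """Inverse of the mean distance of generated points to their own center."""
    n = len(generated)
    dim = len(generated[0])
    center = [sum(g[k] for g in generated) / n for k in range(dim)]
    mean_dist = sum(math.dist(g, center) for g in generated) / n
    return 1.0 / max(mean_dist, 1e-12)


def generator_loss(disc_scores, generated, alpha=0.5, beta=1.0):
    """Encirclement loss plus a weighted dispersion loss."""
    return encirclement_loss(disc_scores, alpha) + beta * dispersion_loss(generated)


# Generated points spread around a boundary versus clumped near one spot.
spread = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
clumped = [[0.1, 0.0], [-0.1, 0.0], [0.0, 0.1], [0.0, -0.1]]

on_boundary = encirclement_loss([0.5, 0.5], alpha=0.5)
off_boundary = encirclement_loss([0.9, 0.1], alpha=0.5)
```

Scores at the reference value give a much lower encirclement loss than scores far from it, and widely spread points give a lower dispersion loss than clumped points.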
There have been some other methods introduced to effectively generate the reference instances. For example, uniformly distributed instances can be generated to enforce the normal instances to be distributed uniformly across the latent space (Perera et al., 2019); an ensemble of generators is used in (Liu et al., 2019), with each generator synthesizing boundary instances for one specific cluster of normal instances.
Advantages. The advantages of this category of methods are as follows. (i) Its anomaly classification model is adversarially optimized in an end-to-end fashion. (ii) It can be developed and supported by the rich techniques and theories of adversarial learning and one-class classification.
Disadvantages. Their disadvantages are as follows. (i) It is difficult to guarantee that the generated reference instances well resemble the unknown anomalies. (ii) The instability of GANs may result in the generation of instances with diverse quality, leading to unstable discriminator-based anomaly classification performance. This issue is recently studied in (Zaheer et al., 2020), which shows that the performance of this type of anomaly detectors can fluctuate drastically in different training steps. (iii) Its applications are limited to semi-supervised anomaly detection scenarios.
Challenges Targeted: The adversarially learned one-class classifiers learn to generate realistic fringe/boundary instances, enabling the learning of expressive low-dimensional normality representations (CH1 & CH2).
To gain a more in-depth understanding of methods in this area, we present a summary of key characteristics of representative algorithms from each category of approach in Table 1. Some main observations include: (i) most methods operate in an unsupervised or semi-supervised mode; (ii) deep learning tricks like data augmentation, dropout and pre-training are under-explored; (iii) the network architecture used in most of the methods is not that deep, with a majority of the methods having no more than five network layers; (iv) ReLU or leaky ReLU is the most popular activation function; and (v) deep learning can be leveraged to detect anomalies in different types of input data. The source code of most of these representative algorithms is publicly accessible. We summarize the sources of the code in Table 2 to facilitate access.
Method | Ref. (Section) | Supervision | Objective | Data Aug. | Dropout | Pre-training | Architecture | Activation | # Layers | Loss | Data |
---|---|---|---|---|---|---|---|---|---|---|---|
OADA | (Ionescu et al., 2019) (4) | Semi | Reconstruction | Yes | No | No | AE | ReLU | 3 | MSE | Video |
Replicator | (Hawkins et al., 2002) (5.1.1) | Unsup. | Reconstruction | No | No | No | AE | Tanh | 2 | MSE | Tabular |
RandNet | (Chen et al., 2017) (5.1.1) | Unsup. | Reconstruction | No | Yes | Yes | AE | ReLU | 3 | MSE | Tabular |
RDA | (Zhou and Paffenroth, 2017) (5.1.1) | Semi | Reconstruction | No | No | No | AE | Sigmoid | 2 | MSE | Tabular |
UODA | (Lu et al., 2017) (5.1.1) | Semi | Reconstruction | No | No | Yes | AE & RNN | Sigmoid | 4 | MSE | Sequence |
AnoGAN | (Schlegl et al., 2017) (5.1.2) | Semi | Generative | No | No | No | Conv. | ReLU | 4 | MAE | Image |
EBGAN | (Zenati et al., 2018a) (5.1.2) | Semi | Generative | No | No | No | Conv. & MLP | ReLU/lReLU | 3-4 | GAN | Image & Tabular |
FFP | (Liu et al., 2018b) (5.1.3) | Semi | Predictive | Yes | No | Yes | Conv. | ReLU | 10 | MAE/MSE | Video |
LSA | (Abati et al., 2019) (5.1.3) | Semi | Predictive | No | No | No | Conv. | lReLU | 4-7 | MSE & KL | Video |
GT | (Golan and El-Yaniv, 2018) (5.1.4) | Semi | Classification | Yes | Yes | No | Conv. | ReLU | 10-16 | CE | Image |
EOutlier | (Wang et al., 2019) (5.1.4) | Semi | Classification | Yes | Yes | No | Conv. | ReLU | 10 | CE | Image |
REPEN | (Pang et al., 2018a) (5.2.1) | Unsup. | Distance | No | No | No | MLP | ReLU | 1 | Hinge | Tabular |
RDP | (Wang et al., 2020) (5.2.1) | Unsup. | Distance | No | No | No | MLP | lReLU | 1 | MSE | Tabular |
AE-1SVM | (Nguyen and Vien, 2018) (5.2.2) | Unsup. | One-class | No | No | No | AE & Conv. | Sigmoid | 2-5 | Hinge | Tabular & Image |
DeepOC | (Wu et al., 2019) (5.2.2) | Semi | One-class | No | No | No | 3D Conv. | ReLU | 5 | Hinge | Video |
Deep SVDD | (Ruff et al., 2018) (5.2.2) | Semi | One-class | No | No | Yes | Conv. | lReLU | 3-4 | Hinge | Image |
Deep SAD | (Ruff et al., 2020) (5.2.2) | Semi | One-class | No | No | Yes | Conv. & MLP | lReLU | 3-4 | Hinge | Image & Tabular |
DEC | (Xie et al., 2016) (5.2.3) | Unsup. | Clustering | No | Yes | Yes | MLP | ReLU | 4 | KL | Image & Tabular |
DAGMM | (Zong et al., 2018) (5.2.3) | Unsup. | Clustering | No | Yes | No | AE & MLP | Tanh | 4-6 | Likelihood | Tabular |
SDOR | (Pang et al., 2020) (6.1) | Unsup. | Anomaly scores | No | No | Yes | ResNet & MLP | ReLU | 50 + 2 | MAE | Video |
PReNet | (Pang et al., 2019b) (6.1) | Weak | Anomaly scores | Yes | No | No | MLP | ReLU | 2-4 | MAE | Tabular |
MIL | (Sultani et al., 2018) (6.1) | Weak | Anomaly scores | No | Yes | Yes | 3D Conv. & MLP | ReLU | 18/34 + 3 | Hinge | Video |
PUP | (Oh and Iyengar, 2019) (6.2) | Unsup. | Anomaly scores | No | No | No | MLP | ReLU | 3 | Likelihood | Sequence |
DevNet | (Pang et al., 2019a) (6.2) | Weak | Anomaly scores | No | No | No | MLP | ReLU | 2-4 | Deviation | Tabular |
APE | (Chen et al., 2016) (6.3) | Unsup. | Anomaly scores | No | No | No | MLP | Sigmoid | 3 | Softmax | Tabular |
AEHE | (Fan et al., 2018) (6.3) | Unsup. | Anomaly scores | No | No | No | AE & MLP | ReLU | 4 | Softmax | Graph |
ALOCC | (Sabokrou et al., 2018) (6.4) | Semi | Anomaly scores | Yes | No | No | AE & CNN | lReLU | 5 | GANs | Image |
OCAN | (Zheng et al., 2019) (6.4) | Semi | Anomaly scores | No | No | Yes | LSTM-AE & MLP | ReLU | 4 | GANs | Sequence |
Fence GAN | (Ngo et al., 2019) (6.4) | Semi | Anomaly scores | No | Yes | No | Conv. & MLP | lReLU/Sigmoid | 4-5 | GANs | Image & Tabular |
OCGAN | (Perera et al., 2019) (6.4) | Semi | Anomaly scores | No | No | No | Conv. | ReLU/Tanh | 3 | GANs | Image |
Method | API | Link | Section |
---|---|---|---|
RDA (Zhou and Paffenroth, 2017) | Tensorflow | https://git.io/JfYG5 | 5.1.1 |
AnoGAN (Schlegl et al., 2017) | Tensorflow | https://git.io/JfGgc | 5.1.2 |
Fast AnoGAN (Schlegl et al., 2019) | Tensorflow | https://git.io/JfZRn | 5.1.2 |
EBGAN (Zenati et al., 2018a) | Keras | https://git.io/JfGgG | 5.1.2 |
ALAD (Zenati et al., 2018b) | Keras | https://git.io/JfZ8v | 5.1.2 |
GANomaly (Akcay et al., 2018) | PyTorch | https://git.io/JfGgn | 5.1.2 |
FFP (Liu et al., 2018b) | Tensorflow | https://git.io/Jf4pc | 5.1.3 |
LSA (Abati et al., 2019) | Torch | https://git.io/Jf4pW | 5.1.3 |
GT (Golan and El-Yaniv, 2018) | Keras | https://git.io/JfZRW | 5.1.4 |
EOutlier (Wang et al., 2019) | PyTorch | https://git.io/Jf4pl | 5.1.4 |
REPEN (Pang et al., 2018a) | Keras | https://git.io/JfZRg | 5.2.1 |
RDP (Wang et al., 2020) | PyTorch | https://git.io/RDP | 5.2.1 |
AE-1SVM (Nguyen and Vien, 2018) | Tensorflow | https://git.io/JfGgl | 5.2.2 |
OC-NN (Chalapathy et al., 2018) | Keras | https://git.io/JfGgZ | 5.2.2 |
Deep SVDD (Ruff et al., 2018) | Tensorflow | https://git.io/JfZRR | 5.2.2 |
Deep SAD (Ruff et al., 2020) | PyTorch | https://git.io/JfOkr | 5.2.2 |
DAGMM (Zong et al., 2018) | PyTorch | https://git.io/JfZR0 | 5.2.3 |
MIL (Sultani et al., 2018) | Keras | https://git.io/JfZRz | 6.1 |
DevNet (Pang et al., 2019a) | Keras | https://git.io/JfZRw | 6.2 |
ALOCC (Sabokrou et al., 2018) | Tensorflow | https://git.io/Jf4p4 | 6.4 |
OCAN (Zheng et al., 2019) | Tensorflow | https://git.io/JfYGb | 6.4 |
Fence GAN (Ngo et al., 2019) | Keras | https://git.io/Jf4pR | 6.4 |
OCGAN (Perera et al., 2019) | MXNet | https://git.io/Jf4p0 | 6.4 |
Table 2: Links to Access 23 Open-source Algorithms.
One main obstacle to the development of anomaly detection is the lack of real-world datasets with real anomalies. For this reason, many studies (e.g., (Zhou and Paffenroth, 2017; Zenati et al., 2018a; Akcay et al., 2018; Golan and El-Yaniv, 2018; Wang et al., 2019; Ruff et al., 2018; Ngo et al., 2019)) evaluate their methods on datasets converted from popular classification data. This approach may fail to reflect the performance of the methods in real-world anomaly detection applications, as the characteristics of anomalies in the converted data can differ from those of real anomalies in practice. We summarize a collection of 21 publicly available real-world datasets with real anomalies in Table 3 to promote performance evaluation on these datasets. The datasets cover a wide range of popular application domains and a variety of data types. Only large-scale and/or high-dimensional complex datasets are included here to provide challenging testbeds for deep anomaly detection methods.
Domain | Data | Size | Dimension | Anomaly (%) | Type | Reference |
---|---|---|---|---|---|---|
Intrusion detection | KDD Cup 99 (Bache and Lichman, 2013) | 4,091-567,497 | 41 | 0.30%-7.70% | Tabular | (Hawkins et al., 2002; Zong et al., 2018; Nguyen and Vien, 2018; Ngo et al., 2019) |
Intrusion detection | UNSW-NB15 (Moustafa and Slay, 2015) | 257,673 | 49 | 9.71% | Streaming | (Pang et al., 2019b, a) |
Excitement prediction | KDD Cup 14 | 619,326 | 10 | 6.00% | Tabular | (Pang et al., 2019b, a) |
Dropout prediction | KDD Cup 15 | 35,091 | 27 | 0.10%-0.40% | Sequence | (Lu et al., 2017) |
Malicious URLs detection | URL (Ma et al., 2009) | 2.4m | 3.2m | 33.04% | Streaming | (Pang et al., 2018a) |
Spam detection | Webspam (Webb et al., 2006) | 350,000 | 16.6m | 39.61% | Tabular/text | (Pang et al., 2018a) |
Fraud detection | Credit-card-fraud (Dal Pozzolo et al., 2017) | 284,807 | 30 | 0.17% | Streaming | (Zheng et al., 2019; Pang et al., 2019b, a) |
Vandal detection | UMDWikipedia (Kumar et al., 2015) | 34,210 | N/A | 50.00% | Sequence | (Zheng et al., 2019) |
Mutant activity detection | p53 Mutants (Danziger et al., 2009) | 16,772 | 5,408 | 0.48% | Tabular | (Pang et al., 2018a) |
Internet ads detection | AD (Bache and Lichman, 2013) | 3,279 | 1,555 | 14.00% | Tabular | (Pang et al., 2018a) |
Disease detection | Thyroid (Bache and Lichman, 2013) | 7,200 | 21 | 7.40% | Tabular | (Pang et al., 2019b, a; Zong et al., 2018; Ruff et al., 2020) |
Disease detection | Arrhythmia (Bache and Lichman, 2013) | 452 | 279 | 14.60% | Tabular | (Pang et al., 2015; Zong et al., 2018; Ruff et al., 2020) |
Defect detection | MVTec AD | 5,354 | N/A | 35.26% | Image | (Bergmann et al., 2019) |
Video surveillance | UCSD Ped 1 (Li et al., 2013) | 14,000 frames | N/A | 28.6% | Video | (Pang et al., 2020; Wu et al., 2019) |
Video surveillance | UCSD Ped 2 (Li et al., 2013) | 4,560 frames | N/A | 35.9% | Video | (Pang et al., 2020; Wu et al., 2019) |
Video surveillance | UMN (University of Minnesota, 2020) | 7,739 frames | N/A | 15.5%-18.1% | Video | (Pang et al., 2020) |
Video surveillance | Avenue (Lu et al., 2013) | 30,652 frames | N/A | 12.46% | Video | (Wu et al., 2019) |
Video surveillance | ShanghaiTech Campus (Liu et al., 2018b) | 317,398 frames | N/A | 5.38% | Video | (Liu et al., 2018b) |
Video surveillance | UCF-Crime (Sultani et al., 2018) | 1,900 videos (13.8m frames) | N/A | 13 crimes | Video | (Sultani et al., 2018) |
System log analysis | HDFS Log (Xu et al., 2009) | 11.2m | N/A | 2.90% | Sequence | (Du et al., 2017) |
System log analysis | OpenStack log | 1.3m | N/A | 7.00% | Sequence | (Du et al., 2017) |
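The conversion of classification data into an anomaly detection benchmark, which the discussion above cautions against relying on, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the protocol of any cited study: the function name `to_anomaly_benchmark`, the `anomaly_rate` parameter, and the one-class-as-normal rule are all hypothetical choices made for the example.

```python
import random


def to_anomaly_benchmark(samples, labels, normal_class, anomaly_rate=0.05, seed=0):
    """Build a one-vs-rest anomaly detection set from labeled classification data.

    All instances of `normal_class` are kept as normal (label 0); instances of
    the remaining classes are downsampled and relabeled as anomalies (label 1).
    """
    rng = random.Random(seed)
    normal = [x for x, y in zip(samples, labels) if y == normal_class]
    others = [x for x, y in zip(samples, labels) if y != normal_class]
    # Choose the anomaly count so anomalies form roughly `anomaly_rate` of the result.
    n_anom = round(len(normal) * anomaly_rate / (1.0 - anomaly_rate))
    n_anom = min(n_anom, len(others))
    anomalies = rng.sample(others, n_anom)
    data = [(x, 0) for x in normal] + [(x, 1) for x in anomalies]
    rng.shuffle(data)
    return data
```

Anomalies produced this way are simply members of other semantic classes, which is why their characteristics may differ from those of the real anomalies in the datasets of Table 3.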
In this work we categorize the current deep anomaly detection methods into three high-level categories and 11 fine-grained categories, representing 12 diverse modeling perspectives on harnessing deep learning techniques for anomaly detection. We also discuss how these methods address some notorious anomaly detection challenges to demonstrate the importance of deep anomaly detection. Through such a review, we identify some exciting opportunities as follows.
Informative supervisory signals are the key for deep anomaly detection to learn expressive representations of normality/abnormality or anomaly scores and reduce false positives. While a wide range of unsupervised or self-supervised supervisory signals have been explored, as discussed in Section 5.1, to learn the representations, a key issue for these formulations is that their objective functions are generic but not optimized specifically for anomaly detection. Anomaly measure-dependent feature learning in Section 5.2 helps address this issue by imposing constraints derived from traditional anomaly measures. However, these constraints can have some inherent limitations, e.g., implicit assumptions in the anomaly measures. It is critical to explore new sources of anomaly-supervisory signals that lie beyond the widely-used formulations such as data reconstruction and GANs, and have weak assumptions on the anomaly distribution. Another possibility is to develop domain-driven anomaly detection by leveraging domain knowledge (Cao et al., 2010) such as application-specific knowledge of anomaly and/or expert rules as the supervision source.
Deep weakly-supervised anomaly detection aims at leveraging deep neural networks to learn anomaly-informed detection models with some weakly-supervised anomaly signals, e.g., partially/inexactly/inaccurately labeled anomaly data. This labeled data provides important knowledge of anomalies and can be a major driving force to lift detection recall rates (Pang et al., 2018a, 2019b, 2019a; Sultani et al., 2018; Tamersoy et al., 2014). One exciting opportunity is to utilize a small number of accurately labeled anomaly examples to enhance detection models, as such examples are often available in real-world applications, e.g., some intrusions/frauds reported by deployed detection systems/end-users and verified by human experts. However, since anomalies can be highly heterogeneous, there can be unknown/novel anomalies that lie beyond the span of the given anomaly examples. Thus, one important direction here is unknown anomaly detection, in which we aim to build detection models that generalize from the limited labeled anomalies to unknown anomalies. Some recent studies (Pang et al., 2019b; Ruff et al., 2020; Pang et al., 2019a) show that deep detection models are able to learn abnormality that lies beyond the scope of the given anomaly examples. It would be important to further understand and explore the extent of this generalizability and to develop models that further improve detection accuracy.
Detecting anomalies that belong to the same classes as the given anomaly examples can be as important as detecting novel/unknown anomalies. Thus, another important direction is to develop data-efficient anomaly detection or few-shot anomaly detection, in which we aim at learning highly expressive representations of the known anomaly classes given only limited anomaly examples (Pang et al., 2018a, 2019b; Tian et al., 2020; Pang et al., 2019a). It should be noted that the limited anomaly examples may come from different anomaly classes and thus exhibit completely different manifold/class features. This scenario is fundamentally different from general few-shot learning (Wang et al., [n.d.]), in which the limited examples are class-specific and assumed to share the same manifold/class structure. Additionally, as shown in Table 1, the network architectures are mostly not as deep as those in other machine learning tasks. This may be partially due to the limited size of the labeled training data. It is important to explore the possibility of leveraging such small labeled data to learn more powerful detection models with deeper architectures.
Also, inexact or inaccurate (e.g., coarse-grained) anomaly labels are often inexpensive to collect in some applications (Sultani et al., 2018); learning deep detection models with this weak supervision is important in these scenarios.
Large-scale unsupervised/self-supervised representation learning has gained tremendous success in enabling downstream learning tasks (Peters et al., 2018; Devlin et al., 2018). This is particularly important for learning tasks in which it is difficult to obtain sufficient labeled data, such as anomaly detection (see Section 2.1). The goal is to learn transferable pre-trained representation models from large-scale unlabeled data in an unsupervised/self-supervised mode and fine-tune detection models in a semi-supervised mode. The self-supervised classification-based methods in Section 5.1.4 may provide some initial sources of supervision for the normality learning. However, precautions must be taken to ensure that (i) the unlabeled data is free of anomaly contamination and/or (ii) the representation learning methods are robust w.r.t. possible anomaly contamination. This is because most methods in Section 5 implicitly assume that the training data is clean and does not contain any noise/anomaly instances. This robustness is important in both the pre-training and the fine-tuning stages. Additionally, anomalies and datasets in different domains vary significantly, so the large-scale normality learning may need to be domain/application-specific.
Most existing deep anomaly detection methods focus on point anomalies, showing substantially better performance than traditional methods. However, deep models for conditional/group anomalies have been significantly less explored. Deep learning has superior capability in capturing complex temporal/spatial dependence and learning representations of a set of unordered data points; it is important to explore whether deep learning could also gain similar advantages over traditional approaches in detecting such complex anomalies. Novel neural network layers or objective functions may be required.
Similar to traditional methods, current deep anomaly detection mainly focuses on single data sources. Multimodal anomaly detection is a largely unexplored research area. It is difficult for traditional approaches to bridge the gap presented by multimodal data. Deep learning has demonstrated tremendous success in learning feature representations from different types of raw data for anomaly detection (Ding et al., 2019; Ionescu et al., 2019; Lu et al., 2017; Pang et al., 2018a; Sabokrou et al., 2018); it is also able to concatenate the representations from different data sources to learn unified representations (Goodfellow et al., 2016), so deep approaches present important opportunities for multimodal anomaly detection.
Current deep anomaly detection research mainly focuses on the detection accuracy aspect. Interpretable deep anomaly detection and actionable deep anomaly detection are essential for understanding model decisions and results, mitigating any potential bias/risk against human users, and enabling decision-making actions. In recent years, there have been some studies (Angiulli et al., 2009; Duan et al., 2015; Vinh et al., 2016; Angiulli et al., 2017; Siddiqui et al., 2019) that explore anomaly explanation by searching for a subset of features that makes a reported anomaly most abnormal. The abnormal feature selection methods (Pang et al., 2016; Azmandian et al., 2012; Pang et al., 2017) may also be utilized for anomaly explanation purposes. The anomalous feature search in these methods is independent of the anomaly detection method, and thus may be used to explain anomalies identified by any anomaly detection method, including deep models. On the other hand, this model-agnostic approach may render the explanation less useful, because it cannot provide a genuine understanding of the mechanisms underlying specific detection models, resulting in potential modeling bias/risk and weak interpretability and actionability (e.g., quantifying the impact of detected anomalies and mitigation actions). Deep models with an inherent capability to provide anomaly explanation are important, such as (Pang et al., 2020). To achieve this, methods for deep model explanation (Du et al., 2019) and actionable knowledge discovery (Cao et al., 2010) could be explored with deep anomaly detection models.
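To make the model-agnostic explanation idea above concrete, the following sketch ranks the features of a reported anomaly by how strongly each deviates from a reference set. It is a deliberately simplified stand-in, not the search procedure of any cited work: scoring each feature independently by absolute z-score, and the names `explain_anomaly`, `reference`, and `k`, are assumptions made for illustration.

```python
from statistics import mean, pstdev


def explain_anomaly(instance, reference, k=2):
    """Return the k feature indices on which `instance` deviates most.

    Deviation is the absolute z-score of each feature with respect to
    `reference`, a list of (presumed normal) instances.
    """
    scores = []
    for j, value in enumerate(instance):
        column = [row[j] for row in reference]
        mu, sigma = mean(column), pstdev(column)
        # Constant features carry no deviation signal; this also avoids division by zero.
        z = abs(value - mu) / sigma if sigma > 0 else 0.0
        scores.append((z, j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]
```

Because the ranking never consults the detector that flagged the instance, it illustrates the limitation noted above: the returned features describe the data, not the mechanism by which a specific deep model reached its decision.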
There have been some exciting emerging research applications and problem settings, into which there could be some important opportunities of extending deep detection methods. First, out-of-distribution (OOD) detection (Hendrycks and Gimpel, 2017; Lee et al., 2018; Ren et al., 2019) is another closely related area, which detects data instances that are anomalous or significantly different from the instances used in training. This is an essential technique to enable machine learning systems to deal with instances of novel classes in open-world environments. OOD detection is also an anomaly detection task, but in OOD detection it is generally assumed that fine-grained normal class labels are available during training, and we need to retain the classification accuracy of these normal classes while performing accurate OOD detection. Second, curiosity learning (Pathak et al., 2017; Burda et al., 2019a; Burda et al., 2019b) aims at learning a bonus reward function in reinforcement learning with sparse rewards. Particularly, reinforcement learning algorithms often fail to work in an environment with very sparse rewards. Curiosity learning addresses this problem by augmenting the environment with a bonus reward in addition to the original sparse rewards from the environment. This bonus reward is defined typically based on the novelty or rarity of the states, i.e., the agent receives large bonus rewards if it discovers novel/rare states. The novel/rare states are concepts similar to anomalies. Therefore, it would be interesting to explore how deep anomaly detection could be utilized to enhance this challenging reinforcement learning problem; conversely, there can be opportunities to leverage curiosity learning techniques for anomaly detection, such as the method in (Wang et al., 2020). 
Third, most shallow and deep models for anomaly detection assume that the abnormality of data instances is independent and identically distributed (IID), while the abnormality in real applications may suffer from some non-IID characteristics, e.g., the abnormality of different instances/features is interdependent and/or heterogeneous (Pang, 2019). For example, the abnormality of multiple synchronized disease symptoms is mutually reinforced in early detection of diseases. This requires non-IID anomaly detection (Pang, 2019) that is dedicated to learning such non-IID abnormality. This task is crucial in complex scenarios, e.g., where anomalies have only subtle deviations and are hidden in the data space if not considering these non-IID abnormality characteristics. Lastly, other interesting applications include detection of adversarial examples (Grosse et al., 2017; Paudice et al., 2018), anti-spoofing in biometric systems (Pérez-Cabo et al., 2019; Fatemifar et al., 2019), and early detection of rare catastrophic events (e.g., financial crisis (Cao and Cao, 2015) and other black swan events (Aven, 2016)).
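The novelty-based bonus reward described above for curiosity learning can be illustrated with a small count-based sketch, in which rarely visited states receive a larger bonus. This is an assumption-laden simplification: the cited curiosity methods are not reproduced here, and the class name `NoveltyBonus`, the coefficient `beta`, and the `beta / sqrt(count)` form are illustrative choices that require hashable (e.g., discrete) states.

```python
from collections import Counter
from math import sqrt


class NoveltyBonus:
    """Augment sparse environment rewards with a rarity-based bonus."""

    def __init__(self, beta=1.0):
        self.beta = beta          # weight of the bonus relative to the true reward
        self.counts = Counter()   # visit counts per (hashable) state

    def reward(self, state, extrinsic_reward):
        # Record the visit first so the count is at least 1, then shrink the
        # bonus as the state becomes more familiar.
        self.counts[state] += 1
        bonus = self.beta / sqrt(self.counts[state])
        return extrinsic_reward + bonus
```

In this view, the bonus plays the role of an anomaly score over states, which is where a deep anomaly detector could replace the visit counter for high-dimensional observations.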
Latent space autoregression for novelty detection. In CVPR. 481–490.
Deep clustering for unsupervised learning of visual features. In ECCV. 132–149.
Good semi-supervised learning that requires a bad GAN. In NeurIPS. 6510–6520.
Predicting positive p53 cancer rescue regions using Most Informative Positive (MIP) active learning. PLoS Computational Biology 5, 9 (2009).
Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11, Dec (2010), 3371–3408.
Unsupervised deep embedding for clustering analysis. In ICML. 478–487.
Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV. 2223–2232.