Enhancing Unsupervised Anomaly Detection with Score-Guided Network

09/10/2021 ∙ by Zongyuan Huang, et al.

Anomaly detection plays a crucial role in various real-world applications, including healthcare and finance systems. Owing to the limited number of anomaly labels in these complex systems, unsupervised anomaly detection methods have attracted great attention in recent years. Two major challenges faced by existing unsupervised methods are: (i) distinguishing between normal and abnormal data in the transition field, where normal and abnormal data are highly mixed together; (ii) defining an effective metric to maximize the gap between normal and abnormal data in a hypothesis space, which is built by a representation learner. To that end, this work proposes a novel scoring network with a score-guided regularization to learn and enlarge the anomaly score disparity between normal and abnormal data. With such a score-guided strategy, the representation learner gradually learns more informative representations during training, especially for the samples in the transition field. We next propose a score-guided autoencoder (SG-AE), which incorporates the scoring network into an autoencoder framework for anomaly detection, and also integrate the scoring network into three other state-of-the-art models to further demonstrate the effectiveness and transferability of the design. Extensive experiments on both synthetic and real-world datasets demonstrate the state-of-the-art performance of these score-guided models (SGMs).


1 Introduction

Anomaly detection aims to identify the samples that considerably deviate from the expectation of a complex system. There are plentiful applications of anomaly detection across various domains [10, 31, 25], for example, disease detection [45], network intrusion detection [15], and financial fraud detection [2]. Existing anomaly detection methods can be grouped into three categories according to label availability in the training datasets: supervised, semi-supervised, and unsupervised learning methods. Supervised and semi-supervised learning methods rely on labeled data, which are rarely available in many practical applications. Accordingly, unsupervised anomaly detection methods have received considerable attention in both the academic and industrial communities.

In general, the core idea of the unsupervised methods is to discover and utilize implicit or extra information to distinguish anomalies. Figure 1 presents the typical pipeline of unsupervised anomaly detection. A representation learner is first trained to map the original data, e.g., tabular, image, or document data, into the desired hypothesis space. A discriminator then identifies the abnormal samples, mostly using various pre-defined metrics. Based on the definition of the discriminator, we can further classify unsupervised anomaly detection methods into three categories: distribution-based, distance-based, and reconstruction-based methods [35]. Distribution-based methods identify anomalies by assuming that the normal data follow a certain distribution; distance-based methods attempt to enlarge the distance between abnormal and normal data in a latent space; reconstruction-based methods identify anomalies by the similarity between the original and reconstructed data. In these existing methods, the final identification of anomalies relies solely on a pre-defined distance calculated on the original samples or their representations, as illustrated in Figure 1.

Most existing unsupervised methods face the following challenges: (1) Clean data with only normal samples are required to train the models [16, 36, 51, 17, 8]. In these cases, contaminated data containing both normal and abnormal samples, which commonly exist in real-world applications, hinder the models from fitting the normal data and degrade performance. (2) The manually selected metric used to compute the anomaly scores impacts the performance, and no single metric fits all datasets, so "one model fits all" can hardly be achieved [50, 51, 23]. (3) The data distribution can be divided into three fields: the obvious-normal field, the transition field, and the obvious-abnormal field, as shown in Figure 2. The samples in the obvious-normal or obvious-abnormal fields can be clearly distinguished as normal or abnormal. However, the mixture of normal and abnormal data in the transition field hampers the identification of the abnormal samples.

Fig. 1: The typical pipeline of unsupervised anomaly detection and the proposed scoring network. Different from previous unsupervised anomaly detection methods that calculate anomaly scores with a pre-defined metric, our scoring network learns the metrics from data to discriminate anomalies and enlarges the anomaly score disparity. The proposed scoring network can be embedded into existing methods without additional assumptions. The representation learner is simultaneously trained with the scoring network. That is, the scoring network guides the representation learner to learn the information with better discrimination ability from the input data.
Fig. 2: A toy example to illustrate the concepts of obvious-normal, obvious-abnormal, and transition fields.

To overcome these challenges, we propose a scoring network that promotes anomaly identification by learning and enhancing the disparity between normal and abnormal data. As shown in Figure 1, the scoring network is connected to the representation learner, which maps the original data into a hypothesis space. The scoring network guides the training of the representation learner to extract information that is more helpful for discrimination from the data in the obvious-normal and obvious-abnormal fields. Specifically, we introduce a score-guided regularization in the scoring network to assign smaller scores to the obvious-normal data and larger scores to the obvious-abnormal data, making full use of the obvious data and enhancing the optimization of the representation learner. Thus, the normal and abnormal data in the transition field are guided in different directions, and their difference is expected to increase gradually during training. Moreover, the scoring network is devised to directly learn the discrimination metric and output the anomaly scores in an end-to-end fashion, instead of relying on pre-defined distance metrics or reconstruction errors, and can thereby be flexibly embedded into most existing anomaly detection models.

We further propose a novel instantiation that embeds the scoring network into an autoencoder framework, named the score-guided autoencoder (SG-AE). The autoencoder reconstructs data instances and learns their representations in a latent space. The scoring network then assigns scores to these representations and attempts to maximize the disparity between normal and abnormal samples; a sample with a higher anomaly score is more likely to be an anomaly. To explain how the scoring network works, we conduct simulation experiments to observe the changes of the score differences in the transition field. We also compare the performance of SG-AE with classic and state-of-the-art methods on seven tabular datasets, present the distributions of anomaly scores of normal and abnormal data, and utilize t-SNE to reveal the anomaly detection performance in a 2D space. Moreover, we extend the scoring network to three state-of-the-art models and explore the performance improvement of these score-guided models (SG-Models). We also examine the potential of the scoring network on two document datasets and one image dataset. The main contributions of this work are summarized as follows:

  • We propose a scoring network with score-guided regularization, which utilizes the obvious data to strengthen the representation learner and expand the score disparity between normal and abnormal samples in the transition field, thus enhancing the anomaly detection ability. The scoring network directly learns the anomaly scores and can be easily integrated into most unsupervised models without imposing additional assumptions.

  • We integrate the scoring network into an autoencoder structure and introduce a simple but effective instantiation, SG-AE. SG-AE removes the autoencoder's reliance on clean, normal-only training data and achieves competitive performance on several datasets. We also incorporate the scoring network into three state-of-the-art models to examine its adaptability to different methods.

  • Extensive experiments on synthetic and real-world datasets demonstrate the effectiveness of the scoring network. SG-AE outperforms the classic and state-of-the-art methods, and the other SG-Models show performance improvements with the help of the scoring network. Moreover, our scoring network is robust to the anomaly rate because it learns the disparity between normal and abnormal data, which is also demonstrated by the experiments.

2 Related Work

In this section, we briefly review the related work in unsupervised anomaly detection and revisit the use of regularization techniques in anomaly detection. Then we compare our design with the existing methods.

2.1 Unsupervised Anomaly Detection

Distribution-based methods. Distribution-based methods assume that normal data follow a certain distribution in the original or latent space and that anomalies deviate from this distribution [14, 21, 49]. The classic distribution-based methods, based on extreme-value analysis, consider the tail of a probability distribution as anomalies [1]. In recent years, deep learning has brought more possibilities with more distribution hypotheses in the hidden space. For instance, the Deep Autoencoding Gaussian Mixture Model (DAGMM) [51] assumes that the low-dimensional representations and reconstruction errors of normal data follow a Gaussian Mixture Model, and uses the probability value to distinguish between normal and abnormal data. Some one-class classification methods also fall into this category owing to their consideration of the manifold distribution. Deep SVDD [36] attempts to map the normal data into a hypersphere in a latent space so that the abnormal data fall outside the hypersphere. DROCC [17] assumes that the normal data lie on a low-dimensional manifold and distinguishes data via Euclidean distance; it generates synthetic abnormal data to learn a robust boundary with a well-designed optimization algorithm. GOAD [8] maps data into a spherical feature space with random affine transformations, and anomalies are assumed to deviate from the center. Some methods, like Outlier Exposure [20], utilize extra information provided by auxiliary datasets to learn the normal distribution and detect out-of-distribution samples. Note that although a number of generative methods assume various distributions in latent spaces, they ultimately determine anomalies based on reconstruction errors; as such, we classify them as reconstruction-based methods.

Distance-based methods. Distance-based methods consider the positional relations or neighbor structures in the original or hidden space, assuming that anomalous samples stay far away from the normal ones [34, 18]. Nearest-neighbor-based methods, such as LOF [9] and DN2 [7], are typical distance-based methods. Likewise, tree-based methods, like Isolation Forest (iForest) [26] and RRCT [19], can be regarded as distance-based methods since they attempt to capture high-density fields that are reflected in the depth or complexity of the tree [35]. Deep learning technologies are advantageous tools for learning data representations so that distances can be measured in a hypothesis space [30]. For example, Random Distance Prediction (RDP) [43] builds deep neural networks to learn a certain random mapping that preserves the proximity information of the data.

Reconstruction-based methods. Reconstruction-based methods train models to reconstruct data, with the assumption that the trained models learn the patterns of the majority of the data and that anomalies cannot be well reconstructed. The autoencoder framework [5] is a fundamental architecture in reconstruction-based methods. The Robust Deep Autoencoder (RDA) [50] is an early work that builds an autoencoder on corrupted data; it isolates noise and outliers with robust principal component analysis and thereafter trains an autoencoder. The Robust Subspace Recovery Autoencoder (RSRAE) [23] constructs a robust subspace recovery layer within an autoencoder with a well-designed loss, and the cosine similarity between the original and reconstructed data is used to identify abnormal samples. CoRA [42] is a transductive semi-supervised model that modifies the autoencoder framework with one encoder and two decoders; the two decoders distinguish normal data from abnormal data by reconstruction errors and then decode them separately. Anomaly detection with generative models can also be considered reconstruction-based: VAE-based methods adopt the reconstruction probability to identify anomalies [3, 46, 25], while GAN-based methods detect anomalies with the discriminator, owing to the inaccurate reconstruction of anomalies by the generator [38, 37, 24].

2.2 Anomaly Detection with Regularization

Regularization is a widely used technique to alleviate over-fitting [40] and achieve adversarial robustness [44]. Many works utilize regularization to encourage autoencoders to learn informative low-dimensional representations of the input data [4, 48, 33, 47]. In the context of anomaly detection, regularization is introduced to enforce robustness against anomalies, encouraging the model to learn the key underlying regularities [11]. Motivated by robust principal component analysis (RPCA), RDA [50] adds a sparsity penalty and a group penalty to its RPCA-like autoencoder loss function to improve robustness. RSRAE [23] further incorporates a special regularizer into the embedding of the encoder to enforce an anomaly-robust linear structure. DAGMM [51] introduces a penalty term to constrain training. Some one-class classification methods also employ regularization to better learn the boundary and improve performance. Ruff et al. presented a unified view of anomaly detection methods and comprehensively summarized the methods using regularization [35].

2.3 Comparison with existing methods

Compared with the existing work, the proposed method has the following characteristics to tackle the challenges mentioned in Section 1. (1) Many existing unsupervised models are trained only with normal samples [16, 36, 51, 17, 8]. Although some methods can deal with contaminated data [50, 43, 23], they focus on retaining the obvious-normal data and filtering out the suspected abnormal data. Our method leverages both the normal and abnormal samples in the datasets and attempts to enhance the detection capability in the transition field. Specifically, our score-guided regularization utilizes the obvious data to enlarge the score disparity between normal and abnormal data, and guides the training of the representation learner and the scoring network. By doing so, the proposed scoring network improves the ability of the representation learner and the robustness of the entire detection method. (2) Existing unsupervised models define anomaly scores with manually selected metrics, such as Euclidean distance for distance-based methods [36, 30, 17] or cosine similarity for reconstruction-based methods [50, 38, 23]; the anomaly scores are not directly optimized in an end-to-end fashion. DevNet [32] is the first weakly supervised method to achieve end-to-end anomaly score learning. However, DevNet guides model training with labels that are missing in the unsupervised setting and treats the unknown anomalies as normal data. In contrast, our scoring network can be incorporated into existing unsupervised methods without additional assumptions, expanding their ability to handle contaminated datasets and to directly optimize anomaly scores.

3 Methodology

In this section, we first introduce the score-guided regularization and then incorporate the scoring network into different unsupervised methods. These score-guided models are collectively called SG-Models (SGMs). Finally, the algorithm of SGMs is elaborated.

3.1 Score-guided Regularization

Let $\mathcal{X} = \{x_1, x_2, \dots, x_N\}$ denote a dataset with $N$ samples, where $x_i \in \mathbb{R}^D$. The common practice of unsupervised anomaly detection methods is to choose one of the hypotheses mentioned in Section 2 to build a model. As shown in Figure 1, after the representation learner maps a data sample $x_i$ to a representation $z_i$ in a latent space, a pre-defined function $f(\cdot)$ needs to be selected to identify the anomalous data in the latent space. According to the chosen hypothesis, $f$ can be a distribution function, a reconstruction error, or a distance relation:

$f(x_i) \in \big\{\, P(z_i),\ \lVert x_i - \hat{x}_i \rVert,\ d(z_i, \cdot) \,\big\},$   (1)

where $z_i$ is the learned data representation and $\hat{x}_i$ is the reconstructed counterpart of $x_i$.
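As a concrete illustration of the pre-defined metrics in Eq. (1), the snippet below sketches the three common choices of $f$; the function names and the Gaussian density assumption are ours, not taken from the paper's released code.

```python
import torch

def reconstruction_error(x, x_hat):
    # Reconstruction-based f: per-sample L2 distance between input and reconstruction.
    return torch.norm(x - x_hat, dim=1)

def distance_to_center(z, center):
    # Distance-based f: Euclidean distance of each latent representation to a reference point.
    return torch.norm(z - center, dim=1)

def negative_log_density(z, mean, cov):
    # Distribution-based f: low density under a fitted Gaussian gives a high anomaly value.
    gaussian = torch.distributions.MultivariateNormal(mean, covariance_matrix=cov)
    return -gaussian.log_prob(z)
```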

The function values of anomalies deviate from the major data, and a larger value suggests a higher probability of being an anomaly. However, selecting or designing an effective requires prior domain knowledge about the tasks and there is almost no universal metric to fit various datasets. Therefore, an intuitive practice is to directly learn the metrics through a neural network and enlarge the deviation between normal and abnormal data to improve the detection capability. However, the prerequisite for the correct expansion of the score disparity is that we can distinguish between the normal and abnormal data, which seems to be a paradox. It is not a simple task for the scoring network to assign correct scores to samples in the unsupervised settings, especially to the samples in the transition field.

In fact, we can use the discrimination metric as a self-supervised signal to guide the learning of anomaly scores. Supposing that the metric is able to detect the obvious-normal and obvious-abnormal data, the scoring network is expected to guide the score distribution of the obvious data first and then gradually guide the non-obvious data in the transition field. Specifically, the obvious data are easy to identify, so their anomaly scores converge to the expected values faster. The representation learner extracts more valuable information from the obvious data, and the scoring network can then more efficiently capture differences from the data representations. Although the non-obvious data are difficult to distinguish, and their anomaly scores might even be guided in the wrong direction in the early stage of training, the whole model is gradually optimized and becomes capable of identifying the correct guidance direction. Finally, the non-obvious data are expected to be better distinguished.

To achieve this, we propose a scoring network with a score-guided regularization to learn and enlarge the anomaly score disparity and to enhance the capability of the representation learner, as shown in Figure 1. Supposing $z_i$ is the representation of a data sample $x_i$ in the latent space, the scoring network is devised to take $z_i$ as input and output an anomaly score $s_i$ in an end-to-end manner. As it is expected to assign smaller anomaly scores to obvious-normal samples and larger scores to obvious-abnormal samples, the regularization function can be defined as follows:

$R(x_i) = \begin{cases} s_i, & f(x_i) \le \epsilon_1 \\ \lambda \max(0,\, a - s_i), & f(x_i) \ge \epsilon_2 \end{cases}$   (2)

where $f(\cdot)$ is the aforementioned discrimination function, $s_i$ is the learned anomaly score, and $\lambda$ is a weight parameter to adjust the effect of score guidance for anomalies. $\epsilon_1$ and $\epsilon_2$ are thresholds that divide the obvious-normal and obvious-abnormal fields, as shown in Figure 2. As the training progresses, the anomaly scores of obvious-normal samples are expected to approach zero and the scores of obvious-abnormal samples are expected to exceed the value $a$. However, selecting the two thresholds is difficult in unsupervised settings. Considering that most samples in a dataset are normal, we apply only one threshold $\epsilon$ to divide the obvious-normal field and the suspected abnormal field. Then the regularization function is revised as follows:

$R(x_i) = \begin{cases} \lvert s_i - \delta \rvert, & f(x_i) \le \epsilon \\ \lambda \max(0,\, a - s_i), & f(x_i) > \epsilon \end{cases}$   (3)

We set a very small positive value $\delta$, approaching zero, as the target anomaly score for the obvious-normal data, because a target of exactly zero would force most weights of the scoring network toward zero. Although the normal data in the transition field are treated as suspected abnormal data, they will gradually be separated from the abnormal data with the knowledge learned from the obvious-normal data. From another perspective, the regularization function can be rewritten as follows:

$R(x_i) = R_n(x_i) + \lambda R_a(x_i),$   (4)

where $R_n(x_i) = \lvert s_i - \delta \rvert \cdot \mathbb{1}[f(x_i) \le \epsilon]$ is the normal data part and $R_a(x_i) = \max(0,\, a - s_i) \cdot \mathbb{1}[f(x_i) > \epsilon]$ is the suspected abnormal data part.
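A minimal PyTorch sketch of the regularizer in Eq. (3)/(4) is given below; the notation follows the text, and the default values of $a$, $\lambda$, and the small target $\delta$ are placeholders of this sketch rather than the paper's tuned settings.

```python
import torch

def score_guided_regularization(f_values, scores, eps, a=5.0, lam=1.0, delta=1e-3):
    """Eq. (3)/(4): pull the scores of obvious-normal samples toward a small value delta
    and push the scores of suspected-abnormal samples above the target a."""
    normal_mask = (f_values <= eps).float()          # obvious-normal field
    abnormal_mask = 1.0 - normal_mask                # suspected abnormal field
    r_normal = normal_mask * torch.abs(scores - delta)
    r_abnormal = abnormal_mask * torch.clamp(a - scores, min=0.0)
    return (r_normal + lam * r_abnormal).mean()
```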

The score-guided regularization does not impose new assumptions, so it can be integrated with existing unsupervised methods. Let $\mathcal{L}_{origin}$ be the loss function of an existing unsupervised model. The total loss function of the integrated model with score-guided regularization is

$\mathcal{L} = \mathcal{L}_{origin} + \alpha \frac{1}{N} \sum_{i=1}^{N} R(x_i),$   (5)

where $\alpha$ is a hyperparameter to adjust the effect of the score-guided regularization.

Parameter analysis. The score-guided regularization introduces four parameters, $\lambda$, $a$, $\epsilon$, and $\alpha$, which seems to make it difficult to determine appropriate values in real-life applications. Here we analyze the real impact of these parameters. The parameter $\alpha$ adjusts the effect of the score-guided regularization relative to the original loss, and $\lambda$ adjusts the effect of the suspected abnormal part. Their joint impact becomes clear by rewriting Eq. (5) as:

$\mathcal{L} = \mathcal{L}_{origin} + \lambda_1 \frac{1}{N} \sum_{i=1}^{N} R_n(x_i) + \lambda_2 \frac{1}{N} \sum_{i=1}^{N} R_a(x_i),$   (6)

where $\lambda_1 = \alpha$ and $\lambda_2 = \alpha \lambda$ adjust the effect of the normal data part and the suspected abnormal data part, respectively. The parameters $\lambda_1$ and $\lambda_2$ have different value ranges but carry the same meaning and influence as $\alpha$ and $\lambda$. When we incorporate the scoring network into the autoencoder, as described in the next subsection, the $f(x_i)$ in Eq. (6) becomes the reconstruction error $\lVert x_i - \hat{x}_i \rVert$, and the total loss function takes a form similar to DAGMM [51] and RSRAE [23], both of which also use two weights to adjust the regularization effect. The parameter $a$ determines the guided position of the anomaly scores. Intuitively, a larger $a$ can better enlarge the score gap between normal and abnormal data. However, a too-large $a$ only stretches the distribution of scores, and the performance improvement stabilizes as $a$ increases. Because the anomaly score is learned by the scoring network and $a$ determines the effect of the regularizer together with $\lambda$, the actual impact of $a$ is small and does not require adjustment; we therefore fix $a$ in all experiments. The parameter $\epsilon$ determines the split between normal and suspected abnormal data. For convenience, we select a percentile, denoted as $p$, of the values of $f(x_i)$ to find the corresponding $\epsilon$. The percentile $p$ can be smaller than the normal data ratio, because the scoring network mainly uses the obvious-normal data to guide the score distribution in the transition field, which is confirmed in the experiments. Notably, the distribution of anomaly scores can serve as a signal that reflects the model performance and helps parameter selection, since the anomaly scores are directly learned and guided by the scoring network. We can also design an early-stop mechanism based on a constraint on the anomaly scores.
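The percentile-based choice of the threshold described above can be implemented in one line; treating it as a per-batch statistic, as below, is an assumption of this sketch.

```python
import torch

def threshold_from_percentile(f_values, p):
    # Choose epsilon as the p-th quantile (p in [0, 1]) of the discrimination values f(x_i),
    # so that roughly a fraction p of the samples falls into the obvious-normal field.
    return torch.quantile(f_values, p)
```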

The score-guided regularizer can also exploit a limited number of labels by using the labels, instead of the self-supervised signal $f(x_i)$, for the labelled data. In this case, the regularizer does not rely entirely on labels to guide the anomaly scores as DevNet does, owing to the use of the self-supervised signal.

3.2 Score-guided Models

As mentioned above, we can integrate the scoring network into unsupervised anomaly detection methods. The score-guided methods are collectively called SG-Models (SGMs). Here, we first propose SG-AE, an instantiation that applies the scoring network to a reconstruction-based method, and then introduce a general form of SGMs.

As illustrated in Figure 3, SG-AE consists of a reconstructor and a score guider. Given a data sample $x_i$, the reconstructor first maps it to a representation $z_i$ in a latent space and then generates an estimation $\hat{x}_i$. The score guider learns an anomaly score $s_i$ using $z_i$ as its input. In practice, we adopt an autoencoder framework as the reconstructor and a fully connected network in the score guider. For tabular data, we also use fully connected networks in the autoencoder. It is noteworthy that this autoencoder network is easy to transfer to other data types; for example, we can utilize recurrent networks for sequence data and convolutional networks for image data [32, 23].

Fig. 3: The proposed score-guided autoencoder (SG-AE).

Let $E$ and $D$ denote the encoder and decoder in an autoencoder, respectively, and let $S$ denote the scoring network. We thus have

$z_i = E(x_i), \quad \hat{x}_i = D(z_i), \quad s_i = S(z_i),$   (7)

where $x_i$ is the input data, $\hat{x}_i$ is the reconstructed counterpart of $x_i$, $z_i$ is the latent representation, and $s_i$ is the anomaly score.

The end-to-end network SG-AE can then be represented as:

$\Phi(x_i) = \big(\hat{x}_i,\, s_i\big) = \big(D(E(x_i)),\, S(E(x_i))\big).$   (8)
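The following PyTorch sketch mirrors Eq. (7)-(8) with fully connected layers; the layer sizes loosely follow the settings reported in the experiments, while activation choices and other details are assumptions of this sketch.

```python
import torch
import torch.nn as nn

def mlp(sizes):
    # Fully connected stack with ReLU between hidden layers and a linear output layer.
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

class SGAE(nn.Module):
    """Autoencoder (reconstructor) plus a scoring head (score guider) on the latent code."""
    def __init__(self, in_dim, enc_dims=(80, 40, 20), score_dims=(10,)):
        super().__init__()
        self.encoder = mlp([in_dim, *enc_dims])             # E: x -> z
        self.decoder = mlp([*reversed(enc_dims), in_dim])   # D: z -> x_hat (symmetric to E)
        self.scorer = mlp([enc_dims[-1], *score_dims, 1])   # S: z -> scalar anomaly score s

    def forward(self, x):
        z = self.encoder(x)
        x_hat = self.decoder(z)
        s = self.scorer(z).squeeze(-1)
        return x_hat, z, s
```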

We obtain the parameters of $\Phi$ by minimizing a loss function that combines two parts: the reconstruction loss and the score-guidance loss. For the autoencoder, we use the $\ell_2$ loss to assess the reconstruction,

$\mathcal{L}_{rec} = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - \hat{x}_i \rVert_2.$   (9)

In this reconstruction-based setting, we also utilize the reconstruction error as the self-supervised signal; that is, we use $\lVert x_i - \hat{x}_i \rVert_2$ as $f(x_i)$ in Eq. (3). Thus, we can rewrite the score-guided regularization in (3) and define the score-guidance loss as

$\mathcal{L}_{score} = \frac{1}{N} \sum_{i=1}^{N} \begin{cases} \lvert s_i - \delta \rvert, & \lVert x_i - \hat{x}_i \rVert_2 \le \epsilon \\ \lambda \max(0,\, a - s_i), & \lVert x_i - \hat{x}_i \rVert_2 > \epsilon \end{cases}$   (10)

We finally define the overall loss function of SG-AE as the combination of the two loss terms in Eq. (9) and (10), that is,

$\mathcal{L}_{SG\text{-}AE} = \mathcal{L}_{rec} + \alpha \mathcal{L}_{score}.$   (11)

Our goal is to minimize the loss function (11). A lower reconstruction loss in Eq. (9) ensures better reconstructions and representations of the normal data. As the score-guidance loss decreases, the anomaly scores of normal data approach zero and the scores of suspected abnormal data tend toward $a$, so the score disparity between normal and abnormal data continues to widen. A higher anomaly score indicates a higher probability that the sample is an anomaly. By thresholding the anomaly scores, one can distinguish between normal and abnormal data samples.
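Putting Eq. (9)-(11) together, a batch-level version of the SG-AE objective could look like the sketch below; computing the threshold from the p-th quantile of the batch reconstruction errors and the default hyperparameter values are assumptions of this sketch.

```python
import torch

def sgae_loss(x, x_hat, s, p=0.7, a=5.0, lam=1.0, alpha=1.0, delta=1e-3):
    # The reconstruction error doubles as the self-supervised signal f(x_i).
    rec_err = torch.norm(x - x_hat, dim=1)
    l_rec = rec_err.mean()                                   # Eq. (9)
    eps = torch.quantile(rec_err.detach(), p)                # threshold from the p-th percentile
    normal = (rec_err <= eps).float()
    l_score = (normal * torch.abs(s - delta)
               + (1.0 - normal) * lam * torch.clamp(a - s, min=0.0)).mean()   # Eq. (10)
    return l_rec + alpha * l_score                           # Eq. (11)
```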

In addition to SG-AE, which relies on the reconstruction-based assumption, the scoring network can also be applied to other unsupervised methods, such as distance-based and distribution-based methods. Here, we discuss a general form of combining the scoring network with unsupervised methods. Let $\phi$ be the representation learner of an unsupervised method. With the discrimination metric $f$, the general form of an unsupervised method is:

$\text{score}(x_i) = f(\phi(x_i)).$   (12)

Combined with the scoring network $S$, the general form of SGMs is:

$s_i = S(\phi(x_i)).$   (13)

By optimizing the total loss function (5), the score-guided regularization encourages the SGMs to learn and guide the anomaly scores, and also helps the representation learner of the original model to learn better data representations. To examine this scalability, we extend the scoring network to three state-of-the-art models, DAGMM [51], RDA [50], and RSRAE [23], and evaluate their performance in the experiments.

3.3 The SGM Algorithm

We illustrate the training process of SGMs in Algorithm 1. The parameters of an SGM are initialized randomly and optimized in the training iterations (Steps 2-10) to minimize the loss in Eq. (5). Specifically, the data representation $z_i$ is learned by the representation learner $\phi$ through $z_i = \phi(x_i)$ in Step 5. The scoring network then learns the anomaly score through $s_i = S(z_i)$ in Step 6. Step 7 calculates the loss function (5), and Step 8 performs backpropagation and updates the network parameters. The data representation and the anomaly score distribution are optimized jointly during training. Given a trained SGM, anomaly scores can be directly calculated for new data samples.

0:  Training set $\mathcal{X}$
0:  SGM $\Phi$: an optimized anomaly detection network
1:  Randomly initialize the network parameters of $\phi$ and $S$
2:  for each epoch do
3:     Divide the input data into batches
4:     for each batch do
5:        $z_i = \phi(x_i)$
6:        $s_i = S(z_i)$
7:        Calculate the loss function (5)
8:        Backpropagate and update the parameters of $\phi$ and $S$
9:     end for
10:  end for
11:  return  Optimized SGM $\Phi$
Algorithm 1 Training SGMs
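For concreteness, Algorithm 1 could be realized as the PyTorch loop below, reusing the SGAE model and sgae_loss sketched earlier; the helper names and defaults are illustrative, not the released implementation. For instance, train_sgm(SGAE(in_dim=X.shape[1]), X, sgae_loss) would train the SG-AE variant.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train_sgm(model, X, loss_fn, epochs=100, batch_size=1024, lr=1e-3):
    # X: array of shape (num_samples, num_features); model returns (x_hat, z, s).
    data = TensorDataset(torch.as_tensor(X, dtype=torch.float32))
    loader = DataLoader(data, batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):                   # Steps 2-10
        for (x,) in loader:
            x_hat, _, s = model(x)            # Steps 5-6: representation and anomaly score
            loss = loss_fn(x, x_hat, s)       # Step 7: total loss, Eq. (5)/(11)
            opt.zero_grad()
            loss.backward()                   # Step 8: backpropagate and update
            opt.step()
    return model
```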

4 Experiments

In this section, we empirically evaluate the effectiveness and robustness of the SGMs on both synthetic and real-world datasets.

4.1 Simulation Experiments

The intention of the score-guidance design is to enlarge the disparity between the normal and abnormal samples during the training process, especially for those that fall into the transition field. To verify this, we conduct simulation experiments to observe the change of anomaly scores during model training. We compare our SG-AE model with AutoEncoder (AE) [5] and RDP [43] on three artificial datasets.

Data generation. As illustrated in Figure 4, we generate one one-dimensional dataset and two two-dimensional datasets. The normal data are shown in blue while the abnormal data are in red. We utilize different equations to generate datasets of different complexity. For the one-dimensional scenario, we sample data points from Gaussian distributions in geometric coordinates. For the two-dimensional scenarios, we generate data by controlling two variables in polar coordinates and then transform them into geometric coordinates: one variable follows a Gaussian distribution and the other follows a uniform distribution. The normal and abnormal data are sampled with different means and the same variance. The detailed equations are shown in Figure 4.

Settings. The training and testing sets are independently generated, each with 10,000 data samples. The ratio of abnormal to normal data is one to nine. In the testing sets, we use three threshold values to divide each dataset into four fields $F_1$-$F_4$, as shown in Figure 4. Based on the 3-$\sigma$ rule, the thresholds are chosen from the Gaussian parameters $\mu$ and $\sigma$ of the normal and abnormal distributions. Most of the data in $F_1$ are normal, namely obvious-normal. Similarly, $F_4$ is regarded as obvious-abnormal, while $F_2$ and $F_3$ are two transition fields. In each field, we calculate the difference between the average anomaly scores of the abnormal data and the normal data, that is,

$\Delta s_k = \frac{1}{N_k^a} \sum_{x_i \in F_k^{a}} s_i - \frac{1}{N_k^n} \sum_{x_i \in F_k^{n}} s_i,$   (14)

where $\Delta s_k$ is the score difference in field $F_k$, $N_k^a$ and $N_k^n$ are the numbers of abnormal and normal samples in $F_k$, and $s_i$ is the anomaly score of a sample. The score differences reflect the ability to distinguish between normal and abnormal data. We then use the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) to evaluate the model performance. AE and SG-AE share the same Adam optimizer, learning rate, and fully connected networks. The layer sizes are set to 20 for the one-dimensional dataset and (64, 20) for the two-dimensional datasets. The scoring network in SG-AE is also a fully connected network with layer sizes (20, 1). RDP keeps its originally proposed settings. The batch size and the number of training epochs are 1024 and 100, respectively.
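The per-field score difference of Eq. (14) reduces to a few lines of NumPy; the helper below assumes binary labels (1 for abnormal) and a boolean mask selecting the samples of one field.

```python
import numpy as np

def field_score_difference(scores, labels, field_mask):
    # Eq. (14): mean anomaly score of abnormal samples minus that of normal samples in one field.
    s = np.asarray(scores)[field_mask]
    y = np.asarray(labels)[field_mask]
    return s[y == 1].mean() - s[y == 0].mean()   # assumes both classes are present in the field
```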

Fig. 4: Simulation experiments on three synthetic datasets. $F_1$-$F_4$ represent fields with different ratios of anomaly data. For SG-AE, the score difference between normal and abnormal data in each field gradually expands during training.

Results analysis. As shown in Figure 4, for AE, the score differences decrease during training, suggesting that AE attempts to reconstruct the abnormal data. The continuous decrease of the score differences $\Delta s_2$ and $\Delta s_3$ means that AE starts to fail in distinguishing the normal and abnormal data in the transition fields. RDP uses a boosting process to filter anomalies with an iteration size of 8, so its score-difference curves are jagged and the total number of epochs is 800; the changes of $\Delta s_2$ and $\Delta s_3$ are not obvious. For SG-AE, the score difference in each field increases during training, suggesting that the model reacts differently to normal and abnormal data. Specifically, the abnormal data are guided to larger anomaly scores, while the normal data are guided to smaller ones. More importantly, as expected, the increase of $\Delta s_2$ and $\Delta s_3$ indicates the improvement of the detection ability of the model in the transition fields. Moreover, SG-AE achieves the best AUC-ROC performance on all three datasets. These results demonstrate the effectiveness of the scoring network.

4.2 Evaluation Experiments

4.2.1 Datasets Description

As shown in Table I, we use seven publicly available tabular datasets in our experiments. Two datasets are in the healthcare domain: diagnosis of breast cancer (bcsc) [6] and diabetic detection (diabetic) [41]. Another two datasets are related to attacks in cybersecurity: intrusion [13] and attack [29]. Market includes data of potential subscribers in bank marketing [29], creditcard is a credit-card fraud dataset, and donor is a dataset for selecting valuable projects; details of these last two datasets can be found in [32].

Before Preprocessing / After Preprocessing
Data $N$ $D_{num}$ $D_{cat}$ $R_{dup}$ (%) $R_{mis}$ (%) $N'$ $D'$ $N_{noise}$ $R_a$ (%)
Attack 257673 39 3 0.00 0.00 75965 191 342 25.24
Bcsc 462563 14 0 17.87 0.00 382122 14 2242 2.62
Creditcard 284807 30 0 0.38 0.00 285441 30 1716 0.17
Diabetic 101766 12 31 0.00 3.65 98575 115 522 12.10
Donor 664098 6 16 0.04 12.02 587383 79 3317 6.28
Intrusion 805050 38 3 73.22 0.00 216420 119 820 37.23
Market 45211 7 9 0.00 0.00 45451 51 241 11.70
TABLE I: Statistical information of the datasets. $N$ is the original size. $D_{num}$ and $D_{cat}$ are the numbers of numerical and categorical features. $R_{dup}$ and $R_{mis}$ are the rates of duplicate and missing data. $N'$ and $D'$ are the size and dimension after preprocessing. $N_{noise}$ is the number of injected noise samples. $R_a$ is the rate of anomalies.

Data Preprocessing. We first preprocess the datasets by removing duplicated samples and samples with missing features; for example, 73.22% duplicates are found in intrusion and 12.02% missing data are found in donor. Next, we encode the categorical features with one-hot encoding and standardize each numerical feature. In addition, following [32], we inject noise into the training data to improve robustness: we randomly select 1% of the normal data and swap 5% of the features of these samples. Finally, the data are randomly divided into training, validation, and testing sets with a 6/2/2 ratio. Note that some datasets contain features that leak the anomaly labels. To stay consistent with real applications, we drop one feature in attack and five features in donor to avoid using this supervised-like information.
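The noise-injection step could be sketched as below; the exact swapping scheme of [32] may differ, so treat the swap-with-another-normal-sample strategy as an assumption.

```python
import numpy as np

def inject_noise(X, y, sample_frac=0.01, feature_frac=0.05, seed=0):
    # Pick a fraction of normal samples and swap a fraction of their feature values
    # with those of other randomly chosen normal samples.
    rng = np.random.default_rng(seed)
    X = X.copy()
    normal_idx = np.flatnonzero(y == 0)
    picked = rng.choice(normal_idx, size=max(1, int(sample_frac * len(normal_idx))), replace=False)
    n_feat = max(1, int(feature_frac * X.shape[1]))
    for i in picked:
        feats = rng.choice(X.shape[1], size=n_feat, replace=False)
        donor = rng.choice(normal_idx)
        X[i, feats], X[donor, feats] = X[donor, feats].copy(), X[i, feats].copy()
    return X
```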

4.2.2 Benchmarks and Settings

We compare SG-AE with five benchmarks: iForest [26], RDA [50], DAGMM [51], RDP [43], and RSRAE [23]. We use the originally proposed model structures for iForest, DAGMM, and RDP. For RDA, RSRAE, and SG-AE, we use the same fully connected layer settings in the autoencoder framework to maintain consistency. To adapt the autoencoder framework to the various dataset dimensions $D$, we select the sizes of the fully connected layers in the encoder from (20, 40, 80); for example, size ($D$, 20) for creditcard, size ($D$, 40, 20) for market, and size ($D$, 80, 40, 20) for intrusion. The network sizes in the decoder are symmetric to the encoder. The scoring network in SG-AE uses a fully connected network with sizes (20, 10). In terms of parameter settings, we keep the original recommendations for iForest. To make the deep learning models adaptable to each dataset and arrive at a convincing conclusion, we conduct parameter searching to obtain optimal results, starting from the recommended parameters. Specifically, RDA is tuned with its regularization weight and update step, DAGMM is tuned with its two regularization weights and the number of mixture components, RSRAE is tuned with its two regularization weights and the intrinsic dimension, and RDP is tuned with the node epoch, tree depth, and latent dimension. RDP uses a tree number of 1 rather than its ensemble setting for a fair comparison. SG-AE is tuned with the two weights $\lambda_1$, $\lambda_2$ and the percentile $p$. Additionally, we keep the same training settings, namely the batch size, the number of training epochs, and the learning rate of the Adam optimizer, for all deep learning models. We select the well-trained models on the validation sets. Each training task takes from dozens of minutes to a few hours on one NVIDIA RTX 2080 Ti GPU. The implementation code is publicly released at https://github.com/urbanmobility/SGM.

4.2.3 Performance Evaluation

We use the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) and the Area Under the Precision-Recall Curve (AUC-PR) to evaluate the performance of the reference methods; higher AUC values indicate better performance. The results are averaged over 10 independent runs, in which the data splits are regenerated randomly, to report stable performance. As shown in Table II, our SG-AE outperforms the other methods on all datasets and achieves more than 5% AUC-ROC/AUC-PR improvement over the best baselines on attack, intrusion, and market. The reliance of DAGMM on clean training data leads to performance degradation in the anomaly-contaminated setting. RDP and RDA utilize different strategies to filter anomalies during training and achieve the second- and third-best performance on most datasets. RSRAE, originally designed for image data, achieves unsatisfactory performance, reflecting that it cannot be well adapted to tabular data. The performance improvement of SG-AE comes from the score-guidance strategy, which enhances the score disparity and the capability of the representation learner. Compared with the results in [43], the performance differences are mainly caused by the construction of the datasets; for example, we obtained 22 features from the data sources of donor, whereas [43] takes only 10 features without detailed preprocessing steps. However, a convincing comparison can be made on creditcard, which is a standard dataset already preprocessed in the original data source; on this dataset, iForest and RDP perform similarly in our work and in [43].

AUC-ROC AUC-PR
Datasets iForest DAGMM RDP RDA RSRAE SG-AE iForest DAGMM RDP RDA RSRAE SG-AE
Attack 0.644±0.028 0.762±0.012 0.769±0.016 0.615±0.033 0.681±0.035 0.819±0.009 0.343±0.027 0.457±0.010 0.442±0.020 0.321±0.013 0.533±0.046 0.596±0.024
Bcsc 0.754±0.033 0.672±0.023 0.894±0.003 0.872±0.004 0.840±0.023 0.912±0.008 0.080±0.026 0.053±0.011 0.605±0.024 0.613±0.003 0.137±0.048 0.816±0.004
Creditcard 0.952±0.003 0.833±0.057 0.953±0.005 0.920±0.006 0.942±0.017 0.964±0.004 0.184±0.031 0.100±0.185 0.187±0.032 0.220±0.003 0.158±0.054 0.291±0.072
Diabetic 0.546±0.008 0.514±0.020 0.541±0.041 0.538±0.005 0.477±0.004 0.548±0.007 0.136±0.003 0.128±0.010 0.135±0.017 0.132±0.003 0.117±0.001 0.142±0.004
Donor 0.509±0.005 0.519±0.011 0.518±0.021 0.513±0.001 0.484±0.009 0.541±0.004 0.066±0.001 0.069±0.002 0.067±0.005 0.066±0.000 0.060±0.001 0.075±0.002
Intrusion 0.815±0.012 0.860±0.046 0.863±0.022 0.614±0.027 0.741±0.022 0.906±0.016 0.619±0.017 0.801±0.034 0.666±0.027 0.478±0.012 0.747±0.042 0.904±0.035
Market 0.652±0.016 0.640±0.013 0.685±0.021 0.674±0.006 0.536±0.013 0.750±0.025 0.189±0.011 0.211±0.007 0.227±0.022 0.210±0.006 0.129±0.005 0.270±0.030
TABLE II: AUC-ROC and AUC-PR (mean±std) of SG-AE and the reference models.

Figure 5 presents the score distributions of SG-AE. We can see that the anomaly score distributions of normal data (blue bars) and abnormal data (red bars) are notably separated on most datasets. For a numerical comparison among the typical methods, we use the KS test to measure the distance between the score distributions of normal and abnormal data. The KS test is a statistical hypothesis test that checks whether two samples differ in distribution [12]; a higher KS index indicates a larger score difference. As shown in Table III, the average KS index of SG-AE over the seven datasets is 0.546, surpassing the 0.355 of iForest, 0.372 of AE, 0.336 of RDA, and 0.415 of RDP. This suggests that our scoring network successfully enhances the disparity between normal and abnormal data. A larger KS index correlates with the better performance shown in Table II, especially on the bcsc, creditcard, and intrusion datasets. Taken together, these results demonstrate the effectiveness of the score-guidance design.

Fig. 5: The anomaly score distributions of actual normal and abnormal data on the seven datasets. As expected, SG-AE learns relatively large anomaly scores for the abnormal data (red) and small scores for the normal data (blue).
KS index Attack Bcsc Creditcard Diabetic Donor Intrusion Market
iForest 0.207 0.456 0.805 0.076 0.035 0.717 0.188
AE 0.224 0.089 0.857 0.082 0.086 0.843 0.422
RDA 0.211 0.790 0.782 0.064 0.024 0.214 0.265
RDP 0.346 0.741 0.870 0.103 0.056 0.450 0.340
SG-AE 0.570 0.791 0.882 0.114 0.089 0.864 0.509
TABLE III: KS index between the anomaly score distributions of the normal and abnormal data in various models.
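The KS index reported in Table III is the two-sample Kolmogorov-Smirnov statistic between the two score populations; with SciPy it can be computed as follows.

```python
from scipy.stats import ks_2samp

def ks_index(scores_normal, scores_abnormal):
    # Two-sample Kolmogorov-Smirnov statistic between the anomaly-score distributions
    # of actual normal and abnormal samples; larger means better separated.
    return ks_2samp(scores_normal, scores_abnormal).statistic
```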

To intuitively understand the anomaly detection results, we utilize t-SNE for dimensionality reduction and visualize the seven datasets in a 2D space, as shown in Figure 6. Data samples are colored by their true labels or by the anomaly scores generated by our SG-AE and by RDP, which performs best among the baselines. A light color indicates that a sample cannot be easily distinguished as normal or abnormal; one can select different thresholds to identify the abnormal samples according to the datasets and needs. In Figure 6, we observe that t-SNE segregates the anomalies well in the 2D space on half of the datasets, and SG-AE presents a segregation more similar to the ground truth than RDP, especially on bcsc and intrusion. These results are consistent with the results in Table II, reflecting the anomaly detection capabilities of SG-AE. We also notice that the anomalies are not segregated well on diabetic and donor, indicating the challenge of anomaly detection on these datasets.

Fig. 6: The t-SNE visualization analysis. The three rows refer to the ground truth (GT) and the results of RDP and SG-AE, respectively. Red represents abnormal and blue represents normal. A darker color indicates a larger probability of being normal or abnormal.

4.2.4 Comparison with Variants

In SG-AE, we introduce the score-guidance strategy on learnable anomaly scores with the reconstruction error as a self-supervised signal. To examine the effect of the learnable anomaly score, we remove the scoring network and use the reconstruction error itself as the anomaly score, keeping the regularizer in Eq. (11); that is, the regularizer is used to guide the reconstruction error directly. We denote this variant of SG-AE as SG-AE$_{rec}$. We also test AE, the original autoencoder without the score-guided loss and the scoring network.

We next devise two more variants, SG-AE$_{n}$ and SG-AE$_{ln}$, to examine the score distribution assumption. The anomaly scores appear to follow a normal or lognormal distribution in the statistical histograms displayed in Figure 5. Inspired by this, we assume the anomaly scores follow a normal distribution in SG-AE$_{n}$ and a lognormal distribution in SG-AE$_{ln}$. The outputs of the scoring network are altered to the mean and standard deviation of the scores, denoted by $\mu$ and $\sigma$. We utilize the KL divergence to measure the distribution difference, so the score-guided regularization (10) can be written as follows.

(15)

Table IV compares the performance of the original SG-AE and its variants. The original SG-AE performs best on five datasets. SG-AE$_{n}$ and SG-AE$_{ln}$ have similar performance, slightly worse than SG-AE in most cases. SG-AE$_{rec}$ achieves the best performance on diabetic and donor; however, its performance has large variances across datasets, which is caused by the conflict between the two loss functions: the reconstruction loss tends to reduce the reconstruction error, while the score-guidance loss attempts to enlarge the reconstruction error for anomalies. This conflict complicates the update of the network states. The results of SG-AE and its three variants are all significantly better than AE, which demonstrates the effectiveness of the scoring network.

AUC-ROC AUC-PR
Datasets SG-AE SG-AE$_{n}$ SG-AE$_{ln}$ SG-AE$_{rec}$ AE SG-AE SG-AE$_{n}$ SG-AE$_{ln}$ SG-AE$_{rec}$ AE
Attack 0.819±0.009 0.772±0.06 0.745±0.09 0.639±0.112 0.561±0.037 0.596±0.024 0.436±0.0724 0.412±0.079 0.436±0.106 0.303±0.016
Bcsc 0.912±0.008 0.897±0.006 0.898±0.006 0.625±0.288 0.878±0.010 0.816±0.004 0.781±0.050 0.751±0.078 0.229±0.281 0.617±0.004
Creditcard 0.964±0.004 0.940±0.017 0.928±0.027 0.716±0.158 0.903±0.013 0.291±0.072 0.137±0.060 0.128±0.046 0.084±0.107 0.210±0.004
Diabetic 0.548±0.007 0.540±0.015 0.541±0.013 0.559±0.077 0.535±0.007 0.142±0.004 0.133±0.009 0.138±0.005 0.156±0.043 0.131±0.003
Donor 0.541±0.004 0.532±0.006 0.536±0.006 0.559±0.063 0.486±0.045 0.075±0.002 0.072±0.002 0.073±0.002 0.076±0.012 0.063±0.007
Intrusion 0.906±0.016 0.712±0.077 0.695±0.037 0.748±0.312 0.568±0.036 0.904±0.035 0.546±0.055 0.541±0.024 0.730±0.242 0.461±0.017
Market 0.750±0.025 0.647 0.644±0.013 0.600±0.076 0.677±0.008 0.270±0.030 0.191±0.009 0.187±0.006 0.191±0.065 0.215±0.007
TABLE IV: AUC-ROC and AUC-PR (mean±std) of SG-AE and its variants.

4.2.5 Sensitivity of Parameters

In this part, we examine the sensitivity of SG-AE to different parameters. For convenience, we divide the four parameters into two groups: $\lambda_1$ and $\lambda_2$ in group one, and $a$ and $p$ in group two. We first test group one by fixing $a$ and $p$, and then test group two with $\lambda_1$ and $\lambda_2$ fixed.

Fig. 7: The AUC-ROC values for SG-AE with various sets of parameters on three datasets: attack, creditcard, and market. We first fix $a$ and $p$ for each dataset and train SG-AE with different values of $\lambda_1$ and $\lambda_2$. Then we fix $\lambda_1$ and $\lambda_2$, and train SG-AE with different values of $a$ and $p$.

Figure 7 reports the AUC-ROC values of the two-step testing on three datasets: attack, creditcard, and market. Darker colors indicate better performance. We note that state-of-the-art results can be achieved within two rounds of parameter searching, which indicates that the parameter search is not very complicated or time-consuming. In addition, we find that (i) $\lambda_1$ and $\lambda_2$ have a relatively greater impact; (ii) $p$ can be smaller than the normal data ratio, especially with a little prior knowledge of the dataset; (iii) consistent with the analysis in the methodology, the performance of SG-AE changes only slightly once $a$ is sufficiently large; and (iv) the sensitivity behaves differently across datasets owing to their complexity and characteristics.

4.2.6 Robustness to Different Anomaly Rates

As we intend to learn the disparity between normal and abnormal data using a scoring network, we expect the performance of SG-AE to be robust to the ratio of abnormal data. To validate this expectation, we reorganize the two datasets with high anomaly rates, attack and intrusion, and compare SG-AE with iForest, AE, and RDP on these datasets under different anomaly rates. Specifically, we randomly drop anomaly samples from the training set to adjust the anomaly rate to a range of lower values on attack and intrusion, while the validation and testing sets keep the original anomaly rate.
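Constructing these reduced-rate training sets amounts to subsampling the abnormal class; a possible sketch, with the target rate as a free parameter, is shown below.

```python
import numpy as np

def subsample_to_anomaly_rate(X, y, target_rate, seed=0):
    # Randomly drop abnormal samples (y == 1) until they make up `target_rate`
    # of the resulting set; normal samples are kept unchanged.
    rng = np.random.default_rng(seed)
    normal_idx = np.flatnonzero(y == 0)
    abnormal_idx = np.flatnonzero(y == 1)
    n_keep = min(int(target_rate * len(normal_idx) / (1.0 - target_rate)), len(abnormal_idx))
    kept = rng.choice(abnormal_idx, size=n_keep, replace=False)
    idx = np.concatenate([normal_idx, kept])
    rng.shuffle(idx)
    return X[idx], y[idx]
```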

Figure 8 presents the comparison results under different anomaly rates. SG-AE clearly achieves more stable performance than the other competing models, while AE shows the fastest AUC-ROC degradation on both datasets. The strong sensitivity of AE to the anomaly rate is mainly due to its attempt to reconstruct all input data, so a large fraction of abnormal data hinders the training of the reconstruction model. We also observe that the AUC-ROC values of iForest and RDP drop faster on attack than on intrusion, suggesting that the robustness to anomaly rates also depends on the characteristics of the data.

Fig. 8: The AUC-ROC values of the selected models under different anomaly rates on the attack and intrusion datasets. SG-AE achieves stable AUC-ROC across anomaly rates.

4.3 Extension to SOTA models

To examine the transferability of the scoring network to other models, we evaluate the performance of three extended models on the seven tabular datasets.

Settings. We extend the scoring network to three state-of-the-art models: DAGMM, RDA, and RSRAE. The extended models are named score-guided DAGMM (SG-DAGMM), score-guided RDA (SG-RDA), and score-guided RSRAE (SG-RSRAE). For the input of the scoring network, SG-DAGMM takes the same input as the compression network in DAGMM, SG-RDA takes the output of the encoder, and SG-RSRAE takes the output of the RSR layer. The datasets are configured in the same way as in the SG-AE experiments. We also search parameters to adapt the SG-Models to these datasets; specifically, SG-RDA is tuned with its original hyperparameters together with the score-guidance parameters, while SG-DAGMM and SG-RSRAE are tuned with their two original regularization weights together with the score-guidance parameters.

Performance evaluation. The AUC-ROC and AUC-PR results are shown in Table V. Compared with the original models, the SG-Models achieve varying degrees of improvement on most datasets, suggesting that the score-guidance strategy can better handle contaminated data. In addition, SG-DAGMM obtains the most obvious performance improvements and surpasses SG-AE on some datasets, mainly because the scoring network enables DAGMM to deal with anomaly contamination and produces better data representations for the Gaussian Mixture Model. These experimental results not only demonstrate the effectiveness of the scoring network design but also show that the scoring network can be incorporated into different unsupervised anomaly detection models.

AUC-ROC AUC-PR
Datasets SG-AE SG-RDA SG-RSRAE SG-DAGMM SG-AE SG-RDA SG-RSRAE SG-DAGMM
Attack 0.819±0.009 0.816±0.016 (32.7%) 0.771±0.064 (13.2%) 0.783±0.044 (44.4%) 0.596±0.024 0.534±0.027 (66.4%) 0.529±0.096 (-0.8%) 0.660±0.035 (2.8%)
Bcsc 0.912±0.008 0.905±0.004 (3.8%) 0.868±0.043 (3.3%) 0.785±0.117 (16.8%) 0.816±0.004 0.805±0.001 (31.3%) 0.510±0.219 (272.3%) 0.421±0.380 (694.3%)
Creditcard 0.964±0.004 0.940±0.010 (2.2%) 0.930±0.019 (-1.3%) 0.915±0.021 (9.8%) 0.291±0.072 0.202±0.083 (-7.3%) 0.222±0.100 (40.5%) 0.494±0.169 (394.0%)
Diabetic 0.548±0.007 0.561±0.027 (4.3%) 0.545±0.012 (14.3%) 0.581±0.066 (13.0%) 0.142±0.004 0.143±0.013 (8.3%) 0.135±0.005 (15.4%) 0.168±0.049 (31.3%)
Donor 0.541±0.004 0.540±0.007 (5.3%) 0.597±0.059 (23.3%) 0.568±0.049 (9.4%) 0.075±0.002 0.078±0.003 (18.2%) 0.081±0.011 (35.0%) 0.076±0.005 (10.1%)
Intrusion 0.906±0.016 0.899±0.019 (46.4%) 0.808±0.144 (9.0%) 0.976±0.003 (13.5%) 0.904±0.035 0.758±0.030 (58.6%) 0.736±0.188 (-1.5%) 0.971±0.003 (21.2%)
Market 0.750±0.025 0.747±0.033 (10.8%) 0.733±0.030 (36.8%) 0.698±0.003 (9.1%) 0.270±0.030 0.251±0.031 (19.5%) 0.238±0.033 (84.5%) 0.203±0.011 (-3.8%)
TABLE V: AUC-ROC and AUC-PR (mean±std) of the SG-SOTA models. The percentages in parentheses are relative improvements over the corresponding original models.

4.4 Extension to document and image tasks

To examine the transferability to different tasks, we compare SG-AE with iForest, DAGMM, RDA, RSRAE, and RDP on document and image datasets.

Settings. For the document task, we utilize two datasets, news and reuters [23]. News involves newsgroup documents with 20 different labels, and reuters contains 5 text categories. Following the preprocessing steps in [23], news and reuters are randomly subsampled to 360 documents per class, and the documents are embedded into fixed-length vectors. In each round of testing, we choose one class as the normal data in turn and randomly select abnormal samples from the other classes. By collecting different numbers of abnormal data, we compare SG-AE with the baselines on the document datasets under varying anomaly rates. For the image task, we take the mnist dataset as an example [50]; it consists of 5,124 instances, where the normal data are images of the digit "4" and the abnormal data are other digits. For each task, we average the AUC-ROC and AUC-PR values over 10 independent runs. The training and testing sets are split 8/2. The layer sizes of the autoencoder framework are [1024, 256, 64, 20] for the document tasks and [128, 64, 32] for the image task. The batch size is set to 32 and the other settings are consistent with the previous experiments. We also conduct parameter searching to achieve nearly optimal results.

Fig. 9: AUC-ROC and AUC-PR values on document data.

Performance evaluation. Figure 9 illustrates the results for the document datasets. SG-AE achieves the best performance on reuters. On news, SG-AE performs slightly worse than RSRAE and RDA at smaller anomaly rates but better at larger anomaly rates. This might be due to the different granularity in handling anomalies: RSRAE and RDA filter abnormal parts of each sample at the feature level, which is effective for datasets with large feature correlations (such as document or image datasets), while SG-AE guides the sample distribution at the data level. In fact, the feature-level and data-level strategies are not in conflict, and their combination can be studied in future work. The results for the image dataset are shown in Table VI. Similar to the document datasets, SG-AE has competitive AUC-ROC but performs worse than RSRAE in terms of AUC-PR. We therefore further evaluate SG-RSRAE, which improves over RSRAE on both AUC-ROC (0.939 to 0.951) and AUC-PR (0.615 to 0.736). These results confirm the effectiveness of the score-guidance strategy on different tasks.

Metrics AUC-ROC AUC-PR
iForest 0.873±0.014 0.372±0.036
DAGMM 0.758±0.040 0.235±0.036
RDP 0.888±0.024 0.475±0.065
RDA 0.912±0.010 0.556±0.038
RSRAE 0.939±0.007 0.615±0.026
SG-AE 0.939±0.005 0.563±0.015
SG-RSRAE 0.951±0.010 0.736±0.062
TABLE VI: AUC-ROC and AUC-PR values on mnist dataset.

5 Conclusion and future work

Targeting the unsupervised anomaly detection task, this work devised an effective scoring network with score-guided regularization. The scoring network learns the disparity between normal and abnormal data, and this disparity is gradually enhanced during training. The scoring network can be integrated into different unsupervised anomaly detection methods. We proposed a representative instantiation that incorporates the scoring network into an autoencoder framework, namely the score-guided autoencoder (SG-AE). We first conducted experiments on synthetic datasets to examine the effectiveness of the score-guidance strategy; the results show that the anomaly score disparity between normal and abnormal data continues to expand during training, especially in the transition field. We then conducted comprehensive experiments on seven tabular datasets, which suggest that the proposed SG-AE is competitive with state-of-the-art methods in terms of both AUC values and the Kolmogorov-Smirnov statistical index. We also analyzed the sensitivity of the parameters introduced by the scoring network and found that different datasets can share similar configurations. In addition, as the scoring network learns the disparity between normal and abnormal data, we expect SG-AE to work well even with a large fraction of abnormal data; testing SG-AE on two datasets with varying anomaly rates indicates that it is more robust to the anomaly rate than the other three baselines. To demonstrate the transferability of the scoring network across unsupervised methods, we applied it to three state-of-the-art methods, and the experimental results confirm the performance improvements. Moreover, we applied the scoring network to two document datasets and one image dataset to show its transferability to different anomaly detection tasks; the results show that the SG-Models are comparable to the state-of-the-art methods, especially on datasets with large anomaly rates.

There are several improvements and potential extensions that merit further study: (1) Balancing the effects of several loss terms is challenging when applying the scoring network to a method with multiple loss functions. A potential solution is to utilize multi-objective optimization techniques [22, 27], which attempt to balance the trade-offs between several objectives. Hyperparameter tuning techniques [39, 28], which map hyperparameters into the loss function and optimize them during training, are also worth considering to deal with the problem of too many hyperparameters. (2) The score-guidance strategy deserves to be further applied to convolutional or sequential frameworks to evaluate the performance changes in various anomaly detection tasks with image, time-series, or graph datasets.

Acknowledgments

This work was jointly supported by the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), the Science and Technology Commission of Shanghai Municipality Project (2051102600), and the National Key Research and Development Program of China (2020YFC2008701).

References

  • [1] C. C. Aggarwal (2015) Outlier analysis. In Data mining, pp. 237–263. Cited by: §2.1.
  • [2] M. Ahmed, N. Choudhury, and S. Uddin (2017) Anomaly detection on big data in financial markets. In 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 998–1001. Cited by: §1.
  • [3] J. An and S. Cho (2015) Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE 2 (1), pp. 1–18. Cited by: §2.1.
  • [4] D. Arpit, Y. Zhou, H. Ngo, and V. Govindaraju (2016) Why regularized auto-encoders learn sparse representation?. In International Conference on Machine Learning, pp. 136–144. Cited by: §2.2.
  • [5] P. Baldi (2012) Autoencoders, unsupervised learning, and deep architectures. In Proceedings of ICML workshop on unsupervised and transfer learning, pp. 37–49. Cited by: §2.1, §4.1.
  • [6] W. E. Barlow, E. White, R. Ballard-Barbash, P. M. Vacek, L. Titus-Ernstoff, P. A. Carney, J. A. Tice, D. S. Buist, B. M. Geller, R. Rosenberg, et al. (2006) Prospective breast cancer risk prediction model for women undergoing screening mammography. Journal of the National Cancer Institute 98 (17), pp. 1204–1214. Cited by: §4.2.1.
  • [7] L. Bergman, N. Cohen, and Y. Hoshen (2020) Deep nearest neighbor anomaly detection. arXiv preprint arXiv:2002.10445. Cited by: §2.1.
  • [8] L. Bergman and Y. Hoshen (2020) Classification-based anomaly detection for general data. arXiv preprint arXiv:2005.02359. Cited by: §1, §2.1, §2.3.
  • [9] M. M. Breunig, H. Kriegel, R. T. Ng, and J. Sander (2000) LOF: identifying density-based local outliers. In ACM International Conference on Management of Data, pp. 93–104. Cited by: §2.1.
  • [10] R. Chalapathy and S. Chawla (2019) Deep learning for anomaly detection: a survey. arXiv preprint arXiv:1901.03407. Cited by: §1.
  • [11] R. Chalapathy, A. K. Menon, and S. Chawla (2017) Robust, deep and inductive anomaly detection. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 36–51. Cited by: §2.2.
  • [12] D. M. dos Reis, P. Flach, S. Matwin, and G. Batista (2016) Fast unsupervised online drift detection using incremental kolmogorov-smirnov test. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1545–1554. Cited by: §4.2.3.
  • [13] D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §4.2.1.
  • [14] E. Eskin (2000) Anomaly detection over noisy data using learned probability distributions. In Proceedings of the International Conference on Machine Learning, Cited by: §2.1.
  • [15] F. Falcão, T. Zoppi, C. B. V. Silva, A. Santos, B. Fonseca, A. Ceccarelli, and A. Bondavalli (2019) Quantitative comparison of unsupervised anomaly detection algorithms for intrusion detection. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, pp. 318–327. Cited by: §1.
  • [16] I. Golan and R. El-Yaniv (2018) Deep anomaly detection using geometric transformations. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 9781–9791. Cited by: §1, §2.3.
  • [17] S. Goyal, A. Raghunathan, M. Jain, H. V. Simhadri, and P. Jain (2020) DROCC: deep robust one-class classification. In International Conference on Machine Learning, pp. 3711–3721. Cited by: §1, §2.1, §2.3.
  • [18] X. Gu, L. Akoglu, and A. Rinaldo (2019) Statistical analysis of nearest neighbor methods for anomaly detection. arXiv preprint arXiv:1907.03813. Cited by: §2.1.
  • [19] S. Guha, N. Mishra, G. Roy, and O. Schrijvers (2016) Robust random cut forest based anomaly detection on streams. In International conference on machine learning, pp. 2712–2721. Cited by: §2.1.
  • [20] D. Hendrycks, M. Mazeika, and T. Dietterich (2018) Deep anomaly detection with outlier exposure. In ICLR, Cited by: §2.1.
  • [21] J. Kim and C. D. Scott (2012) Robust kernel density estimation. The Journal of Machine Learning Research 13 (1), pp. 2529–2565. Cited by: §2.1.
  • [22] M. Konakovic Lukovic, Y. Tian, and W. Matusik (2020) Diversity-guided multi-objective bayesian optimization with batch evaluations. Advances in Neural Information Processing Systems 33. Cited by: §5.
  • [23] C. Lai, D. Zou, and G. Lerman (2020) Robust subspace recovery layer for unsupervised anomaly detection. In ICLR, Cited by: §1, §2.1, §2.2, §2.3, §3.1, §3.2, §3.2, §4.2.2, §4.4.
  • [24] D. Li, D. Chen, J. Goh, and S. Ng (2018) Anomaly detection with generative adversarial networks for multivariate time series. arXiv preprint arXiv:1809.04758. Cited by: §2.1.
  • [25] L. Li, J. Yan, H. Wang, and Y. Jin (2020) Anomaly detection of time series with smoothness-inducing sequential variational auto-encoder. IEEE transactions on neural networks and learning systems. Cited by: §1, §2.1.
  • [26] F. T. Liu, K. M. Ting, and Z. Zhou (2008) Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422. Cited by: §2.1, §4.2.2.
  • [27] L. Liu, Y. Li, Z. Kuang, J. Xue, Y. Chen, W. Yang, Q. Liao, and W. Zhang (2021) Towards impartial multi-task learning. In International Conference on Learning Representations, Cited by: §5.
  • [28] M. Mackay, P. Vicol, J. Lorraine, D. Duvenaud, and R. Grosse (2019) Self-tuning networks: bilevel optimization of hyperparameters using structured best-response functions. In International Conference on Learning Representations, Cited by: §5.
  • [29] N. Moustafa and J. Slay (2015) UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In 2015 Military Communications and Information Systems Conference (MilCIS), pp. 1–6. Cited by: §4.2.1.
  • [30] G. Pang, L. Cao, L. Chen, and H. Liu (2018) Learning representations of ultrahigh-dimensional data for random distance-based outlier detection. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2041–2050. Cited by: §2.1, §2.3.
  • [31] G. Pang, C. Shen, L. Cao, and A. v. d. Hengel (2020) Deep learning for anomaly detection: a review. arXiv preprint arXiv:2007.02500. Cited by: §1.
  • [32] G. Pang, C. Shen, and A. van den Hengel (2019) Deep anomaly detection with deviation networks. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 353–362. Cited by: §2.3, §3.2, §4.2.1, §4.2.1.
  • [33] S. Qian, G. Li, W. Cao, C. Liu, S. Wu, and H. Wong (2019) Improving representation learning in autoencoders via multidimensional interpolation and dual regularizations. In IJCAI, pp. 3268–3274. Cited by: §2.2.
  • [34] S. Ramaswamy, R. Rastogi, and K. Shim (2000) Efficient algorithms for mining outliers from large data sets. In ACM International Conference on Management of Data, pp. 427–438. Cited by: §2.1.
  • [35] L. Ruff, J. R. Kauffmann, R. A. Vandermeulen, G. Montavon, W. Samek, M. Kloft, T. G. Dietterich, and K. Müller (2020) A unifying review of deep and shallow anomaly detection. arXiv preprint arXiv:2009.11732. Cited by: §1, §2.1, §2.2.
  • [36] L. Ruff, R. Vandermeulen, N. Goernitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft (2018) Deep one-class classification. In International conference on machine learning, pp. 4393–4402. Cited by: §1, §2.1, §2.3.
  • [37] M. Sabokrou, M. Khalooei, M. Fathy, and E. Adeli (2018) Adversarially learned one-class classifier for novelty detection. In CVPR, pp. 3379–3388. Cited by: §2.1.
  • [38] T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs (2017) Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International conference on information processing in medical imaging, pp. 146–157. Cited by: §2.1, §2.3.
  • [39] I. Shavitt and E. Segal (2018) Regularization learning networks: deep learning for tabular datasets. In Advances in Neural Information Processing Systems, pp. 1379–1389. Cited by: §5.
  • [40] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §2.2.
  • [41] B. Strack, J. P. DeShazo, C. Gennings, J. L. Olmo, S. Ventura, K. J. Cios, and J. N. Clore (2014) Impact of HbA1c measurement on hospital readmission rates: analysis of 70,000 clinical database patient records. BioMed Research International. Cited by: §4.2.1.
  • [42] K. Tian, S. Zhou, J. Fan, and J. Guan (2019) Learning competitive and discriminative reconstructions for anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5167–5174. Cited by: §2.1.
  • [43] H. Wang, G. Pang, C. Shen, and C. Ma (2020) Unsupervised representation learning by predicting random distances. In International Joint Conferences on Artificial Intelligence, Cited by: §2.1, §2.3, §4.1, §4.2.2, §4.2.3.
  • [44] Y. Wen, S. Li, and K. Jia (2020) Towards understanding the regularization of adversarial robustness on neural networks. In International Conference on Machine Learning, pp. 10225–10235. Cited by: §2.2.
  • [45] W. Wong, A. Moore, G. Cooper, and M. Wagner (2002) Rule-based anomaly pattern detection for detecting disease outbreaks. In AAAI/IAAI, pp. 217–223. Cited by: §1.
  • [46] H. Xu, W. Chen, N. Zhao, Z. Li, J. Bu, Z. Li, Y. Liu, Y. Zhao, D. Pei, Y. Feng, et al. (2018) Unsupervised anomaly detection via variational auto-encoder for seasonal KPIs in web applications. In Proceedings of the 2018 World Wide Web Conference, pp. 187–196. Cited by: §2.1.
  • [47] H. Xu, D. Luo, R. Henao, S. Shah, and L. Carin (2020) Learning autoencoders with relational regularization. In International Conference on Machine Learning, pp. 10576–10586. Cited by: §2.2.
  • [48] W. Yu, C. Zheng, W. Cheng, C. C. Aggarwal, D. Song, B. Zong, H. Chen, and W. Wang (2018) Learning deep network representations with adversarially regularized autoencoders. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2663–2671. Cited by: §2.2.
  • [49] S. Zhai, Y. Cheng, W. Lu, and Z. Zhang (2016) Deep structured energy based models for anomaly detection. In The 33rd International Conference on Machine Learning, pp. 1100–1109. Cited by: §2.1.
  • [50] C. Zhou and R. C. Paffenroth (2017) Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 665–674. Cited by: §1, §2.1, §2.2, §2.3, §3.2, §4.2.2, §4.4.
  • [51] B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen (2018) Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In ICLR, Cited by: §1, §2.1, §2.2, §2.3, §3.1, §3.2, §4.2.2.