
Supervised Anomaly Detection based on Deep Autoregressive Density Estimators

04/12/2019
by Tomoharu Iwata, et al.

We propose a supervised anomaly detection method based on neural density estimators, where the negative log likelihood is used as the anomaly score. Density estimators have been widely used for unsupervised anomaly detection, and recent advances in deep learning have greatly improved density estimation performance. However, neural density estimators cannot exploit anomaly label information, which would be valuable for improving anomaly detection performance. The proposed method effectively utilizes the anomaly label information by training the neural density estimator so that the likelihood of normal instances is maximized and the likelihood of anomalous instances is lower than that of normal instances. We employ an autoregressive model for the neural density estimator, which enables us to calculate the likelihood exactly. In experiments using 16 datasets, we demonstrate that the proposed method improves anomaly detection performance with a few labeled anomalous instances, and achieves better performance than existing unsupervised and supervised anomaly detection methods.


1 Introduction

Anomaly detection, the task of finding anomalous instances in a dataset, is an important problem in artificial intelligence. It has been used in a wide variety of applications [Chandola, Banerjee, and Kumar2009, Patcha and Park2007, Hodge and Austin2004], such as network intrusion detection for cyber-security [Dokas et al.2002, Yamanishi et al.2004], fraud detection for credit cards [Aleskerov, Freisleben, and Rao1997], defect detection of industrial machines [Fujimaki, Yairi, and Machida2005, Idé and Kashima2004], and disease outbreak detection [Wong et al.2003].

Anomalies, which are also called outliers, are instances that rarely occur. It is therefore natural to consider instances in low probability density regions to be anomalous, and many density estimation based anomaly detection methods have been proposed [Barnett and Lewis1974, Parra, Deco, and Miesbach1996, Yeung and Ding2003]. With recent advances in deep learning, density estimation performance has been greatly improved by neural network based density estimators, such as variational autoencoders (VAE) [Kingma and Welling2013], flow-based generative models [Dinh, Krueger, and Bengio2014, Dinh, Sohl-Dickstein, and Bengio2016, Kingma and Dhariwal2018], and autoregressive models [Uria, Murray, and Larochelle2013, Raiko et al.2014, Germain et al.2015, Uria et al.2016]. The VAE has been used for anomaly detection [An and Cho2015, Suh et al.2016, Xu et al.2018].

In some situations, label information, which indicates whether each instance is anomalous or normal, is available [Görnitz et al.2013]. The label information is valuable for improving anomaly detection performance. However, existing neural network based density estimation methods cannot exploit it. To use the anomaly label information, supervised classifiers, such as nearest neighbor methods [Singh and Silakari2009], support vector machines [Mukkamala, Sung, and Ribeiro2005], and feed-forward neural networks [Rapaka, Novokhodko, and Wunsch2003], have been used. However, these standard supervised classifiers do not perform well when labeled anomalous instances are very few, which is often the case since anomalous instances rarely occur by definition.

In this paper, we propose a neural density estimator based anomaly detection method that can exploit label information. The proposed method performs well even when only a few labeled anomalous instances are given, since it is based on a density estimator, which works without any labeled anomalous instances. We employ the negative log probability of an instance as its anomaly score. For the density function used to calculate the probability, we use neural autoregressive models [Uria et al.2016, Germain et al.2015]. The autoregressive models can compute the probability density of a test instance exactly, whereas the VAE only approximates a lower bound of the probability density. Moreover, autoregressive models have achieved higher density estimation performance than other neural density estimators, such as the VAE and flow-based generative models [Dinh, Sohl-Dickstein, and Bengio2016].

The density function is trained so that the probability density of normal instances becomes high, as in standard maximum likelihood estimation. In addition, we would like the density function to assign lower probability density to anomalous instances than to normal instances. To achieve this, we introduce a regularization term calculated from the log likelihood ratio between normal and anomalous instances. Since our objective function is differentiable, the density function can be estimated efficiently using stochastic gradient-based optimization methods.

Figure 1 illustrates anomaly scores with an unsupervised density estimation based anomaly detection method (a), a supervised binary classifier based anomaly detection method (b), and the proposed method (c). The unsupervised method considers only normal instances, and the anomaly score is low where normal instances are located. Since it cannot exploit information on anomalous instances, the anomaly score cannot be increased even where anomalous instances are closely located. In this example, it succeeds in detecting the test anomalous instances at the far left and far right, but fails to detect the test anomalous instance at the center, where normal instances are closely located. The supervised method considers both normal and anomalous instances, and places a decision boundary between them. It can detect the test anomalous instance at the center, since an observed anomalous instance exists in the same region. However, it cannot detect the test anomalous instances at both ends, since they lie on the normal side of the decision boundary. With the proposed method, the anomaly score is high in regions where normal instances are absent as well as in regions where anomalous instances are located. Therefore, it can detect all of the test anomalous instances in this example.


(a) Unsupervised anomaly detection (b) Supervised anomaly detection (c) Proposed method
Figure 1: Examples of anomaly scores with an unsupervised density estimation based anomaly detection method (a), a supervised binary classifier based anomaly detection method (b), and the proposed method (c). The white triangle represents an observed normal instance, the white circle represents an observed anomalous instance, and the black circle represents a test anomalous instance, which is not observed in training. The horizontal axis is the one-dimensional attribute space, and the vertical axis is the anomaly score, which increases in the downward direction. The check and cross marks indicate whether or not the method detects each test anomalous instance successfully.

The remainder of the paper is organized as follows. In Section 2, we define our task and propose our method for supervised anomaly detection based on neural autoregressive density estimators. In Section 3, we briefly review related work. In Section 4, we demonstrate the effectiveness of the proposed method using various datasets. Finally, we present concluding remarks and a discussion of future work in Section 5.

2 Proposed method

Task

Suppose that we have a dataset $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, where $\mathbf{x}_n \in \mathbb{R}^{D}$ is the $D$-dimensional attribute vector of the $n$th instance, and $y_n \in \{0, 1\}$ is its anomaly label, i.e., $y_n = 1$ if the instance is anomalous and $y_n = 0$ if it is not anomalous, or normal. Our task is to estimate the anomaly score of unseen instances $\mathbf{x}$, where the anomaly score of anomalous instances is high and that of normal instances is low.

Anomaly score

Anomalous instances occur rarely, while normal instances occur frequently. The proposed method therefore uses the following negative log probability as the anomaly score of instance $\mathbf{x}$,

$$a(\mathbf{x}; \boldsymbol{\theta}) = -\log p(\mathbf{x}; \boldsymbol{\theta}), \qquad (1)$$

where $\boldsymbol{\theta}$ denotes the parameters of the density function.

Density model

For the density function $p(\mathbf{x}; \boldsymbol{\theta})$, we use the deep masked autoencoder density estimator (MADE) [Germain et al.2015], which is a neural autoregressive model. The probability distribution can always be decomposed into the product of nested conditional distributions using the probability product rule as follows,

$$p(\mathbf{x}; \boldsymbol{\theta}) = \prod_{d=1}^{D} p(x_d \mid \mathbf{x}_{<d}; \boldsymbol{\theta}), \qquad (2)$$

where $\mathbf{x}_{<d} = (x_1, \dots, x_{d-1})$ is the attribute vector before the $d$th attribute.

We model the conditional distribution with the following Gaussian mixture,

$$p(x_d \mid \mathbf{x}_{<d}; \boldsymbol{\theta}) = \sum_{k=1}^{K} \pi_{dk}(\mathbf{x}_{<d}) \, \mathcal{N}\bigl(x_d \mid \mu_{dk}(\mathbf{x}_{<d}), \sigma_{dk}^{2}(\mathbf{x}_{<d})\bigr), \qquad (3)$$

where $K$ is the number of mixture components, $\mathcal{N}(\cdot \mid \mu, \sigma^{2})$ is the Gaussian distribution with mean $\mu$ and variance $\sigma^{2}$, and $\pi_{dk}$, $\mu_{dk}$, and $\sigma_{dk}^{2}$ are the neural networks that define the mixture weight, mean, and variance of the $k$th mixture component for the $d$th attribute, respectively, with $\pi_{dk} \geq 0$, $\sum_{k=1}^{K} \pi_{dk} = 1$, and $\sigma_{dk}^{2} > 0$.
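To make Eq. (3) concrete, below is a minimal PyTorch-style sketch, not the authors' code, of one conditional log density, assuming the masked network emits unnormalized mixture logits, means, and log variances for the $d$th attribute:

```python
import math
import torch
import torch.nn.functional as F

def conditional_log_density(x_d, logits, mu, log_var):
    """Log density of Eq. (3): a K-component Gaussian mixture over the d-th
    attribute. logits, mu, log_var have shape (batch, K) and, through the
    masks, depend only on x_{<d}; x_d has shape (batch,)."""
    log_pi = F.log_softmax(logits, dim=-1)  # weights satisfy the simplex constraint
    var = log_var.exp()                     # parameterizes sigma^2 > 0
    # Per-component Gaussian log density: log N(x_d | mu_k, sigma_k^2).
    log_norm = -0.5 * (torch.log(2 * math.pi * var)
                       + (x_d.unsqueeze(-1) - mu) ** 2 / var)
    # log sum_k pi_k N(...), computed stably with logsumexp.
    return torch.logsumexp(log_pi + log_norm, dim=-1)
```

Under the factorization in Eq. (2), $\log p(\mathbf{x}; \boldsymbol{\theta})$ is then the sum of such conditional log densities over $d = 1, \dots, D$.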

When the attribute $x_d$ is binary, we use the following Bernoulli distribution,

$$p(x_d \mid \mathbf{x}_{<d}; \boldsymbol{\theta}) = \mu_{d}(\mathbf{x}_{<d})^{x_d} \bigl(1 - \mu_{d}(\mathbf{x}_{<d})\bigr)^{1 - x_d}, \qquad (4)$$

where $\mu_{d}$ is the neural network that outputs the probability of $x_d$ being one. Similarly, Poisson and Gamma distributions with parameters modeled by neural networks can be used in the cases of non-negative integers and non-negative continuous values, respectively.

With the deep MADE, the conditional densities of different attributes are defined by deep autoencoders with masks, so that the conditional density function for the $d$th attribute depends only on the preceding attributes $\mathbf{x}_{<d}$ and not on the remaining attributes $x_d, \dots, x_D$. The MADE is more efficient than other autoregressive models, since the masked autoencoder evaluates all of the conditional densities in a single forward pass.
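As an illustration of the masking scheme, here is a small NumPy sketch of MADE-style mask construction following the degree-assignment rule of Germain et al.; the layer sizes and seed handling are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def made_masks(D, hidden_sizes, seed=0):
    """Binary masks so that the output unit for attribute d depends only on
    inputs x_{<d} (assumes D >= 2). Returns one (fan_in, fan_out) mask per
    weight matrix."""
    rng = np.random.default_rng(seed)
    # Degrees: inputs get 1..D, hidden units get random values in {1,...,D-1}.
    degrees = [np.arange(1, D + 1)]
    for h in hidden_sizes:
        degrees.append(rng.integers(1, D, size=h))
    masks = []
    for d_in, d_out in zip(degrees[:-1], degrees[1:]):
        # A hidden unit may see a preceding unit iff its degree is >= that unit's.
        masks.append((d_out[None, :] >= d_in[:, None]).astype(np.float32))
    # Output mask is strict: the unit for attribute d sees only degrees < d.
    out_degrees = np.arange(1, D + 1)
    masks.append((out_degrees[None, :] > degrees[-1][:, None]).astype(np.float32))
    return masks
```

Each mask multiplies the corresponding weight matrix element-wise; resampling the hidden degrees and the input ordering yields the multiple masks and orderings used in the experiments below.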

Note that our framework can also use other density estimators, such as the VAE and flow-based generative models, as well as autoencoders, where the reconstruction error is used as the anomaly score.

Objective function

Let $\mathcal{S}$ be the set of indexes of all the given instances, $\mathcal{A} = \{n \mid y_n = 1\}$ be the set of indexes of anomalous instances, and $\mathcal{B} = \{n \mid y_n = 0\}$ be the set of indexes of normal instances. The anomaly scores of anomalous instances should be higher than those of normal instances as follows,

$$a(\mathbf{x}_n; \boldsymbol{\theta}) > a(\mathbf{x}_m; \boldsymbol{\theta}) \quad \text{for all } n \in \mathcal{A}, \; m \in \mathcal{B}. \qquad (5)$$

In addition, the following log likelihood of the normal instances should be high,

$$L_{\mathcal{B}}(\boldsymbol{\theta}) = \frac{1}{|\mathcal{B}|} \sum_{m \in \mathcal{B}} \log p(\mathbf{x}_m; \boldsymbol{\theta}), \qquad (6)$$

since the anomaly score, which is defined by the negative log likelihood, of the normal instances should be low. Here, $|\cdot|$ represents the number of elements in a set.

We would like to maximize Eq. (6) while satisfying the constraints in Eq. (5) as much as possible. To make the objective function differentiable with respect to the parameters and free of hard constraints, we relax the constraints in Eq. (5) into a soft regularization term as follows,

$$L(\boldsymbol{\theta}) = \frac{1}{|\mathcal{B}|} \sum_{m \in \mathcal{B}} \log p(\mathbf{x}_m; \boldsymbol{\theta}) + \frac{\lambda}{|\mathcal{A}||\mathcal{B}|} \sum_{n \in \mathcal{A}} \sum_{m \in \mathcal{B}} \sigma\bigl(\log p(\mathbf{x}_m; \boldsymbol{\theta}) - \log p(\mathbf{x}_n; \boldsymbol{\theta})\bigr), \qquad (7)$$

where $\sigma$ is the sigmoid function,

$$\sigma(z) = \frac{1}{1 + \exp(-z)}, \qquad (8)$$

and $\lambda \geq 0$ is a hyperparameter. Figure 2 shows the regularization term with respect to the log likelihood ratio. When the anomaly score of an anomalous instance is much higher than that of a normal instance, $\log p(\mathbf{x}_m; \boldsymbol{\theta}) - \log p(\mathbf{x}_n; \boldsymbol{\theta}) \to \infty$, the sigmoid function takes its maximum value of one. When the anomaly score of an anomalous instance is much lower than that of a normal instance, $\log p(\mathbf{x}_m; \boldsymbol{\theta}) - \log p(\mathbf{x}_n; \boldsymbol{\theta}) \to -\infty$, the sigmoid function takes its minimum value of zero. Therefore, maximizing this regularization term moves the parameters so as to satisfy the constraints in Eq. (5). We maximize the objective function (7) with a gradient-based optimization method, such as ADAM [Kingma and Ba2014].
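The objective in Eq. (7) is straightforward to implement with automatic differentiation. The following is a minimal PyTorch-style sketch, not the authors' code, assuming a `log_prob` function that returns $\log p(\mathbf{x}; \boldsymbol{\theta})$ for a batch of instances:

```python
import torch

def objective(log_prob, x_normal, x_anomaly, lam):
    """Objective (7): average log likelihood of normal instances (Eq. (6))
    plus the sigmoid relaxation of the ranking constraints in Eq. (5)."""
    ll_normal = log_prob(x_normal)    # shape: (num_normal,)
    ll_anomaly = log_prob(x_anomaly)  # shape: (num_anomaly,)
    likelihood_term = ll_normal.mean()
    # Sigmoid of the log likelihood ratio for every (anomalous, normal)
    # pair; close to one when log p(normal) >> log p(anomalous).
    ratio = ll_normal.unsqueeze(0) - ll_anomaly.unsqueeze(1)
    reg_term = torch.sigmoid(ratio).mean()
    return likelihood_term + lam * reg_term

# Maximization with ADAM (by minimizing the negative objective):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss = -objective(model.log_prob, x_normal, x_anomaly, lam)
# loss.backward(); optimizer.step()
```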

When there are no labeled anomalous instances, $\mathcal{A} = \emptyset$, or when $\lambda = 0$, the regularization term vanishes and only the likelihood term remains, so the objective function is the same as that of standard density estimation. Therefore, the proposed method can be seen as a generalization of unsupervised density estimation based anomaly detection methods.

Figure 2: Regularization term of the objective function (7) with respect to the log likelihood ratio $\log p(\mathbf{x}_m; \boldsymbol{\theta}) - \log p(\mathbf{x}_n; \boldsymbol{\theta})$.

The regularization term can be seen as a smoothed version of the area under the receiver operating characteristic (ROC) curve (AUC) [Yan et al.2003], since the AUC is computed by

$$\mathrm{AUC} = \frac{1}{|\mathcal{A}||\mathcal{B}|} \sum_{n \in \mathcal{A}} \sum_{m \in \mathcal{B}} I\bigl(a(\mathbf{x}_n; \boldsymbol{\theta}) > a(\mathbf{x}_m; \boldsymbol{\theta})\bigr), \qquad (9)$$

where $I(\cdot)$ is the indicator function, i.e., $I(z) = 1$ if $z$ is true and $I(z) = 0$ otherwise, and the sigmoid function in Eq. (7) is a smooth approximation of this indicator.
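The correspondence can be checked numerically: replacing the indicator in Eq. (9) with the sigmoid recovers the regularization term, as in the illustrative NumPy sketch below, where the scores are anomaly scores $a(\mathbf{x}) = -\log p(\mathbf{x}; \boldsymbol{\theta})$:

```python
import numpy as np

def auc_exact(scores_anomaly, scores_normal):
    """Eq. (9): fraction of (anomalous, normal) pairs ranked correctly."""
    diff = scores_anomaly[:, None] - scores_normal[None, :]
    return (diff > 0).mean()

def auc_sigmoid(scores_anomaly, scores_normal):
    """Differentiable surrogate: the indicator replaced by a sigmoid;
    equals the regularization term of Eq. (7)."""
    diff = scores_anomaly[:, None] - scores_normal[None, :]
    return (1.0 / (1.0 + np.exp(-diff))).mean()
```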

3 Related work

A number of unsupervised methods for anomaly detection, which is sometimes called outlier detection [Hodge and Austin2004] or novelty detection [Markou and Singh2003], have been proposed, such as the local outlier factor [Breunig et al.2000], one-class support vector machines [Schölkopf et al.2001], and the isolation forest [Liu, Ting, and Zhou2008]. For density estimation based anomaly detection, Gaussian distributions [Shewhart1931], Gaussian mixtures [Eskin2000], and kernel density estimators [Laxhammar, Falkman, and Sviestins2009] have been used. Density estimation methods have been regarded as unsuitable for anomaly detection in high-dimensional data due to the difficulty of estimating multivariate probability distributions [Friedland, Gentzel, and Jensen2014, Hido et al.2011]. Although some supervised anomaly detection methods have been proposed [Nadeem et al.2016, Gao, Cheng, and Tan2006, Das et al.2016, Das et al.2017, Munawar, Vinayavekhin, and De Magistris2017, Pimentel et al.2018, Akcay, Atapour-Abarghouei, and Breckon2018, Yamanaka et al.2019], they are not based on deep autoregressive density estimators, which achieve high density estimation performance.

Recent research on neural networks has made substantial progress on density estimation for high-dimensional data. Neural network based density estimators, including the VAE [Kingma and Welling2013], flow-based generative models [Dinh, Krueger, and Bengio2014, Dinh, Sohl-Dickstein, and Bengio2016, Kingma and Dhariwal2018], and autoregressive models [Uria, Murray, and Larochelle2013, Raiko et al.2014, Germain et al.2015, Uria et al.2016], can flexibly learn dependencies across different attributes and have achieved high density estimation performance. Autoregressive models have been successfully used for density estimation, as well as for modeling images [Oord, Kalchbrenner, and Kavukcuoglu2016] and speech [Van Den Oord et al.2016]. They can compute the probability of each instance exactly, which is desirable since we use the probability as the anomaly score. A shortcoming of autoregressive models is that they require substantial computational time to generate samples; however, generating samples is not necessary for anomaly detection. The VAE can approximate a lower bound of the log probability, but it cannot compute the probability exactly. By using importance sampling [Burda, Grosse, and Salakhutdinov2015], one can calculate a lower bound that approaches the true log probability as the number of samples increases, although an infinite number of samples would be required to reach the true probability. Flow-based generative models can calculate the probability exactly, and can generate samples efficiently and effectively [Kingma and Dhariwal2018]. However, they require specialized transformation functions that are invertible and whose Jacobian determinants can be calculated easily, and their density estimation performance is lower than that of autoregressive models. Although we use autoregressive models in our framework, the VAE and flow-based generative models are straightforwardly applicable to it. The VAE has been used for unsupervised anomaly detection [An and Cho2015, Suh et al.2016, Xu et al.2018], but not for supervised anomaly detection.

When the hyperparameter $\lambda$ is large, the second term in the objective function (7), which is an approximation of the AUC, is dominant. The proposed method is therefore related to AUC maximization [Cortes and Mohri2004, Brefeld and Scheffer2005], which has been used for training on class-imbalanced data. The proposed method employs likelihood maximization on normal instances as well as AUC maximization, which enables us to improve performance with a few training anomalous instances. In our experiments, described in Section 4, we demonstrate that both likelihood maximization and AUC maximization are effective for achieving good performance on various datasets.

4 Experiments

Data

We evaluated our proposed supervised anomaly detection method based on deep autoregressive density estimators on 16 datasets used for unsupervised outlier detection [Campos et al.2016]. (The datasets were obtained from http://www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/.) The number of instances N, the number of attributes D, the number of anomalous instances A, and the anomaly rate of each dataset are shown in Table 1. Each attribute was linearly normalized to the range [0, 1], and duplicate instances were removed. We used 80% of the normal instances and three anomalous instances for training, 10% of the normal instances and three anomalous instances for validation, and the remaining instances for testing. For the evaluation measure, we used the AUC. For each dataset, we randomly generated ten sets of training, validation, and test data, and calculated the average AUC over the ten sets.
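For concreteness, one random split under this protocol might look like the following NumPy sketch; the function name and fixed seed are illustrative assumptions:

```python
import numpy as np

def split_indices(y, n_anomaly_train=3, n_anomaly_val=3, seed=0):
    """Return train/validation/test index arrays: 80% of normal instances
    plus three anomalies for training, 10% of normal instances plus three
    anomalies for validation, and the remainder for testing."""
    rng = np.random.default_rng(seed)
    normal = rng.permutation(np.flatnonzero(y == 0))
    anomaly = rng.permutation(np.flatnonzero(y == 1))
    n_tr, n_val = int(0.8 * len(normal)), int(0.1 * len(normal))
    train = np.concatenate([normal[:n_tr], anomaly[:n_anomaly_train]])
    val = np.concatenate([normal[n_tr:n_tr + n_val],
                          anomaly[n_anomaly_train:n_anomaly_train + n_anomaly_val]])
    test = np.concatenate([normal[n_tr + n_val:],
                           anomaly[n_anomaly_train + n_anomaly_val:]])
    return train, val, test
```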

Dataset N D A rate
Annthyroid 7016 21 350 0.050
Arrhythmia 305 259 61 0.200
Cardiotocography 2068 21 413 0.200
HeartDisease 187 13 37 0.198
InternetAds 1775 1555 177 0.100
Ionosphere 351 32 126 0.359
KDDCup99 60839 79 246 0.004
PageBlocks 5171 10 258 0.050
Parkinson 60 22 12 0.200
PenDigits 9868 16 20 0.002
Pima 625 8 125 0.200
Shuttle 1013 9 13 0.013
SpamBase 3485 57 697 0.200
Stamps 325 9 16 0.049
Waveform 3443 21 100 0.029
Wilt 4671 5 93 0.020
Table 1: Statistics of the datasets used in our experiments. N is the number of instances, D is the number of attributes, A is the number of anomalous instances, and rate is the anomaly rate.

Comparing methods

We compared the proposed method with the following nine methods: LOF, OCSVM, IF, VAE, MADE, KNN, SVM, RF, and NN. LOF, OCSVM, IF, VAE, and MADE are unsupervised anomaly detection methods, where the attribute vector is used for calculating the anomaly score but the label information is not used. KNN, SVM, RF, and NN, as well as the proposed method, are supervised anomaly detection methods, where both the attribute vector and the label information are used. The hyperparameters were selected based on the AUC score on the validation data for both the unsupervised and supervised methods. We used the scikit-learn implementations [Pedregosa et al.2011] of LOF, OCSVM, IF, KNN, SVM, RF, and NN.

  • LOF is the local outlier factor method [Breunig et al.2000]. The LOF unsupervisedly detects anomalies based on the degree of isolation from the surrounding neighborhood. The number of neighbors was tuned using the validation data.

  • OCSVM is the one-class support vector machine [Schölkopf et al.2001], which is an extension of the support vector machine (SVM) to the case of unlabeled data. The OCSVM finds the maximal margin hyperplane that separates the given normal data from the origin by embedding them into a high-dimensional space via a kernel function. We used the RBF kernel, and the kernel hyperparameter was tuned using the validation data.

  • IF is the isolation forest method [Liu, Ting, and Zhou2008], which is a tree-based unsupervised anomaly detection method. The IF isolates anomalies by randomly selecting an attribute and randomly selecting a split value between the maximum and minimum values of the selected attribute. The number of base estimators was chosen using the validation data.

  • VAE is the variational autoencoder [Kingma and Welling2013], which is a density estimation method based on neural networks. With the VAE, the observation is assumed to follow a Gaussian distribution whose mean and variance are modeled by a neural network that takes latent variables as input. The latent variable is also modeled by another neural network that takes the attribute vector as input. We used three-layered feed-forward neural networks with 100 hidden units and a 20-dimensional latent space. We optimized the neural network parameters using ADAM. The number of epochs was selected using the validation data.

  • MADE is the deep masked autoencoder density estimator [Germain et al.2015], which the proposed method uses for the density function. The proposed method with $\lambda = 0$ corresponds to the MADE. We used the same parameter settings as the proposed method, described in the next subsection.

  • KNN is the $k$-nearest neighbor method, which classifies instances based on the votes of their neighbors. The number of neighbors was selected using the validation data.

  • SVM is the support vector machine [Schölkopf, Smola, and others2002], which is a kernel-based binary classification method. We used the RBF kernel, and the kernel hyperparameter was tuned using the validation data.

  • RF is the random forest method [Breiman2001], which is a meta estimator that fits a number of decision tree classifiers. The number of trees was chosen using the validation data.

  • NN is the feed-forward neural network classifier. We used three layers with rectified linear unit (ReLU) activation, where the number of hidden units was selected using the validation data.

Settings of the proposed method

We used Gaussian mixtures with $K$ components for the output layer. The number of hidden layers was one, the number of hidden units was 500, the number of masks was ten, and the number of different orderings was ten. The hyperparameter $\lambda$ was selected using the validation data. The validation data were also used for early stopping, where the maximum number of training epochs was 100. We optimized the neural network parameters using ADAM.
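Model selection then amounts to training one model per candidate $\lambda$ and keeping the one with the highest validation AUC, as in the hypothetical sketch below; `train_model` and `validation_auc` stand in for the training and evaluation routines described above:

```python
def select_lambda(lambda_candidates, train_model, validation_auc):
    """Pick lambda by validation AUC. train_model(lam) trains the density
    estimator with regularization weight lam (early stopping on validation
    AUC, at most 100 epochs); validation_auc(model) evaluates it."""
    best_lam, best_auc = None, float("-inf")
    for lam in lambda_candidates:
        auc = validation_auc(train_model(lam))
        if auc > best_auc:
            best_lam, best_auc = lam, auc
    return best_lam
```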

Results

Dataset LOF OCSVM IF VAE MADE KNN SVM RF NN Proposed
Annthyroid 0.627 0.667 0.700 0.766 0.716 0.510 0.741 0.875 0.596 0.776
Arrhythmia 0.711 0.718 0.649 0.668 0.694 0.504 0.568 0.621 0.638 0.677
Cardiotocography 0.569 0.834 0.836 0.726 0.828 0.582 0.846 0.707 0.554 0.873
HeartDisease 0.581 0.688 0.759 0.729 0.803 0.664 0.722 0.701 0.540 0.825
InternetAds 0.677 0.836 0.562 0.860 0.780 0.533 0.804 0.601 0.711 0.834
Ionosphere 0.860 0.951 0.864 0.844 0.821 0.564 0.933 0.883 0.792 0.845
KDDCup99 0.582 0.993 0.992 0.968 0.995 0.707 0.812 0.939 0.979 0.988
PageBlocks 0.776 0.930 0.923 0.924 0.847 0.560 0.730 0.675 0.428 0.815
Parkinson 0.840 0.847 0.748 0.747 0.797 0.640 0.817 0.735 0.733 0.807
PenDigits 0.898 0.989 0.955 0.915 0.901 0.839 0.999 0.946 0.617 0.993
Pima 0.586 0.688 0.737 0.677 0.725 0.530 0.649 0.569 0.296 0.744
Shuttle 0.962 0.918 0.949 0.952 0.927 0.879 0.997 0.999 0.400 0.969
SpamBase 0.521 0.662 0.781 0.775 0.735 0.536 0.750 0.766 0.621 0.786
Stamps 0.814 0.890 0.922 0.908 0.902 0.761 0.895 0.868 0.832 0.904
Waveform 0.729 0.737 0.709 0.751 0.743 0.532 0.801 0.591 0.709 0.800
Wilt 0.731 0.352 0.596 0.455 0.707 0.503 0.801 0.623 0.649 0.785
average 0.717 0.794 0.793 0.791 0.807 0.615 0.804 0.756 0.631 0.839
Table 2: AUCs on 16 datasets with three training anomalous instances by unsupervised anomaly detection methods (LOF, OCSVM, IF, VAE, MADE) and supervised anomaly detection methods (KNN, SVM, RF, NN, Proposed). Values in bold typeface are not statistically different (at the 5% level) from the best performing method according to a paired t-test. The bottom row shows the average AUC over the datasets.

(a) one training anomalous instance
Dataset KNN SVM RF NN Proposed
Annthyroid 0.506 0.664 0.691 0.586 0.726
Arrhythmia 0.503 0.522 0.545 0.550 0.680
Cardiotocography 0.529 0.744 0.565 0.500 0.846
HeartDisease 0.542 0.765 0.605 0.292 0.820
InternetAds 0.513 0.700 0.546 0.615 0.851
Ionosphere 0.516 0.919 0.745 0.758 0.820
KDDCup99 0.575 0.833 0.791 0.979 0.949
PageBlocks 0.514 0.687 0.530 0.421 0.848
Parkinson 0.642 0.847 0.762 0.445 0.850
PenDigits 0.766 0.996 0.648 0.483 0.987
Pima 0.506 0.586 0.517 0.365 0.691
Shuttle 0.661 0.989 0.861 0.532 0.985
SpamBase 0.510 0.758 0.697 0.462 0.703
Stamps 0.639 0.826 0.726 0.795 0.849
Waveform 0.506 0.752 0.510 0.695 0.748
Wilt 0.503 0.725 0.543 0.614 0.776
average 0.558 0.770 0.643 0.568 0.821

(b) five training anomalous instances
Dataset KNN SVM RF NN Proposed
Annthyroid 0.517 0.807 0.857 0.610 0.761
Arrhythmia 0.513 0.650 0.642 0.657 0.701
Cardiotocography 0.589 0.885 0.790 0.594 0.901
HeartDisease 0.740 0.809 0.779 0.695 0.824
InternetAds 0.568 0.849 0.666 0.786 0.866
Ionosphere 0.713 0.943 0.945 0.817 0.887
KDDCup99 0.754 0.890 0.962 0.980 0.979
PageBlocks 0.630 0.726 0.789 0.450 0.891
PenDigits 0.890 0.999 0.969 0.733 0.995
Pima 0.546 0.663 0.653 0.304 0.771
SpamBase 0.562 0.814 0.869 0.780 0.852
Stamps 0.873 0.946 0.924 0.805 0.920
Waveform 0.560 0.894 0.678 0.750 0.866
Wilt 0.512 0.852 0.730 0.654 0.812
average 0.640 0.838 0.804 0.687 0.859
Table 3: AUCs on datasets with (a) one and (b) five training anomalous instances by supervised anomaly detection methods. The AUCs by the unsupervised methods are the same as in Table 2, since they do not use the anomaly label information for training. Values in bold typeface are not statistically different (at the 5% level) from the best performing method according to a paired t-test that also includes the AUCs by the unsupervised methods. The AUCs on the Parkinson and Shuttle datasets with five training anomalous instances could not be calculated, since these datasets do not contain enough anomalous instances.
Figure 3: AUCs by the proposed method with different hyperparameters $\lambda$, with one panel per dataset (Annthyroid, Arrhythmia, Cardiotocography, HeartDisease, InternetAds, Ionosphere, KDDCup99, PageBlocks, Parkinson, PenDigits, Pima, Shuttle, SpamBase, Stamps, Waveform, Wilt). The x-axis is $\lambda$ and the y-axis is the AUC. The error bar shows the standard error. The horizontal line is the AUC with $\lambda = 0$.

Table 2 shows the AUC results. The proposed method achieved the highest average AUC among the ten methods. The AUC by the MADE was high compared with the other unsupervised methods, which indicates that neural autoregressive density estimators are useful for unsupervised anomaly detection. On some datasets, e.g., Annthyroid, PenDigits, and Wilt, the AUC by the proposed method was statistically higher than that by the MADE, and there were no datasets where the AUC by the MADE was statistically higher than that by the proposed method. This result indicates that the regularization term in the proposed method improves anomaly detection performance with a few labeled anomalous instances. Among the supervised binary classifier based methods, the SVM achieved high AUCs. However, its performance on some datasets, e.g., Arrhythmia and Pima, was very low. On the other hand, the proposed method achieved relatively high AUCs on all of the datasets. This would be because the proposed method incorporates the characteristics of unsupervised methods through the likelihood term in the objective function as well as the characteristics of supervised methods through the regularization term. Table 3 shows the AUC results with (a) one and (b) five training anomalous instances by the supervised methods. The proposed method also achieved the highest average AUC in these settings. The average AUC by the proposed method increased as the number of training anomalous instances increased: the AUCs were 0.807, 0.821, 0.839, and 0.859 when zero, one, three, and five anomalous instances were used for training, respectively. The average computational time for training the proposed method was 2.78, 0.53, 0.24, 0.03, 9.50, 0.05, 62.39, 1.36, 0.02, 4.12, 0.05, 0.09, 2.06, 0.03, 0.99, and 0.82 seconds on the Annthyroid, Arrhythmia, Cardiotocography, HeartDisease, InternetAds, Ionosphere, KDDCup99, PageBlocks, Parkinson, PenDigits, Pima, Shuttle, SpamBase, Stamps, Waveform, and Wilt datasets, respectively, using a computer with a Xeon Gold 6130 2.10GHz CPU.

Figure 3 shows the AUCs on the test data by the proposed method with different hyperparameters $\lambda$, trained on datasets with three anomalous instances. The best hyperparameter differed across datasets. For example, a high $\lambda$ was better on the PenDigits and Shuttle datasets, a low $\lambda$ was better on the Annthyroid and Stamps datasets, and an intermediate $\lambda$ was better on the Cardiotocography and Ionosphere datasets. This result indicates that AUC maximization without likelihood maximization, which corresponds to the proposed method with a high $\lambda$, is not effective on some datasets. The proposed method achieved high performance across various datasets by automatically adapting $\lambda$ using the validation data to control the balance between likelihood maximization and AUC maximization.

5 Conclusion

We have proposed a supervised anomaly detection method based on neural autoregressive models. With the proposed method, the neural autoregressive model is trained so that the likelihood of normal instances is maximized and the likelihood of anomalous instances is lower than that of normal instances. The proposed method can detect anomalies in regions where there are no normal instances, as well as in regions where anomalous instances are closely located. We have experimentally confirmed the effectiveness of the proposed method using 16 datasets. Although our results have been encouraging to date, our approach can be further improved in a number of ways. First, we would like to extend our framework to the semi-supervised setting [Blanchard, Lee, and Scott2010], where unlabeled instances as well as labeled anomalous and normal instances are given. Second, we plan to incorporate other neural density estimators, including the VAE, into our framework.

References

  • [Akcay, Atapour-Abarghouei, and Breckon2018] Akcay, S.; Atapour-Abarghouei, A.; and Breckon, T. P. 2018. Ganomaly: Semi-supervised anomaly detection via adversarial training. arXiv preprint arXiv:1805.06725.
  • [Aleskerov, Freisleben, and Rao1997] Aleskerov, E.; Freisleben, B.; and Rao, B. 1997. Cardwatch: A neural network based database mining system for credit card fraud detection. In IEEE/IAFE Computational Intelligence for Financial Engineering, 220–226.
  • [An and Cho2015] An, J., and Cho, S. 2015. Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE 2:1–18.
  • [Barnett and Lewis1974] Barnett, V., and Lewis, T. 1974. Outliers in statistical data. Wiley.
  • [Blanchard, Lee, and Scott2010] Blanchard, G.; Lee, G.; and Scott, C. 2010. Semi-supervised novelty detection. Journal of Machine Learning Research 11(Nov):2973–3009.
  • [Brefeld and Scheffer2005] Brefeld, U., and Scheffer, T. 2005. AUC maximizing support vector learning. In Proceedings of the ICML Workshop on ROC Analysis in Machine Learning.
  • [Breiman2001] Breiman, L. 2001. Random forests. Machine Learning 45(1):5–32.
  • [Breunig et al.2000] Breunig, M. M.; Kriegel, H.-P.; Ng, R. T.; and Sander, J. 2000. LOF: identifying density-based local outliers. ACM SIGMOD Record 29(2):93–104.
  • [Burda, Grosse, and Salakhutdinov2015] Burda, Y.; Grosse, R.; and Salakhutdinov, R. 2015. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.
  • [Campos et al.2016] Campos, G. O.; Zimek, A.; Sander, J.; Campello, R. J.; Micenková, B.; Schubert, E.; Assent, I.; and Houle, M. E. 2016. On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Mining and Knowledge Discovery 30(4):891–927.
  • [Chandola, Banerjee, and Kumar2009] Chandola, V.; Banerjee, A.; and Kumar, V. 2009. Anomaly detection: A survey. ACM Computing Surveys 41(3):15.
  • [Cortes and Mohri2004] Cortes, C., and Mohri, M. 2004. AUC optimization vs. error rate minimization. In Advances in Neural Information Processing Systems, 313–320.
  • [Das et al.2016] Das, S.; Wong, W.-K.; Dietterich, T.; Fern, A.; and Emmott, A. 2016. Incorporating expert feedback into active anomaly discovery. In 16th International Conference on Data Mining, 853–858. IEEE.
  • [Das et al.2017] Das, S.; Wong, W.-K.; Fern, A.; Dietterich, T. G.; and Siddiqui, M. A. 2017. Incorporating feedback into tree-based anomaly detection. arXiv preprint arXiv:1708.09441.
  • [Dinh, Krueger, and Bengio2014] Dinh, L.; Krueger, D.; and Bengio, Y. 2014. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516.
  • [Dinh, Sohl-Dickstein, and Bengio2016] Dinh, L.; Sohl-Dickstein, J.; and Bengio, S. 2016. Density estimation using real NVP. arXiv preprint arXiv:1605.08803.
  • [Dokas et al.2002] Dokas, P.; Ertoz, L.; Kumar, V.; Lazarevic, A.; Srivastava, J.; and Tan, P.-N. 2002. Data mining for network intrusion detection. In NSF Workshop on Next Generation Data Mining, 21–30.
  • [Eskin2000] Eskin, E. 2000. Anomaly detection over noisy data using learned probability distributions. In International Conference on Machine Learning.
  • [Friedland, Gentzel, and Jensen2014] Friedland, L.; Gentzel, A.; and Jensen, D. 2014. Classifier-adjusted density estimation for anomaly detection and one-class classification. In SIAM International Conference on Data Mining, 578–586.
  • [Fujimaki, Yairi, and Machida2005] Fujimaki, R.; Yairi, T.; and Machida, K. 2005. An approach to spacecraft anomaly detection problem using kernel feature space. In International Conference on Knowledge Discovery in Data Mining, 401–410.
  • [Gao, Cheng, and Tan2006] Gao, J.; Cheng, H.; and Tan, P.-N. 2006. A novel framework for incorporating labeled examples into anomaly detection. In Proceedings of the 2006 SIAM International Conference on Data Mining, 594–598. SIAM.
  • [Germain et al.2015] Germain, M.; Gregor, K.; Murray, I.; and Larochelle, H. 2015. MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, 881–889.
  • [Görnitz et al.2013] Görnitz, N.; Kloft, M.; Rieck, K.; and Brefeld, U. 2013. Toward supervised anomaly detection. Journal of Artificial Intelligence Research 46:235–262.
  • [Hido et al.2011] Hido, S.; Tsuboi, Y.; Kashima, H.; Sugiyama, M.; and Kanamori, T. 2011. Statistical outlier detection using direct density ratio estimation. Knowledge and Information Systems 26(2):309–336.
  • [Hodge and Austin2004] Hodge, V., and Austin, J. 2004. A survey of outlier detection methodologies. Artificial Intelligence Review 22(2):85–126.
  • [Idé and Kashima2004] Idé, T., and Kashima, H. 2004. Eigenspace-based anomaly detection in computer systems. In International Conference on Knowledge Discovery and Data Mining, 440–449.
  • [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. ADAM: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [Kingma and Dhariwal2018] Kingma, D. P., and Dhariwal, P. 2018. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039.
  • [Kingma and Welling2013] Kingma, D. P., and Welling, M. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
  • [Laxhammar, Falkman, and Sviestins2009] Laxhammar, R.; Falkman, G.; and Sviestins, E. 2009. Anomaly detection in sea traffic - a comparison of the Gaussian mixture model and the kernel density estimator. In International Conference on Information Fusion, 756–763.
  • [Liu, Ting, and Zhou2008] Liu, F. T.; Ting, K. M.; and Zhou, Z.-H. 2008. Isolation forest. In Proceeding of the 8th IEEE International Conference on Data Mining, 413–422. IEEE.
  • [Markou and Singh2003] Markou, M., and Singh, S. 2003. Novelty detection: a review. Signal Processing 83(12):2481–2497.
  • [Mukkamala, Sung, and Ribeiro2005] Mukkamala, S.; Sung, A.; and Ribeiro, B. 2005. Model selection for kernel based intrusion detection systems. In Adaptive and Natural Computing Algorithms, 458–461. Springer.
  • [Munawar, Vinayavekhin, and De Magistris2017] Munawar, A.; Vinayavekhin, P.; and De Magistris, G. 2017. Limiting the reconstruction capability of generative neural network using negative learning. In 27th International Workshop on Machine Learning for Signal Processing. IEEE.
  • [Nadeem et al.2016] Nadeem, M.; Marshall, O.; Singh, S.; Fang, X.; and Yuan, X. 2016. Semi-supervised deep neural network for network intrusion detection. In KSU Conference on Cybersecurity Education, Research and Practice.
  • [Oord, Kalchbrenner, and Kavukcuoglu2016] Oord, A. v. d.; Kalchbrenner, N.; and Kavukcuoglu, K. 2016. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759.
  • [Parra, Deco, and Miesbach1996] Parra, L.; Deco, G.; and Miesbach, S. 1996. Statistical independence and novelty detection with information preserving nonlinear maps. Neural Computation 8(2):260–269.
  • [Patcha and Park2007] Patcha, A., and Park, J.-M. 2007. An overview of anomaly detection techniques: Existing solutions and latest technological trends. Computer Networks 51(12):3448–3470.
  • [Pedregosa et al.2011] Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830.
  • [Pimentel et al.2018] Pimentel, T.; Monteiro, M.; Viana, J.; Veloso, A.; and Ziviani, N. 2018. A generalized active learning approach for unsupervised anomaly detection. arXiv preprint arXiv:1805.09411.
  • [Raiko et al.2014] Raiko, T.; Li, Y.; Cho, K.; and Bengio, Y. 2014. Iterative neural autoregressive distribution estimator NADE-k. In Advances in Neural Information Processing Systems, 325–333.
  • [Rapaka, Novokhodko, and Wunsch2003] Rapaka, A.; Novokhodko, A.; and Wunsch, D. 2003. Intrusion detection using radial basis function network on sequences of system calls. In International Joint Conference on Neural Networks, volume 3, 1820–1825.
  • [Schölkopf et al.2001] Schölkopf, B.; Platt, J. C.; Shawe-Taylor, J.; Smola, A. J.; and Williamson, R. C. 2001. Estimating the support of a high-dimensional distribution. Neural Computation 13(7):1443–1471.
  • [Schölkopf, Smola, and others2002] Schölkopf, B.; Smola, A. J.; et al. 2002. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press.
  • [Shewhart1931] Shewhart, W. A. 1931. Economic control of quality of manufactured product. ASQ Quality Press.
  • [Singh and Silakari2009] Singh, S., and Silakari, S. 2009. An ensemble approach for feature selection of cyber attack dataset. arXiv preprint arXiv:0912.1014.
  • [Suh et al.2016] Suh, S.; Chae, D. H.; Kang, H.-G.; and Choi, S. 2016. Echo-state conditional variational autoencoder for anomaly detection. In International Joint Conference on Neural Networks, 1015–1022.
  • [Uria et al.2016] Uria, B.; Côté, M.-A.; Gregor, K.; Murray, I.; and Larochelle, H. 2016. Neural autoregressive distribution estimation. Journal of Machine Learning Research 17(1):7184–7220.
  • [Uria, Murray, and Larochelle2013] Uria, B.; Murray, I.; and Larochelle, H. 2013. RNADE: The real-valued neural autoregressive density-estimator. In Advances in Neural Information Processing Systems, 2175–2183.
  • [Van Den Oord et al.2016] Van Den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A. W.; and Kavukcuoglu, K. 2016. WaveNet: A generative model for raw audio. In SSW, 125.
  • [Wong et al.2003] Wong, W.-K.; Moore, A. W.; Cooper, G. F.; and Wagner, M. M. 2003. Bayesian network anomaly pattern detection for disease outbreaks. In International Conference on Machine Learning, 808–815.
  • [Xu et al.2018] Xu, H.; Chen, W.; Zhao, N.; Li, Z.; Bu, J.; Li, Z.; Liu, Y.; Zhao, Y.; Pei, D.; Feng, Y.; et al. 2018. Unsupervised anomaly detection via variational auto-encoder for seasonal kpis in web applications. In World Wide Web Conference, 187–196.
  • [Yamanaka et al.2019] Yamanaka, Y.; Iwata, T.; Takahashi, H.; Yamada, M.; and Kanai, S. 2019. Autoencoding binary classifiers for supervised anomaly detection. arXiv preprint arXiv:1903.10709.
  • [Yamanishi et al.2004] Yamanishi, K.; Takeuchi, J.-I.; Williams, G.; and Milne, P. 2004. On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. Data Mining and Knowledge Discovery 8(3):275–300.
  • [Yan et al.2003] Yan, L.; Dodier, R. H.; Mozer, M.; and Wolniewicz, R. H. 2003. Optimizing classifier performance via an approximation to the Wilcoxon-Mann-Whitney statistic. In International Conference on Machine Learning, 848–855.
  • [Yeung and Ding2003] Yeung, D.-Y., and Ding, Y. 2003. Host-based intrusion detection using dynamic and static behavioral models. Pattern Recognition 36(1):229–243.