1 Introduction
Anomaly detection is an important task in artificial intelligence: the task of finding anomalous instances in a dataset. Anomaly detection has been used in a wide variety of applications [Chandola, Banerjee, and Kumar 2009; Patcha and Park 2007; Hodge and Austin 2004], such as network intrusion detection for cybersecurity [Dokas et al. 2002; Yamanishi et al. 2004], fraud detection for credit cards [Aleskerov, Freisleben, and Rao 1997], defect detection in industrial machines [Fujimaki, Yairi, and Machida 2005; Idé and Kashima 2004], and disease outbreak detection [Wong et al. 2003]. Anomalies, also called outliers, are instances that rarely occur. It is therefore natural to regard instances in low probability density regions as anomalous, and many density estimation based anomaly detection methods have been proposed
[Barnett and Lewis 1974; Parra, Deco, and Miesbach 1996; Yeung and Ding 2003]. With the recent advances in deep learning, density estimation performance has been greatly improved by neural network based density estimators, such as variational autoencoders (VAE) [Kingma and Welling 2013], flow-based generative models [Dinh, Krueger, and Bengio 2014; Dinh, Sohl-Dickstein, and Bengio 2016; Kingma and Dhariwal 2018], and autoregressive models [Uria, Murray, and Larochelle 2013; Raiko et al. 2014; Germain et al. 2015; Uria et al. 2016]. The VAE has been used for anomaly detection [An and Cho 2015; Suh et al. 2016; Xu et al. 2018].

In some situations, label information, which indicates whether each instance is anomalous or normal, is available [Görnitz et al. 2013]
. The label information is valuable for improving anomaly detection performance. However, existing neural network based density estimation methods cannot exploit it. To use anomaly label information, supervised classifiers, such as nearest neighbor methods [Singh and Silakari 2009; Mukkamala, Sung, and Ribeiro 2005] and feedforward neural networks [Rapaka, Novokhodko, and Wunsch 2003], have been used. However, these standard supervised classifiers do not perform well when labeled anomalous instances are very few, which is often the case, since anomalous instances by definition rarely occur.

In this paper, we propose a neural network density estimator based anomaly detection method that can exploit label information. The proposed method performs well even when only a few labeled anomalous instances are given, since it is based on a density estimator, which works without any labeled anomalous instances. We employ the negative log probability of an instance as its anomaly score. For the density function used to calculate this probability, we use neural autoregressive models [Uria et al. 2016; Germain et al. 2015]. Autoregressive models can compute the probability density of a test instance exactly; in contrast, the VAE only approximates a lower bound of the probability density. Moreover, autoregressive models have achieved higher density estimation performance than other neural density estimators, such as the VAE and flow-based generative models [Dinh, Sohl-Dickstein, and Bengio 2016].
The density function is trained so that the probability density of normal instances becomes high, as in standard maximum likelihood estimation. In addition, we would like the density function to satisfy the property that the probability density of anomalous instances is lower than that of normal instances. To achieve this, we introduce a regularization term calculated from the log likelihood ratio between normal and anomalous instances. Since the resulting objective function is differentiable, the density function can be estimated efficiently using stochastic gradient-based optimization methods.
Figure 1 illustrates anomaly scores with an unsupervised density estimation based anomaly detection method (a), a supervised binary classifier based anomaly detection method (b), and the proposed method (c). The unsupervised method considers only normal instances, and the anomaly score is low where normal instances are located. Since it cannot exploit information on anomalous instances, the anomaly score cannot be increased even where anomalous instances are closely located. In this example, it succeeds in detecting the test anomalous instances at the far left and far right, but fails to detect the test anomalous instance at the center, where normal instances are closely located. The supervised method considers both normal and anomalous instances, placing a decision boundary between them. It can detect the test anomalous instance at the center since an observed anomalous instance exists in the same region. However, it cannot detect the test anomalous instances at both ends since they lie on the normal side of the decision boundary. With the proposed method, the anomaly score is high both in regions where normal instances are absent and in regions where anomalous instances are located. Therefore, it can detect all of the test anomalous instances in this example.
Figure 1: (a) Unsupervised anomaly detection, (b) supervised anomaly detection, (c) proposed method.
The remainder of the paper is organized as follows. In Section 2, we define our task, and propose our method for supervised anomaly detection based on the neural autoregressive estimators. In Section 3, we briefly review related work. In Section 4, we demonstrate the effectiveness of the proposed method using various datasets. Finally, we present concluding remarks and a discussion of future work in Section 5.
2 Proposed method
Task
Suppose that we have a dataset $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, where $\mathbf{x}_n \in \mathbb{R}^{D}$ is the $D$-dimensional attribute vector of the $n$th instance, and $y_n \in \{0, 1\}$ is its anomaly label, i.e., $y_n = 1$ if it is anomalous and $y_n = 0$ if it is not anomalous, or normal. Our task is to estimate the anomaly score of unseen instances $\mathbf{x}$, such that the anomaly score of anomalous instances is high and that of normal instances is low.
Anomaly score
Anomalous instances occur rarely, while normal instances occur frequently. The proposed method therefore uses the following negative log probability as the anomaly score of instance $\mathbf{x}$,

$$a(\mathbf{x}; \boldsymbol{\theta}) = -\log p(\mathbf{x}; \boldsymbol{\theta}), \qquad (1)$$

where $\boldsymbol{\theta}$ are the parameters of the density function.
Density model
For the density function $p(\mathbf{x}; \boldsymbol{\theta})$, we use the deep masked autoencoder density estimator (MADE) [Germain et al. 2015], which is a neural autoregressive model. Any probability distribution can be decomposed into a product of nested conditional distributions using the product rule of probability,

$$p(\mathbf{x}; \boldsymbol{\theta}) = \prod_{d=1}^{D} p(x_d \mid \mathbf{x}_{<d}; \boldsymbol{\theta}), \qquad (2)$$

where $\mathbf{x}_{<d} = (x_1, \dots, x_{d-1})$ is the attribute vector before the $d$th attribute.
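As a concrete illustration of the decomposition in Eq. (2), the log density of an instance is simply the sum of conditional log densities, one per attribute. The sketch below is illustrative only; `cond_logpdf` is a hypothetical stand-in for the network-parameterized conditionals, not part of the paper's implementation.

```python
import math

def log_density(x, cond_logpdf):
    """Autoregressive log density: log p(x) = sum_d log p(x_d | x_<d).

    cond_logpdf(d, prefix, value) returns the conditional log density
    of attribute d taking the given value, given the prefix x_<d.
    """
    return sum(cond_logpdf(d, x[:d], x[d]) for d in range(len(x)))

# Toy check: independent standard Gaussians. Each conditional ignores its
# prefix, so the product rule recovers the joint density exactly.
def std_normal_cond(d, prefix, value):
    return -0.5 * (value ** 2 + math.log(2 * math.pi))

x = [0.0, 1.0, -1.0]
joint = log_density(x, std_normal_cond)
direct = sum(-0.5 * (v ** 2 + math.log(2 * math.pi)) for v in x)
```

In the MADE, the conditionals are not independent as in this toy check: each one is a function of the preceding attributes, produced by the masked network.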
We model the conditional distribution with the following Gaussian mixture,

$$p(x_d \mid \mathbf{x}_{<d}; \boldsymbol{\theta}) = \sum_{k=1}^{K} \pi_{dk}(\mathbf{x}_{<d}) \, \mathcal{N}\bigl(x_d \mid \mu_{dk}(\mathbf{x}_{<d}), \sigma_{dk}^{2}(\mathbf{x}_{<d})\bigr), \qquad (3)$$

where $K$ is the number of mixture components, $\mathcal{N}(x \mid \mu, \sigma^{2})$ is the Gaussian distribution with mean $\mu$ and variance $\sigma^{2}$, and $\pi_{dk}(\cdot)$, $\mu_{dk}(\cdot)$, $\sigma_{dk}^{2}(\cdot)$ are the neural networks that define the mixture weight, mean, and variance of the $k$th mixture component for the $d$th attribute, respectively, with $\pi_{dk} \geq 0$, $\sum_{k=1}^{K} \pi_{dk} = 1$, and $\sigma_{dk}^{2} > 0$. When the attribute is binary, we use the following Bernoulli distribution,
$$p(x_d \mid \mathbf{x}_{<d}; \boldsymbol{\theta}) = \phi_d(\mathbf{x}_{<d})^{x_d}\bigl(1 - \phi_d(\mathbf{x}_{<d})\bigr)^{1 - x_d}, \qquad (4)$$

where $\phi_d(\cdot)$ is the neural network that outputs the probability of $x_d$ being one. Similarly, Poisson and Gamma distributions with parameters modeled by neural networks can be used in the cases of nonnegative integers and nonnegative continuous values, respectively.
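The per-attribute conditionals can be evaluated as follows. This is a minimal sketch of the Gaussian mixture in Eq. (3), for fixed outputs of the weight/mean/variance networks, and of the Bernoulli case in Eq. (4); the function names are ours, not the paper's.

```python
import math

def gmm_logpdf(x, weights, means, variances):
    """Log density of a 1-D Gaussian mixture, log sum_k pi_k N(x | mu_k, s2_k),
    computed with the log-sum-exp trick for numerical stability."""
    logs = [
        math.log(w) - 0.5 * (math.log(2 * math.pi * s2) + (x - m) ** 2 / s2)
        for w, m, s2 in zip(weights, means, variances)
    ]
    mx = max(logs)
    return mx + math.log(sum(math.exp(l - mx) for l in logs))

def bernoulli_logpmf(x, p):
    """Log probability of a binary attribute, x * log p + (1-x) * log(1-p)."""
    return x * math.log(p) + (1 - x) * math.log(1 - p)

# A single mixture component reduces to a plain Gaussian log density.
single = gmm_logpdf(0.3, [1.0], [0.0], [1.0])
gauss = -0.5 * (math.log(2 * math.pi) + 0.3 ** 2)

# The Bernoulli log pmf exponentiates back to the success probability.
bern_p = math.exp(bernoulli_logpmf(1, 0.7))
```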
With the deep MADE, the conditional densities of the different attributes are defined by deep autoencoders with masks, so that the conditional density function for the $d$th attribute depends only on the preceding attributes $\mathbf{x}_{<d}$, and not on the remaining attributes $x_d, \dots, x_D$. Because all conditionals are computed in a single pass through one masked network, the MADE is more efficient than other autoregressive models.
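The masking idea can be sketched as follows: assign each hidden unit a degree and allow a connection only when it preserves the autoregressive ordering. This is a simplified one-hidden-layer construction in the spirit of MADE [Germain et al. 2015], not the authors' code; the helper name and the degree-sampling scheme are our assumptions.

```python
import random

def made_masks(n_in, n_hidden, rng=random.Random(0)):
    """Binary masks for a one-hidden-layer MADE-style network (a sketch).

    Each hidden unit h gets a degree m_h in {1, ..., n_in - 1}. An
    input-to-hidden connection from input i is allowed when m_h >= i,
    and a hidden-to-output connection to output d when d > m_h, so any
    path into output d comes only from inputs i < d (i.e. x_<d). The
    first output has no incoming paths: its conditional is unconditional.
    """
    degrees = [rng.randint(1, n_in - 1) for _ in range(n_hidden)]
    mask_in = [[1 if degrees[h] >= i else 0 for i in range(1, n_in + 1)]
               for h in range(n_hidden)]          # hidden x input
    mask_out = [[1 if d > degrees[h] else 0 for h in range(n_hidden)]
                for d in range(1, n_in + 1)]      # output x hidden
    return mask_in, mask_out

mask_in, mask_out = made_masks(n_in=4, n_hidden=8)

# Check the autoregressive property: no path from input i to output d
# with i >= d (0-indexed: column ii feeding output row od with ii >= od).
violations = sum(
    mask_out[od][h] * mask_in[h][ii]
    for od in range(4) for h in range(8) for ii in range(4)
    if ii >= od
)
```

Sampling several such mask sets corresponds to the multiple masks and orderings mentioned in the experimental settings.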
Note that in our framework we can use other density estimators, such as the VAE and flow-based generative models, as well as autoencoders, where the reconstruction error is used as the anomaly score.
Objective function
Let $\mathcal{I}$ be the set of indices of all given instances, $\mathcal{A} = \{n \in \mathcal{I} \mid y_n = 1\}$ the set of indices of anomalous instances, and $\mathcal{B} = \{n \in \mathcal{I} \mid y_n = 0\}$ the set of indices of normal instances. The anomaly score of anomalous instances should be higher than those of normal instances,

$$a(\mathbf{x}_n; \boldsymbol{\theta}) > a(\mathbf{x}_m; \boldsymbol{\theta}), \quad n \in \mathcal{A}, \; m \in \mathcal{B}. \qquad (5)$$

In addition, the following log likelihood of the normal instances should be high,

$$L(\boldsymbol{\theta}) = \frac{1}{|\mathcal{B}|} \sum_{m \in \mathcal{B}} \log p(\mathbf{x}_m; \boldsymbol{\theta}), \qquad (6)$$

since the anomaly score, defined as the negative log likelihood, of the normal instances should be low. Here, $|\cdot|$ represents the number of elements in the set.
We would like to maximize Eq. (6) while satisfying the constraints in Eq. (5) as far as possible. To make the objective function differentiable with respect to the parameters and free of constraints, we relax the constraints in Eq. (5) into a soft regularization term,

$$E(\boldsymbol{\theta}) = \frac{1}{|\mathcal{B}|} \sum_{m \in \mathcal{B}} \log p(\mathbf{x}_m; \boldsymbol{\theta}) + \frac{\lambda}{|\mathcal{A}||\mathcal{B}|} \sum_{n \in \mathcal{A}} \sum_{m \in \mathcal{B}} \sigma\bigl(\log p(\mathbf{x}_m; \boldsymbol{\theta}) - \log p(\mathbf{x}_n; \boldsymbol{\theta})\bigr), \qquad (7)$$

where $\sigma(\cdot)$ is the sigmoid function,

$$\sigma(x) = \frac{1}{1 + \exp(-x)}, \qquad (8)$$

and $\lambda \geq 0$ is a hyperparameter. Figure 2 shows the regularization term as a function of the log likelihood ratio. When the anomaly score of an anomalous instance is much higher than that of a normal instance, $\log p(\mathbf{x}_m; \boldsymbol{\theta}) - \log p(\mathbf{x}_n; \boldsymbol{\theta}) \gg 0$, the sigmoid function takes its maximum value of one. When the anomaly score of an anomalous instance is much lower than that of a normal instance, $\log p(\mathbf{x}_m; \boldsymbol{\theta}) - \log p(\mathbf{x}_n; \boldsymbol{\theta}) \ll 0$, the sigmoid function takes its minimum value of zero. Therefore, maximizing this regularization term moves the parameters toward satisfying the constraints in Eq. (5). We maximize the objective function (7) with a gradient-based optimization method such as ADAM [Kingma and Ba 2014]. When there are no labeled anomalous instances or $\lambda = 0$, the regularization term vanishes and only the likelihood term remains, which is the same objective function as in standard density estimation. Therefore, the proposed method can be seen as a generalization of unsupervised density estimation based anomaly detection methods.
The regularization term can be seen as a smoothed version of the area under the receiver operating characteristic (ROC) curve (AUC) [Yan et al. 2003], since the AUC is computed by

$$\mathrm{AUC} = \frac{1}{|\mathcal{A}||\mathcal{B}|} \sum_{n \in \mathcal{A}} \sum_{m \in \mathcal{B}} I\bigl(a(\mathbf{x}_n; \boldsymbol{\theta}) > a(\mathbf{x}_m; \boldsymbol{\theta})\bigr), \qquad (9)$$

where $I(\cdot)$ is the indicator function, i.e. $I(c) = 1$ if $c$ is true and $I(c) = 0$ otherwise, and the sigmoid function is a smooth approximation of the indicator function.
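The correspondence can be checked numerically: replacing the indicator in Eq. (9) with a sigmoid yields a differentiable quantity that is close to the exact AUC when the scores are well separated. A small self-contained check (the function names are ours):

```python
import math

def auc_exact(scores_anom, scores_norm):
    """Eq. (9): fraction of (anomalous, normal) pairs ranked correctly,
    using the indicator I(a_n > a_m)."""
    pairs = [(a, m) for a in scores_anom for m in scores_norm]
    return sum(1.0 for a, m in pairs if a > m) / len(pairs)

def auc_sigmoid(scores_anom, scores_norm):
    """Differentiable surrogate: the indicator replaced by a sigmoid."""
    pairs = [(a, m) for a in scores_anom for m in scores_norm]
    return sum(1.0 / (1.0 + math.exp(-(a - m))) for a, m in pairs) / len(pairs)

anom = [10.0, 8.0]   # anomaly scores of anomalous instances
norm = [1.0, 2.0, 0.5]  # anomaly scores of normal instances
exact = auc_exact(anom, norm)     # all pairs correctly ranked
approx = auc_sigmoid(anom, norm)  # close to the exact value
```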
3 Related work
A number of unsupervised methods for anomaly detection, which is sometimes called outlier detection [Hodge and Austin 2004; Markou and Singh 2003], have been proposed, such as the local outlier factor [Breunig et al. 2000], one-class support vector machines [Schölkopf et al. 2001], and the isolation forest [Liu, Ting, and Zhou 2008]. Among density estimation based anomaly detection methods, Gaussian distributions [Shewhart 1931] and Gaussian mixtures [Eskin 2000; Laxhammar, Falkman, and Sviestins 2009] have been used. Density estimation methods have been regarded as unsuitable for anomaly detection in high-dimensional data due to the difficulty of estimating multivariate probability distributions [Friedland, Gentzel, and Jensen 2014; Hido et al. 2011]. Although some supervised anomaly detection methods have been proposed [Nadeem et al. 2016; Gao, Cheng, and Tan 2006; Das et al. 2016; Das et al. 2017; Munawar, Vinayavekhin, and De Magistris 2017; Pimentel et al. 2018; Akcay, Atapour-Abarghouei, and Breckon 2018; Yamanaka et al. 2019], they are not based on deep autoregressive density estimators, which can achieve high density estimation performance.

Recent research on neural networks has made substantial progress on density estimation for high-dimensional data. Neural network based density estimators, including the VAE [Kingma and Welling 2013], flow-based generative models [Dinh, Krueger, and Bengio 2014; Dinh, Sohl-Dickstein, and Bengio 2016; Kingma and Dhariwal 2018], and autoregressive models [Uria, Murray, and Larochelle 2013; Raiko et al. 2014; Germain et al. 2015; Uria et al. 2016], can flexibly learn dependencies across different attributes and have achieved high density estimation performance. Autoregressive models have been successfully used for density estimation, as well as for modeling images [Oord, Kalchbrenner, and Kavukcuoglu 2016] and speech [Van Den Oord et al. 2016]. Autoregressive models can compute the probability of each instance exactly, which is desirable since we use the probability as the anomaly score. A shortcoming of autoregressive models is that they require considerable computational time to generate samples; however, generating samples is not necessary for anomaly detection. The VAE can compute an approximation of the lower bound of the log probability, but it cannot compute the probability exactly. Using importance sampling [Burda, Grosse, and Salakhutdinov 2015], one can calculate a lower bound that approaches the true log probability as the number of samples increases, although an infinite number of samples would be required to reach the true probability.
Flow-based generative models can calculate the probability exactly and can generate samples efficiently [Kingma and Dhariwal 2018]. However, they require specialized transformation functions that are invertible and whose Jacobian determinant can be computed easily, and their density estimation performance is lower than that of autoregressive models. Although we use autoregressive models in our framework, the VAE and flow-based generative models are straightforwardly applicable to it. The VAE has been used for unsupervised anomaly detection [An and Cho 2015; Suh et al. 2016; Xu et al. 2018], but not for supervised anomaly detection.
With a high value of the hyperparameter $\lambda$, the second term in the objective function (7), which is an approximation of the AUC, is dominant. Therefore, the proposed method is related to AUC maximization [Cortes and Mohri 2004; Brefeld and Scheffer 2005], which has been used for training on class-imbalanced data. The proposed method employs likelihood maximization on normal instances as well as AUC maximization, which enables us to improve performance with only a few training anomalous instances. In the experiments described in Section 4, we demonstrate that both likelihood and AUC maximization are effective for achieving good performance on various datasets.
4 Experiments
Data
We evaluated our proposed supervised anomaly detection method based on deep autoregressive density estimators on 16 datasets used for unsupervised outlier detection [Campos et al. 2016]¹. The number of instances $N$, the number of attributes $D$, the number of anomalous instances, and the anomaly rate of each dataset are shown in Table 1. Each attribute was linearly normalized to the range $[0, 1]$, and duplicate instances were removed. We used 80% of the normal instances and three anomalous instances for training, 10% of the normal instances and three anomalous instances for validation, and the remaining instances for testing. As the evaluation measure, we used the AUC. For each dataset, we randomly generated ten sets of training, validation, and test data, and calculated the average AUC over the ten sets.

¹The datasets were obtained from http://www.dbs.ifi.lmu.de/research/outlierevaluation/DAMI/.
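The preprocessing and splitting protocol described above can be sketched as follows; the function names and the use of Python's `random` module are our assumptions, not the original experimental code.

```python
import random

def prepare_split(normal, anomalous, rng=random.Random(0)):
    """Sketch of the split: 80% of normal instances plus three anomalies
    for training, 10% of normals plus three anomalies for validation,
    and all remaining instances for testing."""
    normal = list(normal)
    anomalous = list(anomalous)
    rng.shuffle(normal)
    rng.shuffle(anomalous)
    n80 = int(0.8 * len(normal))
    n90 = int(0.9 * len(normal))
    train = normal[:n80] + anomalous[:3]
    valid = normal[n80:n90] + anomalous[3:6]
    test = normal[n90:] + anomalous[6:]
    return train, valid, test

def minmax_normalize(column):
    """Linear normalization of one attribute to the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in column]

# 100 normal and 20 anomalous toy instances (identifiers stand in for vectors).
train, valid, test = prepare_split(range(100), range(1000, 1020))
col = minmax_normalize([2.0, 4.0, 6.0])
```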
Dataset  $N$  $D$  #anomalous  anomaly rate
Annthyroid  7016  21  350  0.050 
Arrhythmia  305  259  61  0.200 
Cardiotocography  2068  21  413  0.200 
HeartDisease  187  13  37  0.198 
InternetAds  1775  1555  177  0.100 
Ionosphere  351  32  126  0.359 
KDDCup99  60839  79  246  0.004 
PageBlocks  5171  10  258  0.050 
Parkinson  60  22  12  0.200 
PenDigits  9868  16  20  0.002 
Pima  625  8  125  0.200 
Shuttle  1013  9  13  0.013 
SpamBase  3485  57  697  0.200 
Stamps  325  9  16  0.049 
Waveform  3443  21  100  0.029 
Wilt  4671  5  93  0.020 
Comparing methods
We compared the proposed method with the following nine methods: LOF, OCSVM, IF, VAE, MADE, KNN, SVM, RF, and NN. LOF, OCSVM, IF, VAE, and MADE are unsupervised anomaly detection methods, where the attribute vector is used for calculating the anomaly score but the label information is not used. KNN, SVM, RF, and NN, as well as the proposed method, are supervised anomaly detection methods, where both the attributes and the label information are used. The hyperparameters were selected based on the AUC on the validation data for both the unsupervised and supervised methods. We used the scikit-learn implementations [Pedregosa et al. 2011] for LOF, OCSVM, IF, KNN, SVM, RF, and NN.
LOF is the local outlier factor method [Breunig et al. 2000]. The LOF detects anomalies in an unsupervised manner based on the degree of isolation from the surrounding neighborhood. The number of neighbors was tuned using the validation data.

OCSVM is the one-class support vector machine [Schölkopf et al. 2001], which is an extension of the support vector machine (SVM) to unlabeled data. The OCSVM finds the maximal margin hyperplane separating the given normal data from the origin after embedding them into a high-dimensional space via a kernel function. We used the RBF kernel, and the kernel hyperparameter was tuned using the validation data.
IF is the isolation forest method [Liu, Ting, and Zhou 2008], which is a tree-based unsupervised anomaly detection method. The IF isolates anomalies by randomly selecting an attribute and randomly selecting a split value between the maximum and minimum values of the selected attribute. The number of base estimators was chosen using the validation data.

VAE is the variational autoencoder [Kingma and Welling 2013], which is a neural network based density estimation method. With the VAE, the observation is assumed to follow a Gaussian distribution whose mean and variance are modeled by a neural network that takes latent variables as input. The latent variables are in turn modeled by another neural network that takes the attribute vector as input. We used three-layered feedforward neural networks with 100 hidden units and a 20-dimensional latent space. We optimized the neural network parameters using ADAM. The number of epochs was selected using the validation data.

MADE is the deep masked autoencoder density estimator [Germain et al. 2015], which the proposed method uses as its density function. The proposed method with $\lambda = 0$ corresponds to the MADE. We used the same parameter setting as the proposed method for the MADE, described in the next subsection.

KNN is the nearest neighbor method, which classifies instances based on the votes of their neighbors. The number of neighbors was selected using the validation data.

SVM is the support vector machine [Schölkopf, Smola, and others 2002], which is a kernel-based binary classification method. We used the RBF kernel, and the kernel hyperparameter was tuned using the validation data.

RF is the random forest method [Breiman 2001], which is an ensemble estimator that fits a number of decision tree classifiers. The number of trees was chosen using the validation data.
NN is the feedforward neural network classifier. We used three layers with rectified linear unit (ReLU) activation, where the number of hidden units was selected using the validation data.
Settings of the proposed method
We used Gaussian mixtures with $K$ components for the output layer. The number of hidden layers was one, the number of hidden units was 500, the number of masks was ten, and the number of different orderings was ten. The hyperparameter $\lambda$ was selected using the validation data. The validation data were also used for early stopping, where the maximum number of training epochs was 100. We optimized the neural network parameters using ADAM.
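The hyperparameter selection described above, choosing $\lambda$ by validation AUC, can be sketched as follows; `train_and_score` and the toy `fake_train` stand-in are hypothetical, not the authors' code.

```python
def validation_auc(scores_anom, scores_norm):
    """Fraction of (anomalous, normal) validation pairs ranked correctly."""
    pairs = [(a, m) for a in scores_anom for m in scores_norm]
    return sum(1.0 for a, m in pairs if a > m) / len(pairs)

def select_lambda(candidates, train_and_score):
    """Pick the lambda maximizing validation AUC. train_and_score(lam) is
    a hypothetical routine that trains the model with that lambda and
    returns (anomaly scores of validation anomalies, anomaly scores of
    validation normals)."""
    best_lam, best_auc = None, -1.0
    for lam in candidates:
        auc = validation_auc(*train_and_score(lam))
        if auc > best_auc:
            best_lam, best_auc = lam, auc
    return best_lam, best_auc

# Toy stand-in: pretend a larger lambda separates scores better, up to 1.
def fake_train(lam):
    sep = min(lam, 1.0)
    return [2.0 + sep, 2.5 + sep], [2.2, 2.4]

lam, auc = select_lambda([0.0, 0.1, 1.0, 10.0], fake_train)
```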
Results
LOF  OCSVM  IF  VAE  MADE  KNN  SVM  RF  NN  Proposed  

Annthyroid  0.627  0.667  0.700  0.766  0.716  0.510  0.741  0.875  0.596  0.776 
Arrhythmia  0.711  0.718  0.649  0.668  0.694  0.504  0.568  0.621  0.638  0.677 
Cardiotocography  0.569  0.834  0.836  0.726  0.828  0.582  0.846  0.707  0.554  0.873 
HeartDisease  0.581  0.688  0.759  0.729  0.803  0.664  0.722  0.701  0.540  0.825 
InternetAds  0.677  0.836  0.562  0.860  0.780  0.533  0.804  0.601  0.711  0.834 
Ionosphere  0.860  0.951  0.864  0.844  0.821  0.564  0.933  0.883  0.792  0.845 
KDDCup99  0.582  0.993  0.992  0.968  0.995  0.707  0.812  0.939  0.979  0.988 
PageBlocks  0.776  0.930  0.923  0.924  0.847  0.560  0.730  0.675  0.428  0.815 
Parkinson  0.840  0.847  0.748  0.747  0.797  0.640  0.817  0.735  0.733  0.807 
PenDigits  0.898  0.989  0.955  0.915  0.901  0.839  0.999  0.946  0.617  0.993 
Pima  0.586  0.688  0.737  0.677  0.725  0.530  0.649  0.569  0.296  0.744 
Shuttle  0.962  0.918  0.949  0.952  0.927  0.879  0.997  0.999  0.400  0.969 
SpamBase  0.521  0.662  0.781  0.775  0.735  0.536  0.750  0.766  0.621  0.786 
Stamps  0.814  0.890  0.922  0.908  0.902  0.761  0.895  0.868  0.832  0.904 
Waveform  0.729  0.737  0.709  0.751  0.743  0.532  0.801  0.591  0.709  0.800 
Wilt  0.731  0.352  0.596  0.455  0.707  0.503  0.801  0.623  0.649  0.785 
average  0.717  0.794  0.793  0.791  0.807  0.615  0.804  0.756  0.631  0.839 
Table 2: AUCs on 16 datasets with three training anomalous instances by unsupervised anomaly detection methods (LOF, OCSVM, IF, VAE, MADE) and supervised anomaly detection methods (KNN, SVM, RF, NN, Proposed). Values in bold typeface are not statistically different (at the 5% level) from the best performing method according to a paired t-test. The bottom row shows the average AUC over the datasets.
Figure 3 (panels: Annthyroid, Arrhythmia, Cardiotocography, HeartDisease, InternetAds, Ionosphere, KDDCup99, PageBlocks, Parkinson, PenDigits, Pima, Shuttle, SpamBase, Stamps, Waveform, Wilt): AUCs by the proposed method with different hyperparameter values $\lambda$; the y-axis is the AUC, the error bars show the standard error, and the horizontal line is the AUC with $\lambda = 0$.

Table 2 shows the AUC results. The proposed method achieved the highest average AUC among the ten methods. The AUC of the MADE was high compared with the other unsupervised methods, which indicates that neural autoregressive density estimators are useful for unsupervised anomaly detection. On some datasets, the AUC of the proposed method was statistically higher than that of the MADE, e.g. Annthyroid, PenDigits, and Wilt. There were no datasets on which the AUC of the MADE was statistically higher than that of the proposed method. This result indicates that the regularization term in the proposed method is effective for improving anomaly detection performance with a few labeled anomalous instances. The SVM achieved a high AUC among the supervised binary classifier based methods. However, its performance on some datasets was very low, e.g. Arrhythmia and Pima. In contrast, the proposed method achieved relatively high AUC on all of the datasets. This is likely because the proposed method incorporates the characteristics of unsupervised methods through the likelihood term in the objective function, as well as the characteristics of supervised methods through the regularization term. Table 3 shows the AUC results with (a) one and (b) five training anomalous instances for the supervised methods. The proposed method also achieved the highest average AUC in these settings. The average AUC of the proposed method increased with the number of training anomalous instances: 0.807, 0.821, 0.839, and 0.859 when zero, one, three, and five anomalous instances were used for training, respectively.
The average computational time for training the proposed method was 2.78, 0.53, 0.24, 0.03, 9.50, 0.05, 62.39, 1.36, 0.02, 4.12, 0.05, 0.09, 2.06, 0.03, 0.99, and 0.82 seconds on the Annthyroid, Arrhythmia, Cardiotocography, HeartDisease, InternetAds, Ionosphere, KDDCup99, PageBlocks, Parkinson, PenDigits, Pima, Shuttle, SpamBase, Stamps, Waveform, and Wilt datasets, respectively, using a computer with a Xeon Gold 6130 2.10 GHz CPU.
Figure 3 shows the AUCs on the test data by the proposed method with different hyperparameter values $\lambda$, trained on datasets with three anomalous instances. The best hyperparameter differed across datasets. For example, a high $\lambda$ was better on the PenDigits and Shuttle datasets, a low $\lambda$ was better on the Annthyroid and Stamps datasets, and an intermediate $\lambda$ was better on the Cardiotocography and Ionosphere datasets. This result indicates that AUC maximization without likelihood maximization, which corresponds to the proposed method with a high $\lambda$, is not effective on some datasets. The proposed method achieved high performance on various datasets by automatically adapting $\lambda$ using the validation data, which controls the balance between likelihood maximization and AUC maximization.
5 Conclusion
We have proposed a supervised anomaly detection method based on neural autoregressive models. With the proposed method, the neural autoregressive model is trained so that the likelihood of normal instances is maximized and the likelihood of anomalous instances becomes lower than that of normal instances. The proposed method can detect anomalies in regions where there are no normal instances, as well as in regions where anomalous instances are closely located. We have experimentally confirmed the effectiveness of the proposed method using 16 datasets. Although our results have been encouraging to date, our approach can be further improved in a number of ways. First, we would like to extend our framework to the semi-supervised setting [Blanchard, Lee, and Scott 2010], where unlabeled instances as well as labeled anomalous and normal instances are given. Second, we plan to incorporate other neural density estimators, including the VAE, into our framework.
References
 [Akcay, Atapour-Abarghouei, and Breckon2018] Akcay, S.; Atapour-Abarghouei, A.; and Breckon, T. P. 2018. GANomaly: Semi-supervised anomaly detection via adversarial training. arXiv preprint arXiv:1805.06725.
 [Aleskerov, Freisleben, and Rao1997] Aleskerov, E.; Freisleben, B.; and Rao, B. 1997. Cardwatch: A neural network based database mining system for credit card fraud detection. In IEEE/IAFE Computational Intelligence for Financial Engineering, 220–226.
 [An and Cho2015] An, J., and Cho, S. 2015. Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE 2:1–18.
 [Barnett and Lewis1974] Barnett, V., and Lewis, T. 1974. Outliers in statistical data. Wiley.

 [Blanchard, Lee, and Scott2010] Blanchard, G.; Lee, G.; and Scott, C. 2010. Semi-supervised novelty detection. Journal of Machine Learning Research 11(Nov):2973–3009.
 [Brefeld and Scheffer2005] Brefeld, U., and Scheffer, T. 2005. AUC maximizing support vector learning. In Proceedings of the ICML Workshop on ROC Analysis in Machine Learning.
 [Breiman2001] Breiman, L. 2001. Random forests. Machine learning 45(1):5–32.
 [Breunig et al.2000] Breunig, M. M.; Kriegel, H.P.; Ng, R. T.; and Sander, J. 2000. LOF: identifying densitybased local outliers. ACM SIGMOD Record 29(2):93–104.
 [Burda, Grosse, and Salakhutdinov2015] Burda, Y.; Grosse, R.; and Salakhutdinov, R. 2015. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.
 [Campos et al.2016] Campos, G. O.; Zimek, A.; Sander, J.; Campello, R. J.; Micenková, B.; Schubert, E.; Assent, I.; and Houle, M. E. 2016. On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Mining and Knowledge Discovery 30(4):891–927.
 [Chandola, Banerjee, and Kumar2009] Chandola, V.; Banerjee, A.; and Kumar, V. 2009. Anomaly detection: A survey. ACM Computing Surveys 41(3):15.
 [Cortes and Mohri2004] Cortes, C., and Mohri, M. 2004. AUC optimization vs. error rate minimization. In Advances in Neural Information Processing Systems, 313–320.
 [Das et al.2016] Das, S.; Wong, W.K.; Dietterich, T.; Fern, A.; and Emmott, A. 2016. Incorporating expert feedback into active anomaly discovery. In 16th International Conference on Data Mining, 853–858. IEEE.
 [Das et al.2017] Das, S.; Wong, W.-K.; Fern, A.; Dietterich, T. G.; and Siddiqui, M. A. 2017. Incorporating feedback into tree-based anomaly detection. arXiv preprint arXiv:1708.09441.
 [Dinh, Krueger, and Bengio2014] Dinh, L.; Krueger, D.; and Bengio, Y. 2014. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516.
 [Dinh, Sohl-Dickstein, and Bengio2016] Dinh, L.; Sohl-Dickstein, J.; and Bengio, S. 2016. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.
 [Dokas et al.2002] Dokas, P.; Ertoz, L.; Kumar, V.; Lazarevic, A.; Srivastava, J.; and Tan, P.N. 2002. Data mining for network intrusion detection. In NSF Workshop on Next Generation Data Mining, 21–30.
 [Eskin2000] Eskin, E. 2000. Anomaly detection over noisy data using learned probability distributions. In International Conference on Machine Learning.
 [Friedland, Gentzel, and Jensen2014] Friedland, L.; Gentzel, A.; and Jensen, D. 2014. Classifieradjusted density estimation for anomaly detection and oneclass classification. In SIAM International Conference on Data Mining, 578–586.
 [Fujimaki, Yairi, and Machida2005] Fujimaki, R.; Yairi, T.; and Machida, K. 2005. An approach to spacecraft anomaly detection problem using kernel feature space. In International Conference on Knowledge Discovery in Data Mining, 401–410.
 [Gao, Cheng, and Tan2006] Gao, J.; Cheng, H.; and Tan, P.N. 2006. A novel framework for incorporating labeled examples into anomaly detection. In Proceedings of the 2006 SIAM International Conference on Data Mining, 594–598. SIAM.
 [Germain et al.2015] Germain, M.; Gregor, K.; Murray, I.; and Larochelle, H. 2015. MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, 881–889.
 [Görnitz et al.2013] Görnitz, N.; Kloft, M.; Rieck, K.; and Brefeld, U. 2013. Toward supervised anomaly detection. Journal of Artificial Intelligence Research 46:235–262.
 [Hido et al.2011] Hido, S.; Tsuboi, Y.; Kashima, H.; Sugiyama, M.; and Kanamori, T. 2011. Statistical outlier detection using direct density ratio estimation. Knowledge and Information Systems 26(2):309–336.
 [Hodge and Austin2004] Hodge, V., and Austin, J. 2004. A survey of outlier detection methodologies. Artificial Intelligence Review 22(2):85–126.
 [Idé and Kashima2004] Idé, T., and Kashima, H. 2004. Eigenspacebased anomaly detection in computer systems. In International Conference on Knowledge Discovery and Data Mining, 440–449.
 [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. ADAM: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 [Kingma and Dhariwal2018] Kingma, D. P., and Dhariwal, P. 2018. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039.
 [Kingma and Welling2013] Kingma, D. P., and Welling, M. 2013. Autoencoding variational Bayes. arXiv preprint arXiv:1312.6114.

 [Laxhammar, Falkman, and Sviestins2009] Laxhammar, R.; Falkman, G.; and Sviestins, E. 2009. Anomaly detection in sea traffic - a comparison of the Gaussian mixture model and the kernel density estimator. In International Conference on Information Fusion, 756–763.
 [Liu, Ting, and Zhou2008] Liu, F. T.; Ting, K. M.; and Zhou, Z.-H. 2008. Isolation forest. In Proceedings of the 8th IEEE International Conference on Data Mining, 413–422. IEEE.
 [Markou and Singh2003] Markou, M., and Singh, S. 2003. Novelty detection: a review. Signal processing 83(12):2481–2497.
 [Mukkamala, Sung, and Ribeiro2005] Mukkamala, S.; Sung, A.; and Ribeiro, B. 2005. Model selection for kernel based intrusion detection systems. In Adaptive and Natural Computing Algorithms, 458–461. Springer.
 [Munawar, Vinayavekhin, and De Magistris2017] Munawar, A.; Vinayavekhin, P.; and De Magistris, G. 2017. Limiting the reconstruction capability of generative neural network using negative learning. In 27th International Workshop on Machine Learning for Signal Processing. IEEE.
 [Nadeem et al.2016] Nadeem, M.; Marshall, O.; Singh, S.; Fang, X.; and Yuan, X. 2016. Semisupervised deep neural network for network intrusion detection. In KSU Conference on Cybersecurity Education, Research and Practice.
 [Oord, Kalchbrenner, and Kavukcuoglu2016] Oord, A. v. d.; Kalchbrenner, N.; and Kavukcuoglu, K. 2016. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759.
 [Parra, Deco, and Miesbach1996] Parra, L.; Deco, G.; and Miesbach, S. 1996. Statistical independence and novelty detection with information preserving nonlinear maps. Neural Computation 8(2):260–269.
 [Patcha and Park2007] Patcha, A., and Park, J.M. 2007. An overview of anomaly detection techniques: Existing solutions and latest technological trends. Computer Networks 51(12):3448–3470.
 [Pedregosa et al.2011] Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830.
 [Pimentel et al.2018] Pimentel, T.; Monteiro, M.; Viana, J.; Veloso, A.; and Ziviani, N. 2018. A generalized active learning approach for unsupervised anomaly detection. arXiv preprint arXiv:1805.09411.
 [Raiko et al.2014] Raiko, T.; Li, Y.; Cho, K.; and Bengio, Y. 2014. Iterative neural autoregressive distribution estimator NADE-k. In Advances in Neural Information Processing Systems, 325–333.

 [Rapaka, Novokhodko, and Wunsch2003] Rapaka, A.; Novokhodko, A.; and Wunsch, D. 2003. Intrusion detection using radial basis function network on sequences of system calls. In International Joint Conference on Neural Networks, volume 3, 1820–1825.
 [Schölkopf et al.2001] Schölkopf, B.; Platt, J. C.; Shawe-Taylor, J.; Smola, A. J.; and Williamson, R. C. 2001. Estimating the support of a high-dimensional distribution. Neural Computation 13(7):1443–1471.
 [Schölkopf, Smola, and others2002] Schölkopf, B.; Smola, A. J.; et al. 2002. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press.
 [Shewhart1931] Shewhart, W. A. 1931. Economic control of quality of manufactured product. ASQ Quality Press.
 [Singh and Silakari2009] Singh, S., and Silakari, S. 2009. An ensemble approach for feature selection of cyber attack dataset. arXiv preprint arXiv:0912.1014.
 [Suh et al.2016] Suh, S.; Chae, D. H.; Kang, H.G.; and Choi, S. 2016. Echostate conditional variational autoencoder for anomaly detection. In International Joint Conference on Neural Networks, 1015–1022.
 [Uria et al.2016] Uria, B.; Côté, M.A.; Gregor, K.; Murray, I.; and Larochelle, H. 2016. Neural autoregressive distribution estimation. Journal of Machine Learning Research 17(1):7184–7220.
 [Uria, Murray, and Larochelle2013] Uria, B.; Murray, I.; and Larochelle, H. 2013. RNADE: The realvalued neural autoregressive densityestimator. In Advances in Neural Information Processing Systems, 2175–2183.
 [Van Den Oord et al.2016] Van Den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A. W.; and Kavukcuoglu, K. 2016. WaveNet: A generative model for raw audio. In SSW, 125.
 [Wong et al.2003] Wong, W.K.; Moore, A. W.; Cooper, G. F.; and Wagner, M. M. 2003. Bayesian network anomaly pattern detection for disease outbreaks. In International Conference on Machine Learning, 808–815.
 [Xu et al.2018] Xu, H.; Chen, W.; Zhao, N.; Li, Z.; Bu, J.; Li, Z.; Liu, Y.; Zhao, Y.; Pei, D.; Feng, Y.; et al. 2018. Unsupervised anomaly detection via variational autoencoder for seasonal kpis in web applications. In World Wide Web Conference, 187–196.
 [Yamanaka et al.2019] Yamanaka, Y.; Iwata, T.; Takahashi, H.; Yamada, M.; and Kanai, S. 2019. Autoencoding binary classifiers for supervised anomaly detection. arXiv preprint arXiv:1903.10709.
 [Yamanishi et al.2004] Yamanishi, K.; Takeuchi, J.I.; Williams, G.; and Milne, P. 2004. Online unsupervised outlier detection using finite mixtures with discounting learning algorithms. Data Mining and Knowledge Discovery 8(3):275–300.
 [Yan et al.2003] Yan, L.; Dodier, R. H.; Mozer, M.; and Wolniewicz, R. H. 2003. Optimizing classifier performance via an approximation to the WilcoxonMannWhitney statistic. In International Conference on Machine Learning, 848–855.
 [Yeung and Ding2003] Yeung, D.Y., and Ding, Y. 2003. Hostbased intrusion detection using dynamic and static behavioral models. Pattern recognition 36(1):229–243.