Anomaly detection is one of the most important tasks in artificial intelligence and is widely performed in many areas, such as cyber security [Kwon et al.2017], complex system management [Liu et al.2008], and material inspection [Tagawa et al.2015].
Anomaly detection methods can be categorized into supervised and unsupervised approaches. A supervised approach learns a classification rule that discriminates anomalies from normal data points [Nasrabadi2007]. Figure 1(a) illustrates the supervised approach. This approach utilizes label information, which indicates whether each data point is an anomaly or not, and can achieve high detection performance for known anomalies included in a dataset. However, a supervised approach is effective only for limited applications for two reasons. First, it cannot detect unknown anomalies that are not included in the dataset. Generally, it is difficult to obtain in advance a training dataset that covers all types of possible anomalies. For example, in malware detection, unknown anomalies appear one after another. Second, standard supervised approaches do not work well when there are far fewer anomalies than normal data points. This is called the class imbalance problem. Since anomalies rarely occur, it is difficult to collect a sufficient number of them.
On the other hand, an unsupervised approach does not require anomalies for training, since it tries to model only the normal data points. Figure 1(b) illustrates the unsupervised approach. Since this approach detects anomalies by using the difference between normal and test data points, it can detect both known anomalies and unknown anomalies that are far from normal data points. This difference is frequently called the anomaly score. To detect anomalies in high-dimensional and complex data, various methods based on deep learning techniques have been presented. One of the major unsupervised approaches based on deep learning is the autoencoder (AE) [Lyudchik2016, Sakurada and Yairi2014, Ma et al.2013]. The AE is composed of two neural networks: the encoder and the decoder. The encoder compresses data points into low-dimensional latent representations. The decoder reconstructs data points from latent representations. The parameters of the encoder and decoder are optimized by minimizing the error between data points and reconstructed data points, which is called the reconstruction error. Since the AE is trained with normal data points, it learns to reconstruct normal data points and fails to reconstruct anomalies. Therefore, the reconstruction error can be used as the anomaly score. However, an unsupervised approach is inferior to a supervised approach at detecting known anomalies since it does not utilize label information. Furthermore, even if anomalies are obtained in advance, an unsupervised approach cannot utilize these anomalies to improve detection performance. To handle this problem, the limiting reconstruction capability (LRC), which is a simple extension of the AE, has been presented [Munawar et al.2017]. The LRC maximizes the reconstruction error for anomalous data points, in addition to minimizing the reconstruction error for normal data points. As a result, the LRC can detect known anomalies that are close to normal data points. However, the LRC has a serious drawback: the training of the AE in the LRC is unstable. Since the reconstruction error is bounded below but not bounded above, the LRC tends to maximize the reconstruction error for known anomalies rather than to minimize the reconstruction error for normal data points. As a result, the LRC cannot reconstruct the normal data points well.
In this paper, we propose the Autoencoding Binary Classifier (ABC), which is a novel supervised approach that exploits the benefits of both supervised and unsupervised approaches. In the ABC, we regard the conditional probability of the label variable given a data point as a Bernoulli distribution. Its negative log-likelihood is modeled by the reconstruction error of the AE. Thus, by minimizing the negative log-likelihood, the AE in the ABC is trained so as to minimize the reconstruction error for normal data points and to maximize the reconstruction error for known anomalies. Although the behavior of the AE in the ABC is similar to that of the LRC, the training of the ABC is stable since its objective function is bounded below and above with respect to the reconstruction error. After training, we can obtain the conditional probability of the label variable given a data point, which is a more reasonable anomaly score than the reconstruction error. As shown in Figure 1(c), the ABC can detect anomalies located far from normal data points as well as anomalies that are close to the known anomalies. In addition, even if the training dataset does not contain enough known anomalies, the ABC can outperform supervised approaches since the ABC then behaves as an unsupervised approach.
The AE is a dimensionality reduction algorithm using neural networks. The AE consists of two neural networks: the encoder and the decoder. The encoder $E_\theta(\mathbf{x})$ compresses a data point $\mathbf{x}$ into a low-dimensional latent representation $\mathbf{z}$. The decoder $D_\theta(\mathbf{z})$ reconstructs the data point from the latent representation $\mathbf{z}$. In the AE, the objective function for each data point is given by
\[
\mathcal{L}_{\mathrm{AE}}(\mathbf{x}; \theta) = d\bigl(\mathbf{x}, D_\theta(E_\theta(\mathbf{x}))\bigr), \tag{1}
\]
where $d(\cdot, \cdot)$ denotes an arbitrary distance function. The $L_2$-norm is usually used. This objective function is called the reconstruction error. The parameters of the encoder and decoder are optimized by minimizing the sum of the reconstruction error over all data points.
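As a rough illustration of the reconstruction error described above, the following NumPy sketch passes a few data points through a toy encoder and decoder and measures the squared L2 distance to the input. The layer sizes, the random weights, and the choice of the squared L2 distance are assumptions made for this example and are not taken from the experiments in this paper.

```python
import numpy as np

def encoder(x, W_enc, b_enc):
    # Compress inputs into low-dimensional latent codes with a tanh layer.
    return np.tanh(x @ W_enc + b_enc)

def decoder(z, W_dec, b_dec):
    # Map latent codes back to the input space.
    return z @ W_dec + b_dec

def reconstruction_error(x, params):
    # Squared L2 distance between each input row and its reconstruction.
    W_enc, b_enc, W_dec, b_dec = params
    x_hat = decoder(encoder(x, W_enc, b_enc), W_dec, b_dec)
    return np.sum((x - x_hat) ** 2, axis=1)

rng = np.random.default_rng(0)
d_in, d_latent = 4, 2  # hypothetical dimensions
params = (
    rng.normal(scale=0.1, size=(d_in, d_latent)), np.zeros(d_latent),
    rng.normal(scale=0.1, size=(d_latent, d_in)), np.zeros(d_in),
)
x = rng.normal(size=(5, d_in))
errors = reconstruction_error(x, params)  # one non-negative value per data point
```

In practice the weights would be fitted by minimizing the sum of these errors over the training set rather than drawn at random.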
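One of the most widely used variants of the AE is the denoising autoencoder (DAE) [Vincent et al.2008], which tries to reconstruct original data points from noisy input data. The DAE is trained by minimizing the following objective function:
\[
\mathcal{L}_{\mathrm{DAE}}(\mathbf{x}; \theta) = d\bigl(\mathbf{x}, D_\theta(E_\theta(\mathbf{x} + \boldsymbol{\epsilon}))\bigr), \tag{2}
\]
where, as in Eq. (1), $E_\theta$ and $D_\theta$ denote the encoder and the decoder, $d(\cdot, \cdot)$ is the distance function, and $\boldsymbol{\epsilon}$ is noise drawn from an isotropic Gaussian distribution. Since the DAE is robust against noise, it is useful in noisy environments. The AE and DAE can be optimized with stochastic gradient descent (SGD) [Kingma and Ba2014].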
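The denoising objective can be sketched in a few lines: the input is corrupted with isotropic Gaussian noise, and the reconstruction of the corrupted input is compared with the clean input. In this hypothetical example, `reconstruct` stands in for the composition of decoder and encoder, and the noise level `sigma` and the squared L2 distance are illustrative assumptions.

```python
import numpy as np

def dae_error(x, reconstruct, sigma, rng):
    # Corrupt the input with isotropic Gaussian noise, then compare the
    # reconstruction of the noisy input against the clean input.
    noisy = x + rng.normal(scale=sigma, size=x.shape)
    return np.sum((x - reconstruct(noisy)) ** 2, axis=1)

rng = np.random.default_rng(0)
x = np.ones((3, 2))
identity = lambda v: v  # stand-in for decoder(encoder(v)); not a trained model

clean = dae_error(x, identity, sigma=0.0, rng=rng)  # no noise: zero error
noisy = dae_error(x, identity, sigma=0.5, rng=rng)  # noise passes through the identity map
```

With the identity map the noise is returned unchanged, so the error is positive; a trained DAE is meant to remove the noise and drive this error down.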
There are various anomaly detectors based on the AE [Zhou and Paffenroth2017, Xie et al.2016, Zhai et al.2016, Zong et al.2018]. The simplest way is to use the reconstruction error for anomaly detection [Sakurada and Yairi2014]. If we train the AE on normal data points, the reconstruction error for anomalies tends to be larger than that for normal data points. Therefore, we can detect anomalies by using the reconstruction error as the anomaly score. However, this approach is likely to fail to detect anomalies near the normal data points. Furthermore, even if anomalies are obtained in advance, the AE cannot utilize the label information.
The LRC is a simple extension of the AE. The LRC tries to minimize the reconstruction error of the AE for normal data points and to maximize the reconstruction error for known anomalies. Suppose that we are given a training dataset $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$, where $\mathbf{x}_i$ represents the $i$-th data point, $y_i \in \{0, 1\}$ denotes its label, $y_i = 0$ represents a normal data point, and $y_i = 1$ represents an anomaly. The objective function of the LRC to be minimized is
\[
\mathcal{L}_{\mathrm{LRC}}(\mathbf{x}_i, y_i; \theta) = (1 - y_i)\,\mathcal{L}_{\mathrm{AE}}(\mathbf{x}_i; \theta) - y_i\,\mathcal{L}_{\mathrm{AE}}(\mathbf{x}_i; \theta), \tag{3}
\]
where $\mathcal{L}_{\mathrm{AE}}(\mathbf{x}_i; \theta)$ represents the reconstruction error of the AE for data point $\mathbf{x}_i$. By using the reconstruction error as the anomaly score, the LRC can detect both known and unknown anomalies. However, the training of the LRC is unstable. As shown in Eq. (3), the LRC tries to maximize the reconstruction error for known anomalies and minimize the reconstruction error for normal data points. Since the reconstruction error is bounded below but not bounded above, the LRC tends to maximize the reconstruction error for known anomalies rather than to minimize the reconstruction error for normal data points. As a result, the LRC cannot reconstruct the normal data points well.
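The instability argument can be checked numerically. The sketch below evaluates a per-point LRC loss for a range of reconstruction errors, assuming the label convention that 0 marks a normal data point and 1 marks a known anomaly. The anomaly term keeps decreasing as the error grows, whereas the normal term can never drop below zero, so a minimizer can always gain more by inflating the error on anomalies.

```python
import numpy as np

def lrc_loss(recon_error, y):
    # y = 0: normal point, the reconstruction error is minimized.
    # y = 1: known anomaly, the reconstruction error is maximized
    #        (it enters the loss with a negative sign).
    return (1 - y) * recon_error - y * recon_error

errors = np.array([0.1, 1.0, 10.0, 1000.0])
anomaly_terms = lrc_loss(errors, y=1)  # decreases without bound as the error grows
normal_terms = lrc_loss(errors, y=0)   # bounded below by zero
```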
3 Proposed methods
We propose a new supervised anomaly detector based on the AE, the Autoencoding Binary Classifiers (ABC), which can accurately detect known and unknown anomalies.
The ABC uses a probabilistic binary classifier for anomaly detection. The probabilistic binary classifier predicts the label $y$ from the data point $\mathbf{x}$, with $y = 0$ for a normal data point and $y = 1$ for an anomaly. We assume that the conditional probability of $y$ given $\mathbf{x}$ follows the Bernoulli distribution,
\[
p(y \mid \mathbf{x}) = S(\mathbf{x})^{1 - y}\,\bigl(1 - S(\mathbf{x})\bigr)^{y}, \tag{4}
\]
where $S(\mathbf{x})$ is called the regression function, and its output range is $[0, 1]$. We can regard $p(y = 1 \mid \mathbf{x}) = 1 - S(\mathbf{x})$ as the probability that $\mathbf{x}$ is an anomaly. Since the regression function gives low values for known anomalies and high values for normal data points when the likelihood of Eq. (4) is maximized, the ABC can detect known anomalies. To detect unknown anomalies, we want the regression function to give low values for unknown anomalies as well. The ABC uses the reconstruction error of the AE for the regression function as follows:
\[
S_\theta(\mathbf{x}) = \exp\bigl(-\mathcal{L}(\mathbf{x}; \theta)\bigr),
\]
where $\theta$ is the parameter of the AE. The ABC can use the reconstruction error of the AE (Eq. (1)) or the DAE (Eq. (2)) for $\mathcal{L}(\mathbf{x}; \theta)$. This function takes the value one when the reconstruction error is zero, and approaches zero asymptotically as the reconstruction error goes to infinity. The range of this regression function is $(0, 1]$. Since this regression function is based on the AE, it gives low values for unknown anomalies.
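The behavior of the regression function described above can be illustrated with a short sketch, assuming it has the form exp(-error), which matches the stated properties (value one at zero error, decaying toward zero as the error grows). The sample error values are arbitrary.

```python
import numpy as np

def regression_function(recon_error):
    # Maps a non-negative reconstruction error to a value in (0, 1]:
    # 1 at zero error, approaching 0 as the error grows.
    return np.exp(-recon_error)

errors = np.array([0.0, np.log(2.0), 5.0, 50.0])
s = regression_function(errors)  # high for well-reconstructed points, low otherwise
```

Under this form, a reconstruction error of ln 2 corresponds to a regression value of exactly one half.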
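Under this definition, we minimize the sum of the following negative log-likelihood of the conditional probability:
\[
\mathcal{L}_{\mathrm{ABC}}(\mathbf{x}_i, y_i; \theta) = -\log p(y_i \mid \mathbf{x}_i) = (1 - y_i)\,\mathcal{L}(\mathbf{x}_i; \theta) - y_i \log\bigl(1 - e^{-\mathcal{L}(\mathbf{x}_i; \theta)}\bigr), \tag{8}
\]
where $\mathcal{L}(\mathbf{x}_i; \theta)$ denotes the reconstruction error of the AE for data point $\mathbf{x}_i$. This objective function is equal to that of the AE (Eq. (1)) when the input data point is normal ($y_i = 0$). Therefore, if the training dataset consists of only normal data points, this model is identical to the AE. On the other hand, when the input data point is an anomaly ($y_i = 1$), this model tries to maximize the reconstruction error $\mathcal{L}(\mathbf{x}_i; \theta)$. The reason is that $-\log\bigl(1 - e^{-\mathcal{L}}\bigr)$ is monotonically decreasing in the reconstruction error, as shown in Figure 2. Note that the reconstruction errors for normal and anomalous data points are reasonably balanced by using the likelihood (Eq. (8)), and thus we can train the ABC stably, unlike the LRC. Furthermore, the ABC also works well for imbalanced data since it exploits the reconstruction error of the unsupervised AE.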
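To make the stability claim concrete, the sketch below evaluates a per-point ABC loss and its derivative with respect to the reconstruction error. It assumes the label convention 0 = normal, 1 = anomaly and the regression function exp(-error); the small clipping constant `eps` is an implementation detail added here to avoid log(0) and is not part of the paper. For anomalies the loss stays non-negative and its gradient shrinks toward zero as the error grows, so there is little incentive to keep inflating the error.

```python
import numpy as np

def abc_loss(recon_error, y, eps=1e-12):
    # y = 0 (normal): the loss is the reconstruction error itself.
    # y = 1 (anomaly): the loss is -log(1 - exp(-error)); expm1 keeps
    # the computation accurate when the error is small.
    err = np.maximum(recon_error, eps)  # avoid log(0) at exactly zero error
    anomaly_term = -np.log(-np.expm1(-err))
    return (1 - y) * recon_error + y * anomaly_term

def abc_grad_wrt_error(recon_error, y, eps=1e-12):
    # Derivative of the loss with respect to the reconstruction error:
    # +1 for normal points, -1 / (exp(error) - 1) for anomalies.
    err = np.maximum(recon_error, eps)
    return (1 - y) - y / np.expm1(err)

errors = np.array([0.01, 0.1, 1.0, 10.0, 50.0])
loss_anomaly = abc_loss(errors, y=1)            # decreasing, bounded below by 0
grad_anomaly = abc_grad_wrt_error(errors, y=1)  # negative, vanishing for large errors
loss_normal = abc_loss(errors, y=0)             # identical to the AE objective
```

By contrast, an objective that enters the anomaly error with a plain negative sign has a constant gradient of -1 however large the error already is.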
Our model can be optimized with SGD by minimizing the sum of Eq. (8) over all data points. Algorithm 1 shows the pseudocode of the proposed ABC, where $K$ is the minibatch size and $\theta$ denotes the parameters of the AE in our model. After training, this model can detect anomalies by using the conditional probability $p(y = 1 \mid \mathbf{x})$ for each data point. $p(y = 1 \mid \mathbf{x})$ is a more reasonable anomaly score than the reconstruction error because we can naturally regard the data point as an anomaly when $p(y = 1 \mid \mathbf{x}) > 0.5$.
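The scoring step after training can be sketched as follows, assuming the anomaly probability is 1 - exp(-error) and that a data point is flagged when this probability exceeds one half. Both the functional form and the 0.5 threshold are assumptions of this example. Under them, the decision is equivalent to checking whether the reconstruction error exceeds ln 2.

```python
import numpy as np

def anomaly_probability(recon_error):
    # p(anomaly | x) = 1 - exp(-error), computed with expm1 for accuracy.
    return -np.expm1(-recon_error)

errors = np.array([0.05, 0.5, np.log(2.0) + 1e-6, 3.0])
p_anomaly = anomaly_probability(errors)
flagged = p_anomaly > 0.5  # same decision as errors > ln 2
```

The probability gives a fixed, interpretable decision boundary, whereas a raw reconstruction error needs a threshold tuned per dataset.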
4 Related Work
There is a large literature on anomaly detection. Here, we briefly review anomaly detection methods by categorizing them into supervised and unsupervised approaches. In this paper, we define a supervised approach as a method that requires anomalies for training, and an unsupervised approach as a method that can be trained without labeled anomalies.
If the label information is given perfectly, anomaly detection can be regarded as a binary classification problem. In this situation, supervised classifiers such as support vector machines [Hearst et al.1998], gradient tree boosting [Friedman2002], and deep neural networks [Dreiseitl and Ohno-Machado2002] are usually used. However, these standard supervised classifiers cannot detect unknown anomalies accurately and do not work well in class imbalance situations. There are several approaches for imbalanced data, including cost-sensitive approaches and ensemble approaches such as random undersampling boost [Seiffert et al.2010], although these approaches do not aim to detect unknown anomalies. Our ABC also works well for imbalanced data and can detect unknown anomalies since it exploits the reconstruction error of the unsupervised AE. To achieve high detection performance when label information is available for only part of the dataset, semi-supervised approaches [Kiryo et al.2017] that utilize both labeled and unlabeled data have been presented. The semi-supervised situation, in which unlabeled data can be used, is out of the scope of this paper.
Since the label information is rarely given perfectly in real situations, unsupervised approaches such as the local outlier factor [Breunig et al.2000], one-class support vector machines [Tax and Duin2004], and the isolation forest method [Liu et al.2008] are usually used. In particular, neural network based methods, typically the AE and the variational autoencoder (VAE) [Kingma and Welling2013], have succeeded in detecting anomalies in high-dimensional datasets. However, unsupervised approaches do not detect known anomalies as accurately as supervised approaches. To solve this problem, the LRC has been presented, as described in Section 2. This approach is similar to ours. However, the training of the LRC becomes unstable since the reconstruction error of the AE in the LRC is bounded below but not bounded above. In contrast, our ABC can be trained stably since the objective function of the ABC is bounded above with respect to the reconstruction error, as described in Section 3.
We used the following four datasets.
2D-Toy. In order to explain the strengths of our approach, we use a simple two-dimensional dataset consisting of normal data points, known anomalies that are close to normal data points, and unknown anomalies that are far from normal data points. Fig. 3(a) shows this dataset. We generated two interleaving half-circle distributions near the . We regarded samples from the upper and lower distributions as normal data points and known anomalies, respectively. For unknown anomalies, we generated samples from a Gaussian distribution with a mean and a standard deviation of .
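A dataset of this shape can be generated along the following lines. This is only a sketch: the sample counts, the noise level, the offsets of the lower half circle, and the mean and standard deviation of the Gaussian for unknown anomalies are all hypothetical values chosen for illustration, since they are not recoverable from the description above.

```python
import numpy as np

def half_circle(n, upper, rng, noise):
    # Sample n noisy points on an upper or a shifted lower half circle,
    # so that the two arcs interleave.
    t = rng.uniform(0.0, np.pi, size=n)
    if upper:
        pts = np.column_stack([np.cos(t), np.sin(t)])
    else:
        pts = np.column_stack([1.0 - np.cos(t), 0.5 - np.sin(t)])
    return pts + rng.normal(scale=noise, size=pts.shape)

rng = np.random.default_rng(0)
normal = half_circle(200, upper=True, rng=rng, noise=0.05)   # normal data points
known = half_circle(200, upper=False, rng=rng, noise=0.05)   # known anomalies
# Unknown anomalies: a Gaussian blob placed away from both arcs
# (hypothetical mean and standard deviation).
unknown = rng.normal(loc=[0.0, 3.0], scale=0.2, size=(50, 2))
```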
MNIST. The MNIST dataset consists of images of hand-written digits from 0 to 9 [LeCun et al.1998]. In this data, we used the digits for normal data points, for known anomalies, and for unknown anomalies. We used of this dataset for training and the remaining for testing.
KDD'99. The KDD'99 data was generated using a closed network and hand-injected attacks for evaluating the performance of supervised network intrusion detection. The dimensionality of the data point was . We used the '10 percent dataset' for training and the 'corrected dataset' for testing. (This dataset is available at http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.) For simplicity, we removed three categorical features ('protocol type', 'service' and 'flag') and duplicated data points. We used 'neptune', which is the largest anomaly class, for known anomalies, and used attacks included in 'R2L' for unknown anomalies.
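The preprocessing described above (dropping the three categorical features and removing duplicates) could look roughly like this in plain Python. The sketch assumes that the categorical features sit at column indices 1, 2, and 3, as in the standard KDD'99 column layout, and that duplicates are removed after the columns are dropped; the order of these two steps is not stated in the text. The example rows are shortened, made-up records rather than real data.

```python
def preprocess(rows, categorical_idx=(1, 2, 3)):
    # Drop the categorical columns, then keep the first occurrence of
    # every remaining row (order-preserving de-duplication).
    seen, cleaned = set(), []
    for row in rows:
        kept = tuple(v for i, v in enumerate(row) if i not in categorical_idx)
        if kept not in seen:
            seen.add(kept)
            cleaned.append(kept)
    return cleaned

# Hypothetical, truncated records: (duration, protocol, service, flag, bytes, label)
rows = [
    (0, "tcp", "http", "SF", 181, "normal."),
    (0, "udp", "http", "SF", 181, "normal."),   # duplicate once categoricals are gone
    (0, "tcp", "private", "S0", 0, "neptune."),
]
cleaned = preprocess(rows)
```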
CIFAR10. CIFAR10 is a collection of images that contains 60,000 32×32 color images in 10 different classes. We used 'data_batch_' for training, and 'test_batch' for testing. (This dataset is available at https://www.cs.toronto.edu/~kriz/CIFAR.html.) We used a set of 'automobile' images for normal data points, 'truck' for known anomalies, and 'dog' for unknown anomalies.
The details of the datasets are listed in Table 1. Except for the 2D-Toy dataset, we designed the experiments so that detecting unknown anomalies is more difficult than detecting known anomalies in an unsupervised fashion.
5.2 Comparison methods
We compared our approach with the following supervised and unsupervised methods:
Supervised methods. As the supervised approaches, we used the LRC, Support Vector Machine (SVM) classifier [Hearst et al.1998], Gradient Tree Boosting (GTB) [Friedman2002, Chen and Guestrin2016], Deep Neural Network (DNN) [Dreiseitl and Ohno-Machado2002], and Random undersampling boost (RUSBoost) [Seiffert et al.2010].
We conducted experiments in the following three settings:
Setting 1. To evaluate detection performance for both known and unknown anomalies, we divided datasets into three subsets: normal data points, known anomalies and unknown anomalies. The normal data points and known anomalies were used for training supervised methods, and only normal data points were used for training unsupervised methods. The unknown anomalies were only used for testing. In this setting, we regard the unknown anomalies as novel anomalies.
Setting 2. This is the same as setting 1 except that unknown anomalies were included in the normal data points. This is a more realistic setting than setting 1. The remaining unknown anomalies were used for testing.
Setting 3. To measure the detection performance of supervised approaches in class imbalance situations, we evaluated their performance while changing the number of known anomalies on the MNIST dataset under the conditions of setting 1.
We evaluated the detection performance by using the area under the receiver operating characteristic curve (AUROC).
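For reference, the AUROC can be computed directly from anomaly scores and binary labels as the probability that a randomly chosen anomaly receives a higher score than a randomly chosen normal data point, with ties counted as one half. The sketch below is a simple quadratic-time implementation of that definition; the scores and labels are made-up values, with label 1 assumed to mark anomalies.

```python
import numpy as np

def auroc(scores, labels):
    # Fraction of (anomaly, normal) pairs in which the anomaly has the
    # higher score; tied pairs count as one half.
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
value = auroc(scores, labels)  # 3 of the 4 pairs are ordered correctly: 0.75
```

Because the AUROC depends only on the ranking of scores, it allows methods with differently scaled anomaly scores to be compared without choosing a threshold.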
For our approach, we used neural networks with two hidden layers ( hidden units for 2D-Toy, and hidden units for the other datasets) for the encoder and the decoder. We used the -norm as the distance function of Eq. (1) and Eq. (2). We also evaluated our ABC with the DAE with noise from a Gaussian distribution with a standard deviation of . We used a hyperbolic tangent for the activation function. As the optimizer, we used Adam [Kingma and Ba2014] with a batch size of 100. The number of latent variables was changed in accordance with the datasets: we set the dimension of $\mathbf{z}$ to one for 2D-Toy and to for the other datasets. The maximum number of epochs was set to for all datasets, and we used early stopping on the basis of the validation data [Prechelt1998]. We used the same network architecture for the AE, DAE, and LRC. We used of the training dataset as validation data. Each attribute was linearly normalized into a range by min-max scaling except for 2D-Toy. We ran all experiments five times each.
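The min-max scaling step mentioned above can be sketched as follows. The example assumes a target range of [0, 1] and assumes that the minimum and maximum are estimated on the training split and then reused for the test split; neither detail is specified in the text. Constant attributes are given a span of one to avoid division by zero.

```python
import numpy as np

def fit_min_max(train):
    # Per-attribute minimum and span, estimated on the training data only.
    lo = train.min(axis=0)
    hi = train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant attributes
    return lo, span

def apply_min_max(x, lo, span):
    # Linearly rescale each attribute; training data land in [0, 1].
    return (x - lo) / span

train = np.array([[0.0, 10.0], [5.0, 10.0], [10.0, 10.0]])
test = np.array([[2.5, 10.0]])
lo, span = fit_min_max(train)
train_scaled = apply_min_max(train, lo, span)
test_scaled = apply_min_max(test, lo, span)
```

Test values outside the training range fall outside [0, 1] under this scheme, which is usually acceptable for anomaly detection since such values are themselves informative.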
Figures 3(b)-(d) show the heatmaps of anomaly scores on 2D-Toy for the DNN, the AE, and our ABC in setting 1. Table 2 shows the AUROC of the proposed method and the comparison methods in settings 1 and 2. We used bold to highlight the best result and the results that are not statistically different from the best result according to a pair-wise t-test. We used as the p-value.
First, we focus on the supervised approaches: DNN, SVM, GTB, and RUSBoost. Figure 3(b) shows the heatmap of anomaly scores by the DNN, which achieved the highest performance among supervised approaches for detecting known anomalies. This result indicates that supervised approaches can discriminate known anomalies from normal data points accurately. However, these approaches failed to detect unknown anomalies on the opposite side of the known anomalies. Table 2 also shows that these approaches can detect known anomalies but not unknown anomalies accurately on various datasets. (Note that although the SVM classifier could detect unknown anomalies among supervised approaches, it performed inferiorly to our ABC on average.) In contrast to these approaches, our ABC can detect known anomalies as accurately as supervised approaches and unknown anomalies better than these approaches.
Second, we focus on the unsupervised approaches: AE, DAE, OCSVM, and IF. Figure 3(c) shows the heatmap of anomaly scores by the AE, which achieved the highest performance among unsupervised approaches. In contrast to supervised approaches, unsupervised approaches can detect unknown anomalies that are located far from normal data points. However, these approaches cannot detect known anomalies as accurately as supervised approaches: they are likely to fail to detect anomalies close to normal data points since these approaches use the difference between normal and observed data points for detection. Table 2 also shows that these approaches can detect unknown anomalies but not known anomalies accurately. Furthermore, on the CIFAR10 dataset, these approaches perform poorly because this dataset is the most complicated and heavily overlapping dataset in our experiments. Meanwhile, our ABC achieved high detection performance for both known and unknown anomalies on all datasets.
Third, we focus on the LRC. The LRC shows the same performance tendency as our ABC. The LRC can detect both known and unknown anomalies. However, the variance of its performance is larger than that of the ABC, and sometimes it can detect neither known nor unknown anomalies. Furthermore, it performed inferiorly to our ABC in all situations. To compare the stability of training of the LRC and the ABC, we plotted in Figure 5 the reconstruction error for normal data points and known anomalies on the 2D-Toy dataset. Figure 5(a) shows the mean reconstruction error of the LRC. As the training proceeds, the LRC stops reconstructing normal data points. The reason is that the LRC tends to maximize the reconstruction error for known anomalies rather than to minimize the reconstruction error for normal data points. This result indicates that the training of the AE in the LRC is unstable. In contrast, the ABC can minimize the reconstruction error for normal data points and maximize the reconstruction error for known anomalies, as shown in Figure 5(b). This indicates that the training of the AE in the ABC is stable.
Fourth, we focus on our ABC. Figure 3(d) shows the heatmap of anomaly scores by our ABC. In contrast to the AE, the anomaly scores by our ABC are low for normal data points and high for known anomalies. The reason is that we train the AE in our ABC to minimize the reconstruction error for normal data points and maximize the reconstruction error for known anomalies. In addition, the anomaly scores of our ABC for unknown anomalies are also high since our approach fails to reconstruct data points not included in the training dataset. Therefore, our ABC can detect both known and unknown anomalies accurately. Table 2 shows that our ABC achieved performance almost equal to or better than that of the other approaches. These results indicate that our ABC is useful in various tasks.
Figure 4 shows the relationship between the AUROC and the number of known anomalies used for training in setting 3. Our ABC maintains a high detection performance, whereas the other supervised approaches perform poorly as the number of known anomalies decreases, even RUSBoost, which is designed to work well in class imbalance situations. This result indicates that our ABC is useful when we cannot obtain enough anomalies.
These results indicate that our approach is a good alternative to other approaches: our approach can detect both known and unknown anomalies accurately, and it works well when the number of known anomalies is insufficient.
In this paper, we introduced a new anomaly detector, the Autoencoding Binary Classifier (ABC). We modeled normal data points with the unsupervised approach, and made the model fail to reconstruct known anomalies by the supervised approach. We assumed that labels follow a Bernoulli distribution, and modeled the negative log-likelihood of the normal label given a data point by using the reconstruction error of the AE. By maximizing the likelihood of the conditional probability, our AE was trained so as to minimize the reconstruction error for normal data points and to maximize the reconstruction error for known anomalies. Since the proposed method learns to reconstruct normal data points and does not reconstruct known and unknown anomalies, it can detect both known and unknown anomalies accurately. The training of the ABC is stable, and its detection performance is higher than that of the LRC. In addition, since our approach corresponds to the AE when there are no known anomalies, our approach works well when we cannot obtain a sufficient number of known anomalies in advance. We experimentally showed that our approach can detect both known and unknown anomalies accurately and that it works well even when the number of known anomalies is insufficient. In the future, we will try to apply our approach to various anomaly detection tasks such as network security and malware detection.
- [Breunig et al.2000] Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. Lof: identifying density-based local outliers. In ACM Sigmod Record, volume 29, pages 93–104. ACM, 2000.
- [Chen and Guestrin2016] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.
- [Dreiseitl and Ohno-Machado2002] Stephan Dreiseitl and Lucila Ohno-Machado. Logistic regression and artificial neural network classification models: a methodology review. Journal of Biomedical Informatics, 35(5-6):352–359, 2002.
- [Friedman2002] Jerome H Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378, 2002.
- [Hearst et al.1998] Marti A. Hearst, Susan T Dumais, Edgar Osuna, John Platt, and Bernhard Scholkopf. Support vector machines. IEEE Intelligent Systems and their applications, 13(4):18–28, 1998.
- [Hinton and Salakhutdinov2006] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
- [Kingma and Ba2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [Kingma and Welling2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- [Kiryo et al.2017] Ryuichi Kiryo, Gang Niu, Marthinus C du Plessis, and Masashi Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In Advances in Neural Information Processing Systems, pages 1675–1685, 2017.
- [Kwon et al.2017] Donghwoon Kwon, Hyunjoo Kim, Jinoh Kim, Sang C Suh, Ikkyun Kim, and Kuinam J Kim. A survey of deep learning-based network anomaly detection. Cluster Computing, pages 1–13, 2017.
- [LeCun et al.1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- [Liu et al.2008] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on, pages 413–422. IEEE, 2008.
- [Lyudchik2016] Olga Lyudchik. Outlier detection using autoencoders. Technical report, 2016.
- [Ma et al.2013] Yunlong Ma, Peng Zhang, Yanan Cao, and Li Guo. Parallel auto-encoder for efficient outlier detection. In Big Data, 2013 IEEE International Conference on, pages 15–17. IEEE, 2013.
- [Munawar et al.2017] Asim Munawar, Phongtharin Vinayavekhin, and Giovanni De Magistris. Limiting the reconstruction capability of generative neural network using negative learning. In Machine Learning for Signal Processing (MLSP), 2017 IEEE 27th International Workshop on, pages 1–6. IEEE, 2017.
- [Nasrabadi2007] Nasser M Nasrabadi. Pattern recognition and machine learning. Journal of electronic imaging, 16(4):049901, 2007.
- [Prechelt1998] Lutz Prechelt. Automatic early stopping using cross validation: quantifying the criteria. Neural Networks, 11(4):761–767, 1998.
- [Sakurada and Yairi2014] Mayu Sakurada and Takehisa Yairi. Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, page 4. ACM, 2014.
- [Seiffert et al.2010] Chris Seiffert, Taghi M Khoshgoftaar, Jason Van Hulse, and Amri Napolitano. Rusboost: A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 40(1):185–197, 2010.
- [Tagawa et al.2015] Takaaki Tagawa, Yukihiro Tadokoro, and Takehisa Yairi. Structured denoising autoencoder for fault detection and analysis. In Asian Conference on Machine Learning, pages 96–111, 2015.
- [Tax and Duin2004] David MJ Tax and Robert PW Duin. Support vector data description. Machine Learning, 54(1):45–66, 2004.
- [Vincent et al.2008] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM, 2008.
- [Xie et al.2016] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning, pages 478–487, 2016.
- [Zhai et al.2016] Shuangfei Zhai, Yu Cheng, Weining Lu, and Zhongfei Zhang. Deep structured energy based models for anomaly detection. arXiv preprint arXiv:1605.07717, 2016.
- [Zhou and Paffenroth2017] Chong Zhou and Randy C Paffenroth. Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 665–674. ACM, 2017.
- [Zong et al.2018] Bo Zong, Qi Song, Martin Renqiang Min, Wei Cheng, Cristian Lumezanu, Daeki Cho, and Haifeng Chen. Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In International Conference on Learning Representations, 2018.