1 Introduction
Anomaly detection aims to identify abnormal patterns that deviate significantly from normal behavior. It is ubiquitous across application domains such as cybersecurity [15], medical care [19], and surveillance video profiling [14]. Formally, the anomaly detection problem can be viewed as density estimation over the data distribution [23]: anomalies tend to reside in areas of low probability density. Although anomaly detection has been well studied in the machine learning community, conducting unsupervised anomaly detection effectively on highly complex and unstructured data remains a challenge.
Unsupervised anomaly detection aims to detect outliers without relying on labels. It targets the scenario in which only a small number of labeled anomalous samples are available alongside plenty of unlabeled data, which is common in real-world applications. Existing methods for unsupervised anomaly detection can be divided into three categories: reconstruction-based methods, clustering-based methods, and one-class classification-based methods. Reconstruction-based methods, such as PCA [5] based approaches [18, 10] and autoencoder-based approaches [21, 22, 23, 20], assume that outliers cannot be effectively reconstructed from compressed low-dimensional projections. Clustering-based methods [17, 6] aim at density estimation of data points and usually adopt a two-step strategy [3] that first performs dimensionality reduction and then clustering. In contrast, one-class classification-based methods [7, 11, 1] try to learn a discriminative boundary between normal and abnormal instances.

Although these methods have had their fair share of success in anomaly detection, most of them neglect the complex correlation among data samples. As shown in Fig. 1, conventional methods conduct feature learning on the original observed feature space of the data samples and ignore the correlation among similar samples. This correlation can be exploited during feature learning by propagating more representative features from the neighbors, which yields high-quality embeddings for anomaly detection. However, modeling the correlation among samples differs substantially from conventional feature learning, because a highly non-linear structure needs to be captured. Therefore, how to effectively incorporate both the original features and the relational structure of samples into an integrated feature learning framework for anomaly detection is still an open problem.
To alleviate these problems, we propose Correlation-aware unsupervised Anomaly detection via Deep Gaussian Mixture Model (CADGMM), which considers both the original features and the complex correlation among data samples for feature learning. Specifically, the relations among data samples are first expressed as a graph structure, in which each node denotes a sample and each edge denotes the correlation between two samples in the feature space. Then, a dual-encoder consisting of a graph encoder and a feature encoder is employed to jointly encode both the features and the correlation of samples into a low-dimensional latent space, followed by a decoder for data reconstruction. Finally, a separate estimation network, acting as a Gaussian Mixture Model, estimates the density of the learned latent embedding. To verify the effectiveness of our method, we conduct experiments on multiple real-world datasets. The experimental results demonstrate that, by considering the correlation among data samples, CADGMM outperforms the state of the art on unsupervised anomaly detection tasks.
2 Notations and Problem Statement
In this section, we formally define the frequently used notations and the studied problem.
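Definition 1
Graph: A graph is denoted as $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathbf{X})$ with $N$ nodes and $E$ edges, in which $\mathcal{V} = \{v_i \mid i = 1, \dots, N\}$ is a set of nodes, $\mathcal{E} = \{e_i \mid i = 1, \dots, E\}$ is a set of edges, and $e_i = (v_{i_1}, v_{i_2})$ represents an edge between node $v_{i_1}$ and node $v_{i_2}$. $\mathbf{X}$ is an $N \times F$ feature matrix with each row corresponding to the content feature of a node, where $F$ indicates the dimension of features. The adjacency matrix of a graph is denoted as $\mathbf{A} \in \mathbb{R}^{N \times N}$, which represents the topology of the graph. The scalar element $\mathbf{A}_{i,j} = 1$ if there exists an edge between node $v_i$ and node $v_j$; otherwise, $\mathbf{A}_{i,j} = 0$.
Problem 1
Anomaly detection: Given a set of input samples $\mathcal{X} = \{x_i \mid i = 1, \dots, N\}$, each of which is associated with an $F$-dimensional feature $\mathbf{X}_i \in \mathbb{R}^{F}$, we aim to learn a score function $u(\mathbf{X}_i): \mathbb{R}^{F} \to \mathbb{R}$ to classify sample $x_i$ based on the threshold $\lambda$:
$$ y_i = \begin{cases} 1, & \text{if } u(\mathbf{X}_i) \ge \lambda, \\ 0, & \text{otherwise,} \end{cases} \tag{1} $$
where $y_i$ denotes the label of sample $x_i$, with 0 being the normal class and 1 the anomalous class.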
3 Method
In this section, we introduce the proposed CADGMM in detail. CADGMM is an end-to-end joint representation learning framework for unsupervised anomaly detection. As shown in Fig. 2, CADGMM consists of three modules: the dual-encoder, the feature decoder, and the estimation network. Specifically, the relations among data samples in the original feature space are first expressed in the form of a graph structure. In the constructed graph, each node denotes a sample and each edge denotes the correlation between two samples in the feature space. Then, a dual-encoder consisting of a graph encoder and a feature encoder is employed to jointly encode both the feature and correlation information of samples into a low-dimensional latent space, followed by a feature decoder for sample reconstruction. Finally, a separate estimation network estimates the density of the learned latent embedding in the framework of a Gaussian Mixture Model, and anomalies are detected by comparing the energy of the samples against a given threshold.
3.1 Graph Construction
To explore the correlation among non-structured data samples for feature learning, we explicitly construct a graph structure that connects similar samples in the feature space. More specifically, given a set of input samples $\mathcal{X} = \{x_i \mid i = 1, \dots, N\}$, we apply the K-nearest-neighbor (K-NN) algorithm to each sample $x_i$ to determine its $K$ nearest neighbors $\mathcal{N}_i = \{x_{i_k} \mid k = 1, \dots, K\}$ in the feature space. Then, an undirected edge is assigned between $x_i$ and each neighbor $x_{i_k}$. Finally, an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathbf{X})$ is constructed, with $\mathcal{V}$ being the node set, $\mathcal{E}$ being the edge set, and $\mathbf{X}$ being the feature matrix of the nodes. The constructed graph captures the feature affinities among samples explicitly, and these affinities can be used during feature learning by performing message propagation over the graph.
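The graph construction step can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the function name `knn_adjacency`, the use of squared Euclidean distance, and the rule that an edge exists when either endpoint selects the other are choices made here, not details confirmed by the text.

```python
import numpy as np

def knn_adjacency(X: np.ndarray, k: int) -> np.ndarray:
    """Build a symmetric 0/1 adjacency matrix from a K-NN search.

    X has shape (N, F); row i is the feature vector of sample i.
    Squared Euclidean distance is an assumption of this sketch.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances, shape (N, N).
    sq_norms = np.sum(X ** 2, axis=1)
    dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Exclude each sample from its own neighbor list.
    np.fill_diagonal(dist, np.inf)
    # Indices of the k nearest neighbors of every sample.
    neighbors = np.argsort(dist, axis=1)[:, :k]
    A = np.zeros((n, n), dtype=np.int64)
    rows = np.repeat(np.arange(n), k)
    A[rows, neighbors.ravel()] = 1
    # Undirected graph: keep an edge if either endpoint selected the other.
    return np.maximum(A, A.T)

# Three nearby points and one distant point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
A = knn_adjacency(X, k=2)
```

Because the graph is built per batch, the dense N x N distance matrix stays small; for very large N an approximate nearest-neighbor index would be the more usual choice.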
3.2 DualEncoder
To obtain sufficiently representative high-level sample embeddings, the dual-encoder consists of a feature encoder and a graph encoder, which encode the original features of the samples and the correlation among them, respectively.
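To encode the original sample features $\mathbf{X}$, the feature encoder employs an $L_V$-layer Multi-Layer Perceptron (MLP) to conduct a non-linear feature transform:
$$ \mathbf{Z}^{V(l_V)} = \sigma\left(\mathbf{Z}^{V(l_V - 1)} \mathbf{W}^{V(l_V)} + \mathbf{b}^{V(l_V)}\right) \tag{2} $$
where $\mathbf{Z}^{V(l_V - 1)}$, $\mathbf{Z}^{V(l_V)}$, $\mathbf{W}^{V(l_V)}$, and $\mathbf{b}^{V(l_V)}$ are the input, the output, the trainable weight matrix, and the bias of the $l_V$-th layer, respectively, $l_V \in \{1, \dots, L_V\}$, and $\mathbf{Z}^{V(0)} = \mathbf{X}$ is the initial input of the encoder. $\sigma(\cdot)$ denotes an activation function such as ReLU or Tanh. The final feature embedding $\mathbf{Z}^{V} = \mathbf{Z}^{V(L_V)}$ is obtained from the output of the last layer of the MLP.

To encode the correlation among the samples, a graph attention layer [16] is employed to adaptively aggregate the representations of neighbor nodes by applying a shared attention mechanism to the nodes:
$$ w_{i,j} = \mathrm{attn}(\mathbf{X}_i, \mathbf{X}_j) = \sigma\left(\mathbf{a}^{\top} \left[\mathbf{W}_c \mathbf{X}_i \,\|\, \mathbf{W}_c \mathbf{X}_j\right]\right) \tag{3} $$
where $w_{i,j}$ indicates the importance weight of node $v_j$ to node $v_i$, $\mathrm{attn}(\cdot)$ denotes the neural network parametrized by the weights $\mathbf{a} \in \mathbb{R}^{2 D_c}$ and $\mathbf{W}_c \in \mathbb{R}^{D_c \times F}$ that are shared by all nodes, $D_c$ is the number of hidden neurons in $\mathrm{attn}(\cdot)$, and $\|$ denotes the concatenation operation. Then, the final importance weight $\alpha_{i,j}$ is obtained by normalizing with the softmax function:
$$ \alpha_{i,j} = \frac{\exp(w_{i,j})}{\sum_{k \in \mathcal{N}_i} \exp(w_{i,k})} \tag{4} $$
where $\mathcal{N}_i$ denotes the neighbors of node $v_i$, which are given by the adjacency matrix $\mathbf{A}$. The final node embedding $\mathbf{Z}^{E}$ is obtained as the weighted sum based on the learned importance weights:
$$ \mathbf{Z}^{E}_i = \sigma\left(\sum_{k \in \mathcal{N}_i} \alpha_{i,k} \mathbf{W}_c \mathbf{X}_k\right) \tag{5} $$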
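A single-head graph attention layer of the kind described above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function name `graph_attention`, the row-vector convention (`X @ W_c`), the Tanh activation on both the raw scores and the output, and the assumption that every node has at least one neighbor are all choices made for this sketch.

```python
import numpy as np

def graph_attention(X, A, W_c, a):
    """Single-head graph attention over the neighbors given by A.

    X: (N, F) features, A: (N, N) 0/1 adjacency,
    W_c: (F, D) shared projection, a: (2 * D,) attention vector.
    Assumes every row of A contains at least one neighbor.
    """
    H = X @ W_c                                   # projected features, (N, D)
    D = H.shape[1]
    # a^T [h_i || h_j] splits into a part for node i and a part for node j.
    scores = (H @ a[:D])[:, None] + (H @ a[D:])[None, :]
    scores = np.tanh(scores)                      # activation on raw scores
    # Mask non-neighbors so the softmax runs over each neighborhood only.
    scores = np.where(A > 0, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    alpha = weights / weights.sum(axis=1, keepdims=True)
    # Weighted sum of projected neighbor features, then the activation.
    return np.tanh(alpha @ H), alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]])
W_c = rng.normal(size=(3, 2))
a = rng.normal(size=4)
Z_E, alpha = graph_attention(X, A, W_c, a)
```

Each row of `alpha` sums to one over the neighbors of the corresponding node, so the output embedding is a convex combination of projected neighbor features passed through the activation.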
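Given the learned embeddings $\mathbf{Z}^{V}$ from the feature encoder and $\mathbf{Z}^{E}$ from the graph encoder, a fusion module fuses the embeddings from the heterogeneous data sources into a shared latent space, followed by a fully connected layer that produces the final sample embedding $\mathbf{Z}^{H}$:
$$ \mathbf{Z}^{f} = \mathrm{Fusion}(\mathbf{Z}^{V}, \mathbf{Z}^{E}) = \mathbf{Z}^{V} \oplus \mathbf{Z}^{E} \tag{6} $$
$$ \mathbf{Z}^{H} = \mathbf{Z}^{f} \mathbf{W} + \mathbf{b} \tag{7} $$
where $\mathbf{W}$ and $\mathbf{b}$ are the trainable weight matrix and bias, and $\oplus$ indicates the element-wise addition of two matrices.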
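The feature-encoder MLP and the fusion step can be sketched together in NumPy. The helper names `mlp_forward` and `fuse_and_project` are hypothetical, the layer sizes follow the KDD99 row of the architecture table (FC(120,64), FC(64,32), FC(32,8)), the weights are random placeholders, and the graph-encoder output is replaced by a random stand-in so that the example stays self-contained.

```python
import numpy as np

def mlp_forward(X, layers):
    """Apply a stack of fully connected layers with Tanh activations."""
    Z = X
    for W, b in layers:
        Z = np.tanh(Z @ W + b)
    return Z

def fuse_and_project(Z_V, Z_E, W, b):
    """Element-wise sum of the two embeddings, then one linear layer."""
    if Z_V.shape != Z_E.shape:
        raise ValueError("both embeddings must share the same shape")
    return (Z_V + Z_E) @ W + b

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 120))                     # 5 samples, 120 features
layers = [
    (rng.normal(scale=0.1, size=(120, 64)), np.zeros(64)),
    (rng.normal(scale=0.1, size=(64, 32)), np.zeros(32)),
]
Z_V = mlp_forward(X, layers)                      # feature embedding, (5, 32)
Z_E = np.tanh(rng.normal(size=(5, 32)))           # stand-in for the graph embedding
Z_H = fuse_and_project(Z_V, Z_E,
                       rng.normal(scale=0.1, size=(32, 8)), np.zeros(8))
```

Element-wise addition requires both encoders to emit embeddings of the same width, which is why the sketch checks the shapes before fusing.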
3.3 Feature Decoder
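The feature decoder aims to reconstruct the sample features $\hat{\mathbf{X}}$ from the latent embedding $\mathbf{Z}^{H}$:
$$ \mathbf{Z}^{D(l_D)} = \sigma\left(\mathbf{Z}^{D(l_D - 1)} \mathbf{W}^{D(l_D)} + \mathbf{b}^{D(l_D)}\right) \tag{8} $$
where $\mathbf{Z}^{D(l_D - 1)}$, $\mathbf{Z}^{D(l_D)}$, $\mathbf{W}^{D(l_D)}$, and $\mathbf{b}^{D(l_D)}$ are the input, the output, the trainable weight matrix, and the bias of the $l_D$-th layer of the decoder, respectively, $l_D \in \{1, \dots, L_D\}$, and $\mathbf{Z}^{D(0)} = \mathbf{Z}^{H}$ is the initial input of the decoder. Finally, the reconstruction $\hat{\mathbf{X}}$ is obtained from the last layer of the decoder:
$$ \hat{\mathbf{X}} = \mathbf{Z}^{D(L_D)} \tag{9} $$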
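3.4 Estimation Network
To estimate the density of the input samples, a Gaussian Mixture Model (GMM) is applied in CADGMM over the learned latent embedding. Inspired by DAGMM [23], a sub-network consisting of several fully connected layers takes a low-dimensional embedding that preserves the reconstruction error as input and estimates the mixture membership of each sample. This embedding $\mathbf{Z}$ is obtained as follows:
$$ \mathbf{Z} = \left[\mathbf{Z}^{H} \,\|\, \mathbf{Z}^{R}\right], \qquad \mathbf{Z}^{R} = \mathrm{Dist}(\mathbf{X}, \hat{\mathbf{X}}) \tag{10} $$
where $\mathbf{Z}^{R}$ is the reconstruction error embedding and $\mathrm{Dist}(\cdot)$ denotes a distance metric such as the Euclidean distance or the cosine distance. Given the final embedding $\mathbf{Z}$ as input, the estimation network conducts membership prediction as follows:
$$ \mathbf{Z}^{M(l_M)} = \sigma\left(\mathbf{Z}^{M(l_M - 1)} \mathbf{W}^{M(l_M)} + \mathbf{b}^{M(l_M)}\right) \tag{11} $$
where $\mathbf{Z}^{M(l_M - 1)}$, $\mathbf{Z}^{M(l_M)}$, $\mathbf{W}^{M(l_M)}$, and $\mathbf{b}^{M(l_M)}$ are the input, the output, the trainable weight matrix, and the bias of the $l_M$-th layer of the estimation network, respectively, $\mathbf{Z}^{M(0)} = \mathbf{Z}$, $l_M \in \{1, \dots, L_M\}$, and the mixture-component membership is calculated by:
$$ \mathcal{M} = \mathrm{Softmax}\left(\mathbf{Z}^{M(L_M)}\right) \tag{12} $$
where $\mathcal{M} \in \mathbb{R}^{N \times M}$ is the predicted membership of $M$ mixture components for $N$ samples. With the predicted sample membership, the parameters of the GMM can be calculated to facilitate the evaluation of the energy (negative log-likelihood) of the input samples:
$$ \mu_m = \frac{\sum_{i=1}^{N} \mathcal{M}_{i,m} \mathbf{Z}_i}{\sum_{i=1}^{N} \mathcal{M}_{i,m}}, \qquad \Sigma_m = \frac{\sum_{i=1}^{N} \mathcal{M}_{i,m} (\mathbf{Z}_i - \mu_m)(\mathbf{Z}_i - \mu_m)^{\top}}{\sum_{i=1}^{N} \mathcal{M}_{i,m}} \tag{13} $$
where $\mu_m$ and $\Sigma_m$ are the mean and covariance of the $m$-th component distribution, respectively. The energy $E_{\mathbf{Z}}$ of a sample with embedding $\mathbf{Z}$ is as follows:
$$ E_{\mathbf{Z}} = -\log\left( \sum_{m=1}^{M} \sum_{i=1}^{N} \frac{\mathcal{M}_{i,m}}{N} \, \frac{\exp\left(-\frac{1}{2} (\mathbf{Z} - \mu_m)^{\top} \Sigma_m^{-1} (\mathbf{Z} - \mu_m)\right)}{\sqrt{|2\pi \Sigma_m|}} \right) \tag{14} $$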
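The GMM parameter estimation and the sample energy can be sketched in NumPy as below. This follows the standard DAGMM-style formulation that the section cites; the function names, the small diagonal term `eps` added for invertibility, and the log-sum-exp evaluation are assumptions of this sketch, and the hard one-hot memberships in the example merely stand in for the softmax output of the estimation network.

```python
import numpy as np

def gmm_parameters(Z, gamma):
    """Z: (N, D) embeddings; gamma: (N, M) soft memberships (rows sum to 1)."""
    Nm = gamma.sum(axis=0)                        # effective count per component
    phi = Nm / Z.shape[0]                         # mixing weights, (M,)
    mu = (gamma.T @ Z) / Nm[:, None]              # component means, (M, D)
    diff = Z[:, None, :] - mu[None, :, :]         # (N, M, D)
    cov = np.einsum("nm,nmd,nme->mde", gamma, diff, diff) / Nm[:, None, None]
    return phi, mu, cov

def sample_energy(Z, phi, mu, cov, eps=1e-6):
    """Negative log-likelihood of each row of Z under the mixture."""
    D = Z.shape[1]
    cov = cov + eps * np.eye(D)                   # keep covariances invertible
    diff = Z[:, None, :] - mu[None, :, :]         # (N, M, D)
    cov_inv = np.linalg.inv(cov)                  # (M, D, D)
    maha = np.einsum("nmd,mde,nme->nm", diff, cov_inv, diff)
    _, logdet = np.linalg.slogdet(2.0 * np.pi * cov)
    log_prob = np.log(phi)[None, :] - 0.5 * maha - 0.5 * logdet[None, :]
    # Log-sum-exp over components for numerical stability.
    top = log_prob.max(axis=1, keepdims=True)
    return -(top[:, 0] + np.log(np.exp(log_prob - top).sum(axis=1)))

rng = np.random.default_rng(0)
Z = np.concatenate([rng.normal(0.0, 0.5, size=(50, 2)),
                    rng.normal(5.0, 0.5, size=(50, 2))])
gamma = np.zeros((100, 2))
gamma[:50, 0] = 1.0                               # first cluster -> component 0
gamma[50:, 1] = 1.0                               # second cluster -> component 1
phi, mu, cov = gmm_parameters(Z, gamma)
train_energy = sample_energy(Z, phi, mu, cov)
query_energy = sample_energy(np.array([[20.0, -20.0]]), phi, mu, cov)
```

A point far from both components, such as the query above, receives a much higher energy than the points used to fit the mixture, which is the property the anomaly score relies on.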
3.5 Loss Function and Anomaly Score
The training objective of CADGMM is defined as follows:
$$ \mathcal{L} = \|\mathbf{X} - \hat{\mathbf{X}}\|_2^2 + \lambda_1 E_{\mathbf{Z}} + \lambda_2 \sum_{m=1}^{M} \sum_{d} \frac{1}{(\Sigma_m)_{dd}} + \lambda_3 \|\mathbf{Z}\|_2^2 \tag{15} $$
where the first term is the reconstruction error used for feature reconstruction; the second is the sample energy, which aims to maximize the likelihood of the observed samples; the third is the covariance penalization, which addresses the singularity problem of the GMM as in [23] by penalizing small values on the diagonal entries of the covariance matrices; and the last is the embedding penalization, which serves as a regularizer that keeps the magnitude of normal samples as small as possible in the latent space, so as to separate the normal samples from the abnormal ones. $\lambda_1$, $\lambda_2$, and $\lambda_3$ are three parameters that control the trade-off between the different terms.
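The four terms of the objective can be combined as in the following NumPy sketch. The function name `cadgmm_style_loss` is hypothetical, and averaging the reconstruction, energy, and embedding terms over the batch is an assumption made here; the text does not state how the terms are reduced over samples.

```python
import numpy as np

def cadgmm_style_loss(X, X_hat, Z, energy, cov, lam1, lam2, lam3):
    """Weighted sum of the four terms described in the text.

    X, X_hat: (N, F) inputs and reconstructions; Z: (N, D) embeddings;
    energy: (N,) per-sample energies; cov: (M, D, D) component covariances.
    Batch averaging is an assumption of this sketch.
    """
    recon = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    energy_term = np.mean(energy)
    # Penalize small diagonal entries of every component covariance.
    cov_penalty = np.sum(1.0 / np.diagonal(cov, axis1=1, axis2=2))
    embed_penalty = np.mean(np.sum(Z ** 2, axis=1))
    return recon + lam1 * energy_term + lam2 * cov_penalty + lam3 * embed_penalty

X = np.zeros((2, 3))
X_hat = np.ones((2, 3))
Z = np.array([[1.0, 0.0], [0.0, 2.0]])
energy = np.array([1.0, 3.0])
cov = np.stack([2.0 * np.eye(2), 2.0 * np.eye(2)])
loss = cadgmm_style_loss(X, X_hat, Z, energy, cov,
                         lam1=0.1, lam2=0.005, lam3=10.0)
```

With these toy inputs the individual terms are 3.0, 2.0, 2.0, and 2.5, so the weights 0.1, 0.005, and 10 give a total of 28.21.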
The anomaly score is the sample energy $E_{\mathbf{Z}}$. Based on the measured anomaly scores, the threshold $\lambda$ in Eq. 1 can be determined according to the distribution of the scores, e.g., the samples with the top-$k$ scores are classified as anomalous.
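Choosing the threshold from the score distribution can be sketched as follows. The helper `label_by_top_ratio` is a hypothetical name, and deriving k from an expected anomaly ratio is one possible way to pick the top-k samples, stated here as an assumption.

```python
import numpy as np

def label_by_top_ratio(scores, ratio):
    """Mark the highest-scoring fraction of samples as anomalous (label 1)."""
    scores = np.asarray(scores, dtype=float)
    k = int(round(ratio * scores.size))           # number of samples to flag
    labels = np.zeros(scores.size, dtype=np.int64)
    if k > 0:
        top = np.argsort(scores)[-k:]             # indices of the k largest scores
        labels[top] = 1
    return labels

scores = [0.1, 5.0, 0.3, 0.2, 4.0]
labels = label_by_top_ratio(scores, ratio=0.4)
```

Here two of the five samples are flagged, namely those with scores 5.0 and 4.0.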
4 Experiments
In this section, we will describe the experimental details including datasets, baseline methods, and parameter settings, respectively.

Table 1: Statistics of the datasets.

| Dataset | # Dimensions | # Instances | Anomaly ratio |
|---|---|---|---|
| KDD99 | 120 | 494,021 | 0.2 |
| Arrhythmia | 274 | 452 | 0.15 |
| Satellite | 36 | 6,435 | 0.32 |

4.1 Dataset
Three benchmark datasets are used in this paper to evaluate the proposed method, including KDD99, Arrhythmia, and Satellite. The statistics of datasets are shown in Table 1.

KDD99: The KDD99 10 percent dataset [2] contains 494,021 samples with 41-dimensional features, of which 34 are continuous and 7 are categorical. One-hot encoding is applied to the categorical features, resulting in a 120-dimensional feature vector for each sample.

Arrhythmia: The Arrhythmia dataset [2] contains 452 samples with 274-dimensional features. We combine the smallest classes (3, 4, 5, 7, 8, 9, 14, and 15) to form the outlier class, and the remaining classes form the inlier class.

Satellite: The Satellite dataset [2] has 6,435 samples with 36-dimensional features. The three smallest classes (2, 4, and 5) are combined to form the outlier class, and the remaining classes form the inlier class.
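The one-hot step mentioned for KDD99 can be illustrated with a small NumPy sketch. The helper `one_hot_column` is hypothetical, and the toy values below (a protocol-like categorical column and one continuous column) are made up for illustration only.

```python
import numpy as np

def one_hot_column(values):
    """Encode one categorical column as a 0/1 matrix, one column per category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)))
    for row, value in enumerate(values):
        encoded[row, index[value]] = 1.0
    return encoded, categories

protocol = ["tcp", "udp", "icmp", "tcp"]          # toy categorical feature
duration = np.array([[0.5], [1.2], [0.0], [3.1]]) # toy continuous feature
encoded, categories = one_hot_column(protocol)
# Continuous columns are kept as they are and the encoded columns are appended.
features = np.hstack([duration, encoded])
```

A categorical feature with c distinct values contributes c columns, which is how 41 raw features can expand to a wider encoded vector.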
4.2 Baseline Methods

One-Class Support Vector Machine (OC-SVM) [4] is a classic kernel method for anomaly detection, which learns a decision boundary between the inliers and outliers.

Isolation Forest (IF) [8] conducts anomaly detection by building trees using randomly selected split values across sample features, and defines the anomaly score as the average path length from a specific sample to the root.

Deep Structured Energy Based Models (DSEBM) [21] is a deep energy-based model that accumulates the energy across the layers. Two variants are used in [21]: DSEBM-r takes the reconstruction error as the anomaly score, and DSEBM-e takes the energy.

Deep Autoencoding Gaussian Mixture Model (DAGMM) [23] is an autoencoder-based method for anomaly detection, which consists of a compression network for dimension reduction and an estimation network that performs density estimation under the Gaussian Mixture Model.

ALAD [20] is based on bi-directional GANs. It derives adversarially learned features and uses reconstruction errors based on these features to determine whether a data sample is anomalous.
4.3 Parameter Settings
The parameter settings used in the experiments for the different datasets are as follows:

KDD99: CADGMM is trained for 300 iterations. Graph construction uses $N$ = 1024 samples, which is also the batch size for training, with $K$ = 15. The remaining parameters are $M$ = 4, $\lambda_1$ = 0.1, $\lambda_2$ = 0.005, and $\lambda_3$ = 10.

Arrhythmia: CADGMM is trained for 20,000 iterations. Graph construction uses $N$ = 128 samples, which is also the batch size for training, with $K$ = 5. The remaining parameters are $M$ = 2, $\lambda_1$ = 0.1, $\lambda_2$ = 0.005, and $\lambda_3$ = 0.001.

Satellite: CADGMM is trained for 3,000 iterations. Graph construction uses $N$ = 512 samples with $K$ = 13. The remaining parameters are $M$ = 4, $\lambda_1$ = 0.1, $\lambda_2$ = 0.005, and $\lambda_3$ = 0.005.

The architecture details of CADGMM on the different datasets are shown in Table 2, in which FC($a$, $b$) denotes a fully connected layer with $a$ input neurons and $b$ output neurons. Similarly, GAT($a$, $b$) denotes a graph attention layer with $a$-dimensional input and $b$-dimensional output. The activation function is Tanh for all datasets. For the baseline methods, we set the parameters by grid search. We independently run each experiment 10 times and report the mean values as the final results.

Table 2: Architecture details of CADGMM on the different datasets.

| Dataset | Dual-Enc.: Feature Trans. | Dual-Enc.: Graph Attn. | Dual-Enc.: MLP | Feature Dec. | Estimate Net. |
|---|---|---|---|---|---|
| KDD99 | FC(120,64), FC(64,32) | GAT(120,32) | FC(32,8) | FC(8,32), FC(32,64), FC(64,120) | FC(10,20), FC(20,8), FC(8,4) |
| Arrhythmia | FC(274,32) | GAT(274,32) | FC(32,2) | FC(2,10), FC(10,274) | FC(4,10), FC(10,2) |
| Satellite | FC(36,16) | GAT(36,16) | FC(16,2) | FC(2,16), FC(16,36) | FC(4,10), FC(10,4) |

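Table 3: Anomaly detection results (Precision, Recall, and F1) on the three datasets.

| Method | KDD99 Precision | KDD99 Recall | KDD99 F1 | Arrhythmia Precision | Arrhythmia Recall | Arrhythmia F1 | Satellite Precision | Satellite Recall | Satellite F1 |
|---|---|---|---|---|---|---|---|---|---|
| OC-SVM [4] | 74.57 | 85.23 | 79.54 | 53.97 | 40.82 | 45.81 | 52.42 | 59.99 | 61.07 |
| IF [8] | 92.16 | 93.73 | 92.94 | 51.47 | 54.69 | 53.03 | 60.81 | 94.89 | 75.40 |
| DSEBM-r [21] | 85.21 | 64.72 | 73.28 | 15.15 | 15.13 | 15.10 | 67.84 | 68.61 | 68.22 |
| DSEBM-e [21] | 86.19 | 64.66 | 73.99 | 46.67 | 45.65 | 46.01 | 67.79 | 68.56 | 68.18 |
| DAGMM [23] | 92.97 | 94.42 | 93.69 | 49.09 | 50.78 | 49.83 | 80.77 | 81.6 | 81.19 |
| AnoGAN [13] | 87.86 | 82.97 | 88.65 | 41.18 | 43.75 | 42.42 | 71.19 | 72.03 | 71.59 |
| ALAD [20] | 94.27 | 95.77 | 95.01 | 50 | 53.13 | 51.52 | 79.41 | 80.32 | 79.85 |
| CADGMM | 96.01 | 97.53 | 96.71 | 56.41 | 57.89 | 57.14 | 81.99 | 82.75 | 82.37 |
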
5 Results and Analysis
In this section, we demonstrate the effectiveness of the proposed method by presenting its results on the anomaly detection task and comparing it with state-of-the-art methods.
5.1 Anomaly Detection
As in previous literature [21, 23, 20], Precision, Recall, and F1 score are employed as the evaluation metrics; higher values indicate better performance. Samples with high energy are classified as abnormal, and the threshold is determined based on the ratio of anomalies in the dataset. Following the settings in [21, 23], the data are split into training and test sets at a ratio of 1:1, and only normal samples are used for training the model.

The experimental results in Table 3 show that the proposed CADGMM outperforms all baselines in F1 score on all three datasets. CADGMM performs much better than traditional anomaly detection methods such as OC-SVM and IF, which suffer from limited feature learning capability or the curse of dimensionality. Moreover, CADGMM also outperforms the other deep learning-based methods, namely DSEBM, DAGMM, AnoGAN, and ALAD, which suggests that the additional correlation among data samples facilitates feature learning for anomaly detection. On small datasets such as Arrhythmia, traditional methods such as IF are competitive with conventional deep learning-based methods such as DSEBM, DAGMM, AnoGAN, and ALAD. A possible reason is that the lack of sufficient training data hurts the data-hungry deep learning-based methods, whereas CADGMM makes better use of the limited data by considering the correlation among data samples.
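For reference, the three evaluation metrics can be computed from binary labels as in the plain-Python sketch below, with the anomalous class (label 1) treated as the positive class. The function name and the toy labels are illustrative only.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 with label 1 (anomalous) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

In this toy case there are two true positives, one false positive, and one false negative, so all three metrics equal 2/3.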
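Table 4: Anomaly detection results with different ratios of anomalies injected into the training data.

| Ratio | CADGMM Precision | CADGMM Recall | CADGMM F1 | DAGMM Precision | DAGMM Recall | DAGMM F1 | OC-SVM Precision | OC-SVM Recall | OC-SVM F1 |
|---|---|---|---|---|---|---|---|---|---|
| 1% | 95.53 | 97.04 | 96.28 | 92.01 | 93.37 | 92.68 | 71.29 | 67.85 | 69.53 |
| 2% | 95.32 | 96.82 | 96.06 | 91.86 | 93.40 | 92.62 | 66.68 | 52.07 | 58.47 |
| 3% | 94.83 | 96.33 | 95.58 | 91.32 | 92.72 | 92.01 | 63.93 | 44.70 | 52.61 |
| 4% | 94.62 | 96.12 | 95.36 | 88.37 | 89.89 | 89.12 | 59.91 | 37.19 | 45.89 |
| 5% | 94.35 | 96.04 | 95.3 | 85.04 | 86.43 | 85.73 | 11.55 | 33.69 | 17.20 |
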
5.2 Impact of noise data
In this section, we study the impact of noisy data on the training of CADGMM. Specifically, 50% of the data samples, selected at random, are used for testing, while the remaining 50%, combined with 1% to 5% anomalies, are used for training.

As shown in Table 4, as the amount of noisy data increases, the performance of all baselines degrades significantly. OC-SVM is affected the most and tends to be more sensitive to noisy data because of its poor feature learning ability on high-dimensional data. In contrast, CADGMM remains stable across the different noise ratios and still achieves state-of-the-art performance even when 5% anomalies are injected into the training data, which demonstrates the robustness of the proposed method.
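The contamination protocol can be sketched as follows. The helper `contaminate_training_set` is a hypothetical name, and measuring the ratio relative to the number of normal training samples is an assumption of this sketch; the text does not state which base the percentage refers to.

```python
import numpy as np

def contaminate_training_set(normal_train, anomalies, ratio, seed=0):
    """Inject a fraction of anomalous samples into a normal-only training set.

    ratio is taken relative to the number of normal training samples,
    which is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    n_inject = int(round(ratio * len(normal_train)))
    # Draw distinct anomalies so that no sample is injected twice.
    picked = rng.choice(len(anomalies), size=n_inject, replace=False)
    mixed = np.concatenate([normal_train, anomalies[picked]], axis=0)
    # Shuffle so that the injected samples are spread across batches.
    return mixed[rng.permutation(len(mixed))]

normal_train = np.zeros((200, 3))
anomalies = np.ones((50, 3))
train_5pct = contaminate_training_set(normal_train, anomalies, ratio=0.05)
```

With 200 normal samples and a 5% ratio, ten anomalies are injected, giving 210 training samples in total.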
5.3 Impact of K values
In this section, we evaluate the impact of different $K$ values used during graph construction on CADGMM.
More specifically, we conduct experiments on all three datasets by varying $K$ from 5 to 19; the experimental results are illustrated in Fig. 3. During training, the batch sizes are set to 1024, 128, and 512 for KDD99, Arrhythmia, and Satellite, respectively. The results show that changing $K$ causes only small fluctuations in performance on all datasets, which demonstrates that CADGMM is not very sensitive to the value of $K$ and is easy to use.
5.4 Embedding Visualization
To explore the quality of the learned embeddings, we compare the visualizations of the sample representations produced by different methods in Fig. 6. Specifically, we take the low-dimensional embeddings of samples learned by DAGMM and CADGMM as inputs to the t-SNE tool [9]. We randomly choose 40,000 data samples from the test set of KDD99 and visualize their embeddings in a two-dimensional space, in which blue corresponds to the normal class and orange to the abnormal class. CADGMM produces more compact and better-separated clusters than DAGMM, which also helps explain why our approach achieves better performance on the anomaly detection task.
6 Conclusion
In this paper, we study the problem of correlation-aware unsupervised anomaly detection, which considers the correlation among data samples in the feature space. To address this problem, we propose a method named CADGMM that models the complex correlation among data points to generate high-quality low-dimensional embeddings for anomaly detection. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed method.
Acknowledgement
This work was supported in part by National Natural Science Foundation of China (No. 61172168, 61972187).
References
[1] (2013) Enhancing one-class support vector machines for unsupervised anomaly detection. In SIGKDD, pp. 8–15.
[2] (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml.
[3] (2009) Anomaly detection: a survey. ACM Computing Surveys 41 (3). ISSN 0360-0300.
[4] (2001) One-class SVM for learning in image retrieval. In ICIP, pp. 34–37.
[5] (2003) Principal component analysis. Technometrics 45 (3), pp. 276.
[6] (2012) Robust kernel density estimation. Journal of Machine Learning Research 13 (Sep), pp. 2529–2565.
[7] (2003) Improving one-class SVM for anomaly detection. In Proceedings of the 2003 International Conference on Machine Learning and Cybernetics, Vol. 5, pp. 3077–3081.
[8] (2008) Isolation forest. In ICDM, pp. 413–422.
[9] (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
[10] (2012) Robust feature selection and robust PCA for internet traffic anomaly detection. In IEEE INFOCOM, pp. 1755–1763.
[11] (2006) Using an ensemble of one-class SVM classifiers to harden payload-based anomaly detection systems. In ICDM, Vol. 6, pp. 488–498.
[12] (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. ICLR.
[13] (2017) Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In IPMI, pp. 146–157.
[14] (2018) Real-world anomaly detection in surveillance videos. In CVPR, pp. 6479–6488.
[15] (2011) Fast anomaly detection for streaming data. In IJCAI.
[16] (2018) Graph attention networks.
[17] (2011) Group anomaly detection using flexible genre models. In NIPS, pp. 1071–1079.
[18] (2010) Robust PCA via outlier pursuit. In NIPS, pp. 2496–2504.
[19] (2018) CXNet-m1: anomaly detection on chest x-rays with image-based deep learning. IEEE Access 7, pp. 4466–4477.
[20] (2018) Adversarially learned anomaly detection. In ICDM, pp. 727–736.
[21] (2016) Deep structured energy based models for anomaly detection. In ICML, pp. 1100–1109.
[22] (2017) Anomaly detection with robust deep autoencoders. In SIGKDD, pp. 665–674.
[23] (2018) Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In ICLR.