Unsupervised anomaly detection aims to detect samples that deviate from the norm of the data when no annotation is available. This makes the problem challenging, as the only available information is the internal structure of the dataset. Graph-based methods in general exploit this internal structure and can therefore be used for unsupervised anomaly detection as well. The dataset can be represented by a graph in which each node corresponds to a sample and each edge describes a connection between two samples. In a weighted graph, the weight of each edge indicates the similarity between two samples. The weighted degree of a node is then the sum of the weights of the edges incident to that node. This measure can be considered an intuitive representation of normality: in a dataset, populated dense clusters are strong indicators of normality, and samples in these clusters usually have a high degree.
The above intuition is not evident from the graph degree formulation itself (the sum of the values in a row of the similarity matrix). Therefore, in this manuscript, we explicitly formulate graph degree from several points of view, so that the formulations clearly show that the measure is suitable as a normality score. Our contributions are as follows.
We provide a spectral graph clustering based analysis of graph degree and show how the analysis guides us to use fully connected graphs.
We provide a kernel mean feature based and a maximum mean discrepancy based analysis of graph degree for fully connected graphs and show how the analysis guides us to use universal kernels.
Adopting fully connected graphs with a particular choice of a universal kernel, we evaluate an anomaly detection method based on graph degree and show its higher performance on average over 10 datasets, compared to other unsupervised anomaly detection methods.
We show that the kernel analysis allows a parametric approach to deal with challenging anomaly cases and we provide an extensive analysis on the effect of the parameter on the method’s accuracy.
2 Unsupervised Anomaly Detection: A Brief Review
An extensive review of unsupervised anomaly detection methods can be found in [Goldstein and Uchida2016]. Here we follow the same categorization to briefly review unsupervised anomaly detection algorithms in two main categories: nearest neighbor ($k$-nn) based and clustering based methods.
For a sample $i$, represented by the feature $x_i$, the $k$-nn based anomaly detection [Angiulli and Pizzuti2002] defines the following as an anomaly measure.
$$a_i = \frac{1}{k} \sum_{j \in N_k(i)} \lVert x_i - x_j \rVert \qquad (1)$$
where $N_k(i)$ is the set of $k$ nearest neighbors of sample $i$. This simple measure is observed to accurately highlight global anomalies, i.e. anomalies that are far away from all normal classes. However, it is less accurate in detecting local anomalies, i.e. anomalies that are significantly closer to one particular normal cluster than to the others, yet are still anomalous with respect to that cluster. In order to address this drawback of the $k$-nn baseline, variants with local density measures have been introduced. These methods, LOF [Breunig et al.2000] and COF [Tang et al.2002], exploit the measure in Eq. 2. LOF uses a direct Euclidean neighborhood definition, whereas COF replaces the Euclidean distance with the shortest path distance.
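The global $k$-nn measure discussed above can be illustrated with a minimal sketch. It assumes the score is the average Euclidean distance to the $k$ nearest neighbors, which is one common form of this baseline; the function name, the toy data and the choice of $k$ are hypothetical and only serve the illustration.

```python
import numpy as np

def knn_anomaly_score(X, k):
    """Average Euclidean distance from each sample to its k nearest neighbors."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances, shape (n, n).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Exclude each sample from its own neighborhood.
    np.fill_diagonal(dist, np.inf)
    # Keep the k smallest distances per row and average them.
    nearest = np.sort(dist, axis=1)[:, :k]
    return nearest.mean(axis=1)

# Toy data: a tight cluster plus one far-away (global) anomaly.
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
knn_scores = knn_anomaly_score(X_demo, k=2)
print(knn_scores)
```

In this toy set the isolated point receives the largest score. A local anomaly sitting just outside a dense cluster would be scored much less distinctly, which is the limitation the local density variants try to address.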
For further robustness to samples that lie at the intersection of two normal clusters, INFLO [Jin et al.2006] enhances the neighborhood of a sample by including its reverse $k$-nn neighbors in addition to its $k$-nn neighbors. Here, the reverse $k$-nn of a sample are the samples that have it among their own $k$ nearest neighbors. Other research in this category includes assigning an anomaly probability instead of a raw measure (LoOP [Kriegel et al.2009]) and an automatic selection framework for the $k$ of $k$-nn (LOCI [Papadimitriou et al.2003], aLOCI [Papadimitriou et al.2003]).
Another category in unsupervised anomaly detection is clustering-based methods. In CBLOF [He et al.2003], the distance between a sample feature and its nearest cluster centroid, multiplied by the number of samples in the cluster, is treated as an anomaly measure. In order to handle anomalies that form small clusters of their own, for small clusters this distance is taken between the data sample and the closest large cluster centroid. There is also a variant of CBLOF without the scaling by the number of data samples in a cluster (uCBLOF [Goldstein and Uchida2016]). As an extension to CBLOF, LDCOF [Papadimitriou et al.2012] exploits the following ratio as an anomaly measure:
This is done in order to handle local anomalies as well. There is also another study, CMGOS [Goldstein2014], which follows LDCOF but uses the Mahalanobis distance instead of the Euclidean distance. Clustering based methods generally make use of k-means due to its linear complexity; however, this makes the methods very dependent on the choice of k.
3 Graph Degree as a Normality Score
In this section we analyze the graph degree from three points of view: (a) spectral graph clustering, (b) kernel feature mean and (c) maximum mean discrepancy.
3.1 A Spectral Graph Clustering Analysis
A basic spectral graph clustering method [Sarkar and Boyer1996] aims to find a cluster that maximizes the average intra-cluster similarity. Consider a cluster indicator vector $y$, whose entries take the value 1 if a sample belongs to the cluster and 0 otherwise. Then the average intra-cluster similarity for $y$ can be written as follows:
$$\frac{y^T W y}{y^T y} \qquad (4)$$
In Eq. 4, $W$ is a symmetric similarity (or affinity) matrix where $W_{ij}$ indicates the similarity between nodes $i$ and $j$. The numerator is the sum of the similarities of every sample pair in the cluster and the denominator is the total number of samples in that cluster.
Finding the optimal binary solution maximizing Eq. 4 is intractable, therefore one can relax the indicator vector $y$ to have real values. Then, the optimal solution is the eigenvector of the similarity matrix $W$ with the maximum eigenvalue, due to the Rayleigh quotient [Sarkar and Boyer1996]. Other solutions can also be obtained with the other eigenvectors $v_i$, where the eigenvalue $\lambda_i$ indicates the quality of the solution (the higher the better). This is straightforward as $\frac{v_i^T W v_i}{v_i^T v_i} = \lambda_i$.
From an anomaly detection point of view, the clusters that are populated and dense can be considered to correspond to normal clusters. Combining such clusters would highlight normal data and therefore suppress anomalous data. The clusters with high intra-cluster similarity are dense clusters. Since the eigenvectors $v_i$ of the similarity matrix are soft indicators of all clusters and since the eigenvalues $\lambda_i$ are measures of denseness, a straightforward combination would be as follows.
$$\sum_i \lambda_i v_i \qquad (5)$$
Although this combination might seem intuitive and it covers the denseness assumption (due to the weighting with $\lambda_i$), it is not clear whether the combination also highlights populated clusters. This is because the eigenvectors are only approximations to binary cluster indicators, and it has not been analyzed whether it makes sense to combine them blindly. Next, we provide an analysis of the eigenvector values and an extension to Eq. 5 in order to combine clusters by weighting both dense and populated clusters.
Assume that for a cluster $i$, the corresponding eigenvector $v_i$ takes the same value within the cluster and zero otherwise. Then this value will be $1/\sqrt{N_i}$, since the eigenvector is normalized. Note here that $N_i$ is the number of samples in cluster $i$. Then, the combination in Eq. 5 can be considered as a weighted sum of binary cluster indicators where the weight is $\lambda_i/\sqrt{N_i}$.
An intuitive combination of binary cluster indicators would include the term $N_i$ directly instead of its square root, since a populated cluster is a direct indicator of normality. Therefore, we multiply each term of Eq. 5 with $v_i^T \mathbf{1}$, where $\mathbf{1}$ is the all-ones vector. It can be easily seen that $v_i^T \mathbf{1} = \sqrt{N_i}$. Therefore, adding this multiplicative term to Eq. 5, we end up with Eq. 6.
$$\sum_i \lambda_i \left(v_i^T \mathbf{1}\right) v_i \qquad (6)$$
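Next, we show that the score in Eq. 6 is equivalent to the graph degree.
Since the eigenvectors $v_i$ are orthonormal due to the symmetry of the similarity matrix $W$, they form a basis, and any vector can be written as a linear combination of the eigenvectors where the weights are determined by the dot product of the vector with the eigenvectors. Thus, one can rewrite the all-ones vector $\mathbf{1}$ as a combination of eigenvectors as follows:
$$\mathbf{1} = \sum_i \left(v_i^T \mathbf{1}\right) v_i \qquad (8)$$
From the eigenvector relation $W v_i = \lambda_i v_i$, multiplying Eq. 8 with $W$ from the left gives the following.
$$d = W \mathbf{1} = \sum_i \left(v_i^T \mathbf{1}\right) W v_i = \sum_i \lambda_i \left(v_i^T \mathbf{1}\right) v_i \qquad (10)$$
Here $d$ is the vector of graph degrees, since the $j$-th entry of $W \mathbf{1}$ is the sum of the $j$-th row of $W$.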
Here, we have shown the equivalence of the graph degree and a spectral graph clustering based normality score. The normality score highlights populated and dense clusters. This analysis is a theoretical justification of the soundness of graph degree as a normality score from a clustering perspective.
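The equivalence between the graph degree and the eigenvalue-weighted combination of eigenvectors can be checked numerically. The sketch below is only an illustration under assumed settings: the random data, the RBF-style similarity and all variable names are hypothetical, and any symmetric similarity matrix would serve equally well.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))

# Fully connected symmetric similarity matrix (an assumed RBF-style similarity).
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq_dist / 2.0)

# Graph degree: the sum of each row of W.
ones = np.ones(len(W))
degree = W @ ones

# Eigen-decomposition of the symmetric matrix W (orthonormal eigenvectors as columns).
eigvals, eigvecs = np.linalg.eigh(W)

# Spectral score: sum over i of lambda_i * (v_i^T 1) * v_i.
proj = eigvecs.T @ ones
spectral_score = eigvecs @ (eigvals * proj)

print(np.allclose(degree, spectral_score))
```

The sign ambiguity of each eigenvector does not affect the result, because every eigenvector appears twice in its term of the sum.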
In order to fully exploit the assumptions in this analysis, one needs to have a fully connected graph. This is explained as follows. Constraining the connectivity of the graph by a rule forces some entries of the similarity matrix to be zero. This might incorrectly assign zero similarity to samples in the same cluster. This is not desired, since then the numerator of Eq. 4 would not exactly be the sum of all possible pair similarities within a cluster.
Thus, the spectral graph clustering based approach guides us to have a fully connected graph.
3.2 A Mean Kernel Feature Based Analysis
Here we investigate graph degree as a kernel based normality score and discuss this interpretation. It should be noted that the analysis applies to fully connected graphs which have been favored by our spectral graph clustering based analysis above.
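Let us assume that for the fully connected graph, the graph edge weights are defined by a kernel, i.e. $W_{ij} = k(x_i, x_j)$, where $k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$. Here $\phi(x_i)$ is the kernel feature of a data sample $x_i$. Then, one can rewrite Eq. 10 as follows.
$$d = W \mathbf{1} = \Phi^T \Phi \mathbf{1} \qquad (11)$$
In Eq. 11, $d$ is the vector of graph degrees, $\mathbf{1}$ is the all-ones vector and $\Phi$ is a matrix containing the kernel features of all samples as its columns. It can be easily realized that $\Phi \mathbf{1} = N \mu$, where $\mu = \frac{1}{N} \sum_i \phi(x_i)$ is the empirical estimation of the kernel feature mean and $N$ is the total number of samples in the dataset. The degree of a sample is therefore proportional to the inner product of its kernel feature with the kernel feature mean, $d_i = N \phi(x_i)^T \mu$.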
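The relation between the graph degree and the kernel feature mean can be illustrated with explicit, finite-dimensional kernel features. This is a sketch under an assumption: the RBF kernel has no finite feature map, so a random feature matrix (hypothetical, as are all names below) stands in for the kernel features of the samples.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, feat_dim = 6, 4

# Hypothetical explicit kernel features, one column per sample (feat_dim x n_samples).
Phi = rng.normal(size=(feat_dim, n_samples))

# Kernel (similarity) matrix built from inner products of kernel features.
W_kernel = Phi.T @ Phi

# Graph degree as the row sums of the kernel matrix.
degree_from_rows = W_kernel.sum(axis=1)

# Empirical kernel feature mean, and the degree written as N times the
# inner product of each sample's kernel feature with that mean.
mu = Phi.mean(axis=1)
degree_from_mean = n_samples * (Phi.T @ mu)

print(np.allclose(degree_from_rows, degree_from_mean))
```

Both expressions agree up to floating point error, which is the sense in which the degree measures the similarity of a sample to the mean in kernel space.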
For a probability distribution $p$ from which the data is sampled, the mean embedding of $p$ into the Hilbert space is given by the kernel feature mean $\mu_p$ [Gretton et al.2012]. The mapping is injective, i.e. each $p$ is mapped to a unique element $\mu_p$, for universal kernels including the RBF kernel [Gretton et al.2012].
Thus, for universal kernels, the kernel mean feature is a descriptor of the probability distribution from which the data is drawn, and no other distribution can be described by that mean feature. Hence, the similarity of a sample to the mean in kernel space is a sound normality score. We provide further analysis of this and its relation to maximum mean discrepancy in Section 3.3.
In the light of the above analysis, we consider a universal kernel, the RBF kernel, given in Eq. 12.
In Eq. 12, $x_i$ corresponds to the feature related to sample $i$. Then, defining the normality as the degree of the kernel matrix $K$ allows us to have a parametric approach, where the behavior of the method varies with the kernel parameter $\sigma$.
This is especially important, since a parametric approach helps to overcome some limitations of the method. By using the RBF kernel, a non-linear warping of the Euclidean space with the parameter $\sigma$ is possible, and thus with the right $\sigma$ some challenging anomaly types can be handled. One toy example is illustrated in Fig. 2. The normal classes are generated by 2 densely connected clusters, and there exists a local anomaly near the first cluster, indicated in red. The second normal cluster is slightly more loosely connected than the first normal cluster. Sample features are simply Cartesian coordinates. The figure illustrates two selections of $\sigma$. Next to each element, its rank is written, where the ranking is based on descending normality score. The radii of the balls that represent the elements are scaled according to their normality scores, so larger elements have larger normality scores. As can be observed, while one selection of $\sigma$ correctly pushes the local anomaly to the end of the ranking, the other assigns lower normality scores to some elements of the second normal cluster than to the abnormal sample. This toy example shows that an appropriate warping of the space helps to detect challenging anomaly types as well.
3.3 Relation to Maximum Mean Discrepancy
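In this section, we provide a supplementary analysis that also encourages using universal kernels. In particular, we show how the graph degree is related to a kernel based method for the two-sample problem [Gretton et al.2012]. The two-sample problem is defined as follows.
Let $p$ and $q$ be two distributions and $X = \{x_1, \dots, x_m\}$, $Y = \{y_1, \dots, y_n\}$ observations drawn from $p$ and $q$ respectively. Then the problem is to determine if $p = q$. A related measure of the similarity of two distributions is the maximum mean discrepancy ($\mathrm{MMD}$). Let $\mathcal{F}$ be a class of functions $f: \mathcal{X} \rightarrow \mathbb{R}$, then $\mathrm{MMD}$ and its empirical approximation are defined as follows.
$$\mathrm{MMD}[\mathcal{F}, p, q] = \sup_{f \in \mathcal{F}} \left( \mathbf{E}_{x \sim p}[f(x)] - \mathbf{E}_{y \sim q}[f(y)] \right), \qquad \mathrm{MMD}[\mathcal{F}, X, Y] = \sup_{f \in \mathcal{F}} \left( \frac{1}{m} \sum_{i=1}^{m} f(x_i) - \frac{1}{n} \sum_{j=1}^{n} f(y_j) \right) \qquad (13)$$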
Recently, a kernel method for the two-sample problem was proposed [Gretton et al.2012]. The main result of this paper is as follows. Let $\mathcal{F}$ be a unit ball in a reproducing kernel Hilbert space $\mathcal{H}$ defined on a metric space $\mathcal{X}$ with associated kernel $k$. Then,
An unbiased test for squared $\mathrm{MMD}$ is given as follows.
The corresponding empirical estimation can be obtained as follows.
Next, we investigate $\mathrm{MMD}$ for a special case where $Y = \{x_i\}$, i.e. a case where the dataset $Y$ consists of only one element and that element is also contained in the dataset $X$. Let us assume that we have an RBF kernel, thus $k(x, x) = 1$. Consider now that we form a graph out of $X$ and the affinity matrix is defined by $W$, where $W_{ij} = k(x_i, x_j)$. Let $d_i$ denote the degree of sample $x_i$ in this graph. Then we can rewrite the first term of the squared $\mathrm{MMD}$ in Eq. 15 as follows.
where $\bar{d}$ is the average degree of a sample in the dataset $X$.
The second term in Eq. 15 reduces to 1 since the dataset $Y$ contains only one sample, whereas the third term can be written as follows.
Note that $k(x_j, y) = k(x_j, x_i)$, since the only element $y$ of $Y$ is equal to $x_i$. Then, combining the terms, one can rewrite Eq. 15 for our special datasets as follows.
Based on the measure in Eq. 19, if we define an anomaly measure for a sample $x_i$ in a dataset $X$ as $\mathrm{MMD}^2[\mathcal{F}, X, Y]$ where $Y = \{x_i\}$, then the anomaly measure is inversely correlated with the graph degree. For a sorting based method, i.e. one in which anomalous samples are determined according to their positions when the data is sorted by the anomaly measure, the graph degree is then the only determining factor for the anomaly measure. This is due to the constant value of the first two terms of Eq. 19.
From a distribution discrepancy point of view, $\mathrm{MMD}^2$ corresponds to the discrepancy between the data generating probability distribution $p$, from which the data is sampled, and a distribution $q$ which is defined as a Dirac delta function at sample $x_i$, i.e. it is a deterministic process. In this sense, the similarity of a sample to the entire dataset is proportional to the inverted $\mathrm{MMD}^2$ and thus to the graph degree.
However, the inverted $\mathrm{MMD}^2$ can indicate similarity to other datasets as well, since the mean embeddings of two different probability distributions can be the same. The only way to avoid this is to select universal kernels, for which the mean embedding of a probability distribution is injective. Therefore, the $\mathrm{MMD}$ based analysis also suggests selecting universal kernels.
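The relation between the squared maximum mean discrepancy and the graph degree can be illustrated numerically. The sketch below makes two assumptions that are not taken from the text: it uses the biased empirical estimate of the squared discrepancy (so that the single-sample term is exactly 1), and it assumes the RBF form exp(-||a - b||^2 / (2 sigma^2)). The function names, the random data and the value of sigma are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """RBF kernel matrix between the rows of A and the rows of B (assumed form)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def biased_mmd2(X, Y, sigma):
    """Biased estimate of squared MMD: mean k(x,x') + mean k(y,y') - 2 mean k(x,y)."""
    return (rbf_kernel(X, X, sigma).mean()
            + rbf_kernel(Y, Y, sigma).mean()
            - 2.0 * rbf_kernel(X, Y, sigma).mean())

rng = np.random.default_rng(2)
X_data = rng.normal(size=(20, 2))
sigma = 1.0
n = len(X_data)

# Graph degree of every sample in the fully connected RBF graph.
K_full = rbf_kernel(X_data, X_data, sigma)
deg = K_full.sum(axis=1)

# Squared MMD between the dataset and each single-sample set {x_i}.
mmd2 = np.array([biased_mmd2(X_data, X_data[i:i + 1], sigma) for i in range(n)])

# Closed form for this special case: constant + 1 - (2 / N) * degree.
closed_form = K_full.mean() + 1.0 - 2.0 * deg / n

# Sorting by increasing degree matches sorting by decreasing squared MMD.
same_order = np.array_equal(np.argsort(deg), np.argsort(-mmd2))
print(np.allclose(mmd2, closed_form), same_order)
```

Only the last term of the closed form depends on the sample, so a ranking by this discrepancy is the reverse of a ranking by graph degree.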
4 Experimental Results
In this section, we evaluate graph degree based anomaly detection on widely used anomaly detection datasets and compare it to widely used unsupervised anomaly detection methods. For a fair comparison, we follow the datasets, methods and evaluation metrics of the benchmark paper [Goldstein and Uchida2016]. Next, we briefly describe these and our implementation details.
4.1 Implementation Details
We perform feature standardization as a preprocessing step, i.e. we make the mean of each feature zero and the standard deviation of each feature one across the dataset. Based on our spectral graph clustering analysis, we choose to have a fully connected graph, where each node is connected to every other node. Based on our kernel mean feature and maximum mean discrepancy analyses, we use the RBF kernel in Eq. 12, due to its universal property, to form the kernel matrix $K$ with an empirically selected $\sigma$. The effect of $\sigma$ on the performance will be analyzed in detail. In order to eliminate the effect of the data dimension, we always normalize the distance term in Eq. 12 with the data dimension. After the degree vector $d$ of $K$ is calculated as in Eq. 10, the anomaly score is simply obtained by inverting $d$.
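The steps described above can be sketched as follows. This is a minimal illustration, not the reference implementation: the exact RBF form exp(-||·||^2 / (2 sigma^2)), the reading of "inverting" as taking the reciprocal, the handling of constant features, the function name and the toy value of sigma are all assumptions made for the example.

```python
import numpy as np

def graph_degree_anomaly_score(X, sigma):
    """Anomaly score as the reciprocal of the degree in a fully connected RBF graph."""
    X = np.asarray(X, dtype=float)
    # Standardize each feature to zero mean and unit standard deviation.
    std = X.std(axis=0)
    std[std == 0] = 1.0  # assumption: leave constant features unscaled
    Z = (X - X.mean(axis=0)) / std
    # Squared Euclidean distances, normalized by the data dimension.
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1) / Z.shape[1]
    # Fully connected RBF kernel matrix (assumed form of the kernel).
    K = np.exp(-sq / (2.0 * sigma ** 2))
    # Graph degree is the row sum; the anomaly score is its reciprocal.
    degree = K.sum(axis=1)
    return 1.0 / degree

# Toy data: one dense cluster and a single distant sample (hypothetical values).
rng = np.random.default_rng(0)
cluster = 0.3 * rng.normal(size=(30, 2))
X_toy = np.vstack([cluster, [[6.0, 6.0]]])
scores = graph_degree_anomaly_score(X_toy, sigma=1.0)
most_anomalous = int(np.argmax(scores))
print(most_anomalous)
```

The sketch builds the full pairwise distance tensor in memory, which is convenient for small datasets; for large datasets the kernel matrix would have to be accumulated in blocks.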
4.2 Datasets
We use the Breast Cancer Wisconsin [O. L. Mangasarian1990], Pen-Based Recognition of Handwritten Text (Global and Local) [Bache and Lichman], Letter Recognition [Micenkova et al.2014], Speech Accent Data [Micenkova et al.2014], Landsat Satellite [Bache and Lichman], Thyroid Disease [Bache and Lichman], Statlog Shuttle [Bache and Lichman], Object Images ALOI [Geusebroek et al.2005] and KDD-Cup99 HTTP [Bache and Lichman] datasets. Due to limited space, we omit a detailed description of each dataset and refer the reader to [Goldstein and Uchida2016] for detailed information about the datasets.
4.3 Compared Methods
The compared methods include (a) $k$-nn based methods: $k$-nn [Angiulli and Pizzuti2002], $k$th-nn [Ramaswamy et al.2000], LOF [Breunig et al.2000], LOFUB [Goldstein and Uchida2016], COF [Tang et al.2002], INFLO [Jin et al.2006], LoOP [Kriegel et al.2009], LOCI [Papadimitriou et al.2003], aLOCI [Papadimitriou et al.2003]; (b) clustering based methods: CBLOF [He et al.2003], uCBLOF, LDCOF [Papadimitriou et al.2012], CMGOS-Red [Goldstein and Uchida2016], CMGOS-Reg [Goldstein and Uchida2016], CMGOS-MCD [Goldstein and Uchida2016]; and (c) other methods: HBOS [Goldstein et al.2012], rPCA [Kwitt and Hofmann2007], oc-SVM [Schölkopf et al.2001] and $\eta$-oc-SVM [Schölkopf et al.2001].
4.4 Evaluation Metrics
An anomaly detection method often generates an anomaly score, not a hard classification result. Therefore, a common evaluation strategy in anomaly detection is to threshold this anomaly score and form a receiver operating characteristic (ROC) curve, where each point gives the true positive rate and false positive rate of the anomaly detection result corresponding to one threshold. Then, the area under the ROC curve (AUC) is used to evaluate the anomaly detection method [Goldstein and Uchida2016].
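The evaluation strategy above can be sketched in a few lines. The function below is an illustrative implementation, assuming labels of 1 for anomalies and 0 for normal samples and higher scores for more anomalous samples; the function name and the toy labels and scores are hypothetical.

```python
import numpy as np

def roc_auc(labels, scores):
    """Area under the ROC curve obtained by thresholding the anomaly scores."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    n_pos = (labels == 1).sum()
    n_neg = (labels == 0).sum()
    # One candidate threshold per distinct score, from highest to lowest.
    thresholds = np.unique(scores)[::-1]
    tpr, fpr = [0.0], [0.0]
    for t in thresholds:
        flagged = scores >= t
        tpr.append((flagged & (labels == 1)).sum() / n_pos)
        fpr.append((flagged & (labels == 0)).sum() / n_neg)
    tpr, fpr = np.array(tpr), np.array(fpr)
    # Trapezoidal integration of the true positive rate over the false positive rate.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy example: two anomalies, one of which is ranked below a normal sample.
toy_labels = [0, 0, 0, 1, 1]
toy_scores = [0.1, 0.2, 0.8, 0.7, 0.9]
auc = roc_auc(toy_labels, toy_scores)
print(auc)
```

In the toy example, five of the six anomaly/normal pairs are ordered correctly, so the area is 5/6.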
We compare the AUC of the graph degree based anomaly detection (GDBA) with that of the other methods on the anomaly detection datasets. The results are given in Table 1. Note that the results denoted by NA correspond to methods that take too long to run (more than 12 hours) on that dataset. Table 1 illustrates that GDBA outperforms the other methods on average. However, GDBA does not show a clearly leading accuracy on every individual dataset. Still, GDBA is among the top three best performing methods in 5/10 datasets. As the datasets differ from each other in structure, the optimal nonlinear warping of the Euclidean space, i.e. the optimal parameter $\sigma$, can be different for each dataset.
In Fig. 3, we illustrate the effect of the parameter $\sigma$ on the accuracy by observing the average performance (AUC) of the method across all datasets while changing $\sigma$. We observe a somewhat robust performance over an interval of $\sigma$. The mean of the method's performance and its standard deviation (indicated with $\pm$) across this $\sigma$ interval are given in Table 2.
In the same table, we also report the best AUC obtained with a manually selected $\sigma$ for each dataset, under the algorithm name GDBAbest. Note that the best $\sigma$ is chosen by a grid search. We observe that the potential of the method is high, considering its leading performance on average by a good margin. Moreover, in 8 out of 10 datasets, the method is in the top three. This of course holds only if the hyper-parameter is selected optimally.
In our experiments, we could not find a direct correlation between $\sigma$ and data statistics such as feature dimensionality, number of samples, anomaly percentage, or number of normal or abnormal classes. However, the selection of $\sigma$ depends on the type of anomalies that the dataset contains. For example, the datasets with local anomalies require a small $\sigma$, due to the need to separate the local anomalies from the normal data such that they become global anomalies. Thyroid, aloi and speech are such datasets, and the optimal performance is obtained at a very low $\sigma$. For datasets where a single normal class is present and the anomalies are not local, $\sigma$ is best selected high (see the bcancer dataset). This is in order to cluster the normal data as much as possible. Since the anomalies are global, they will not be merged with the normal cluster even if $\sigma$ is high. However, in the pen-g dataset, although there is one large normal class, the anomalies are not so far away from the normal cluster, so a very large $\sigma$ does not help in this case and a middle value is optimal.
The optimal $\sigma$ is difficult to predict without using an annotated validation set. In a semi-supervised anomaly detection setting, where labels are available for a validation set, $\sigma$ can be optimized to maximize the performance on the validation set. However, this is not the case we are interested in here, as we aim to have an unsupervised method.
5 Conclusion
We have analyzed the graph degree measure as a normality score from a spectral graph clustering point of view and a mean kernel feature point of view, and have shown how it is related to maximum mean discrepancy. Our analyses verify the theoretical soundness of graph degree as a normality score; moreover, they guide us to use fully connected graphs with universal kernels. In our experiments, we have observed that a simple graph degree based anomaly detection with an empirical parameter selection outperforms all other methods on average in 10 datasets. We have also observed that optimal parameter selections would result in a much improved performance. We have investigated the performance dependency on this parameter and the optimal selection of the parameter per dataset. We have observed that the method is somewhat robust to $\sigma$ within an interval. Finally, the automatic selection of $\sigma$ remains an open problem, as we have found no direct correlation of $\sigma$ with any data statistics such as feature dimensionality, number of samples, or ratio of outliers.
- [Alon1986] N. Alon. Eigenvalues and expanders. Combinatorica, 6(2):83–96, 1986.
- [Angiulli and Pizzuti2002] F. Angiulli and C. Pizzuti. Fast outlier detection in high dimensional spaces. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 15–27, Berlin, Heidelberg, 2002. Springer.
- [Aytekin et al.2014] C. Aytekin, S. Kiranyaz, and M. Gabbouj. Automatic object segmentation by quantum cuts. In 22nd International Conference on Pattern Recognition (ICPR 2014), pages 112–117, 2014.
- [Boser et al.1992] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152, 1992.
- [Bache and Lichman] K. Bache and M. Lichman. UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed: 2018-01-03.
- [Breunig et al.2000] M.M. Breunig, H.P. Kriegel, R.T. Ng, and J. Sander. Lof: Identifying density-based local outliers. In Proceedings of the ACM International Conference on Management of Data, pages 93–104, Dallas, Texas, 2000. ACM Press.
- [Geusebroek et al.2005] J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders. The Amsterdam library of object images. Int J Comput Vision, 61(1):103–112, 2005.
- [Goldstein and Uchida2016] M. Goldstein and S. Uchida. A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLoS ONE, 11, 2016.
- [Goldstein et al.2012] M. Goldstein, A. Dengel, P. B. Gibbons, and C. Faloutsos. Histogram-based outlier score (hbos): A fast unsupervised anomaly detection algorithm. In KI-2012: Poster and Demo Track, pages 59–63, 2012.
- [Goldstein2014] M. Goldstein. Anomaly Detection in Large Datasets. PhD thesis, University of Kaiserslautern, 2014.
- [Gretton et al.2012] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.
- [He et al.2003] Z. He, X. Xu, and S. Deng. Discovering cluster-based local outliers. Pattern Recognition Letters, 24(9):1641–1650, 2003.
- [Jin et al.2006] W. Jin, A. K. Tung, J. Han, and W. Wang. Ranking outliers using symmetric neighborhood relationship. PAKDD, 6:577–593, 2006.
- [Kriegel et al.2009] H. P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Loop: Local outlier probabilities. In Proceeding of the 18th ACM Conference on Information and Knowledge Management (CIKM’09), pages 1649–1652, New York, NY, 2009. ACM Press.
- [Kwitt and Hofmann2007] R. Kwitt and U. Hofmann. Unsupervised anomaly detection in network traffic by means of robust pca. In Proceedings of the International Multi-Conference on Computing in the Global Information Technology (ICCGI’07), page 37, Washington, DC, 2007. IEEE Computer Society Press.
- [Micenkova et al.2014] B. Micenkova, B. McWilliams, and I. Assent. Learning outlier ensembles: The best of both worlds – supervised and unsupervised. In Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description under Data Diversity (ODD2), pages 51–54, New York, NY, 2014.
- [O. L. Mangasarian1990] O. L. Mangasarian, W. N. Street, and W. H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. SIAM News, 23(5):1–18, 1990.
- [Papadimitriou et al.2003] S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. Loci: Fast outlier detection using the local correlation integral. In Proceedings of the 19th International Conference on Data Engineering, pages 315–326, Los Alamitos, CA, 2003. IEEE Computer Society Press.
- [Papadimitriou et al.2012] S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. Nearest-neighbor and clustering based anomaly detection algorithms for RapidMiner. In Proceedings of the 3rd RapidMiner Community Meeting and Conference (RCOMM 2012), pages 1–12, 2012.
- [Ramaswamy et al.2000] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the ACM International Conference on Management of Data (SIGMOD’00), pages 1207–1216, New York, NY, 2000. ACM Press.
- [Sarkar and Boyer1996] S. Sarkar and K. L. Boyer. Quantitative measures of change based on feature organization: Eigenvalues and eigenvectors. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’96), pages 478–483, 1996.
- [Schölkopf and Smola2002] B. Schölkopf and A. J. Smola, editors. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002.
- [Schölkopf et al.2001] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural computation, 13(7):1443–1471, 2001.
- [Shi and Malik2000] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence, 22(8):888–905, 2000.
- [Tang et al.2002] J. Tang, Z. Chen, A. Fu, and D. Cheung. Enhancing effectiveness of outlier detections for low density patterns. In Advances in Knowledge Discovery and Data Mining, pages 535–548, 2002.