Anomaly detection has practical importance in a variety of applications such as predictive maintenance, intrusion detection in electronic systems (Patcha and Park, 2007; Jyothsna et al., 2011), fault detection in industrial systems (Wise et al., 1999), and medical diagnosis (Tarassenko et al., 1995; Quinn and Williams, 2007; Clifton et al., 2011). Predictive maintenance setups usually assume that the normal class of data points is well sampled in the training data whereas the anomaly class is rare and underrepresented. This assumption is relevant because large critical systems usually produce abundant data for normal activities, but it is the anomalous behaviors (which are scarce and evolving) that can be used to proactively forecast imminent failures. Thus, the challenge in anomaly detection is to be able to identify new types of anomalies in the test data that are rare or unseen in the available training data.
Local outlier factor (Breunig et al., 2000) is one of the common methodologies used for anomaly detection, which has seen many recent applications including credit card fraud detection (Chen et al., 2007), system intrusion detection (Alshawabkeh et al., 2010), out-of-control detection in freight logistics (Ning and Tsung, 2012), and battery defect diagnosis (Zhao et al., 2017). LOF computes an anomaly score by using the local density of each sample point with respect to the points in its surrounding neighborhood. The local density is inversely correlated with the average distance from a point to its nearest neighbors. The anomaly score in LOF is known as the local outlier factor score; its denominator is the local density of a sample point and its numerator is the average local density of the nearest neighbors of that sample point. LOF assumes that anomalies are more isolated than normal data points such that anomalies have a lower local density, or equivalently, a higher local outlier factor score. LOF uses two hyperparameters: neighborhood size and contamination. The contamination determines the proportion of the most isolated points (points that have the highest local outlier factor scores) to be predicted as anomalies. Figure 1 presents a simple example of LOF, where we set neighborhood size to be 2 and contamination to be 0.25. Since A is the most isolated point in terms of finding the two nearest neighbors among the four points, the LOF method predicts it as an anomaly.
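The score computation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the experiments: it uses a brute-force neighbor search, takes exactly the requested number of nearest neighbors (ties at the neighborhood boundary are not expanded, unlike in Breunig et al., 2000), and assumes that no two points coincide. The coordinates chosen for the four points A, B, C, and D are hypothetical; the actual coordinates in Figure 1 are not given in the text.

```python
import math


def lof_scores(points, k):
    """Return the local outlier factor of every point (brute-force search)."""
    n = len(points)
    # Pairwise Euclidean distances.
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k nearest neighbors of each point (excluding itself) and its k-distance.
    neighbors, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_dist.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance
    # from a point to its neighbors.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: average neighbor density (numerator) over the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


# Hypothetical coordinates for the four points A, B, C, D.
points = [(4.0, 4.0), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
scores = lof_scores(points, k=2)

# Contamination 0.25 on four points flags the single highest-scoring point.
n_anomalies = math.floor(0.25 * len(points))
ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
predicted = ranked[:n_anomalies]  # index 0, i.e. point A
```

With these coordinates, A receives a score well above 1 while the other three points score close to 1, so A is the only point flagged at a contamination of 0.25.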
In their original LOF paper, Breunig et al. (2000) proposed some guidelines for determining a range for the neighborhood size.
In principle, the number of neighbors should be lower-bounded by the minimum number of points in a cluster
and upper-bounded by the maximum number of nearest points that can potentially be anomalies.
However, such information is generally not available. Even if such information is available, the optimal
neighborhood size between the lower bound and upper bound is still undefined.
A second hyperparameter in the LOF algorithm is the contamination, which specifies the proportion
of data points in the training set to be predicted as anomalies.
The contamination has to be strictly positive in order to form the decision boundaries in LOF.
In an extreme but not uncommon setting of anomaly detection, there can be zero anomalies in the training data.
In this case, an arbitrary, small threshold has to be chosen for the contamination.
These two hyperparameters are critical to the predictive performance in LOF; however, to the best of our knowledge, no literature has yet focused on tuning both contamination and neighborhood size in LOF for anomaly detection. Since the type and proportion of the anomaly class can be very different between training and testing, the standard K-fold cross-validation classification error (or accuracy) does not apply in this setting. Therefore, in this paper we propose a novel, heuristic strategy for jointly tuning the hyperparameters in LOF for anomaly detection, and we evaluate this strategy’s performance on both moderate and large data sets in various settings. In addition, we compare the empirical results on real data sets with other benchmark anomaly detection methods, including one-class SVM (Schölkopf et al., 2001) and isolation forest (Liu et al., 2008).
2 Related Work
There have been many variants of LOF in recent years. Local correlation integral (LOCI), proposed by Papadimitriou et al. (2003), provides an automatic, data-driven approach for outlier detection that is based on probabilistic reasoning. Local outlier probability (LoOP) (Kriegel et al., 2009; Kriegel et al., 2011) proposes a normalization of the LOF scores to the interval [0,1] by using statistical scaling to increase usability across different data sets. Incremental and memory-efficient LOF methods (Pokrajac et al., 2007; Salehi et al., 2016) were developed to efficiently fit an online LOF algorithm to a data stream. To make LOF feasible in high-dimensional settings, random projection is a common preprocessing step for dimension reduction; it is based on the Johnson-Lindenstrauss lemma (Dasgupta, 2000; Bingham and Mannila, 2001). Projection-based approximate nearest neighbor methods (Liu et al., 2005; Jones et al., 2011) and approximate LOF methods (Lazarevic and Kumar, 2005; Aggarwal and Yu, 2001; De Vries et al., 2010) have been proposed and evaluated in recent literature.
In this paper, we propose a heuristic method to tune the LOF for anomaly detection. LOF uses two hyperparameters: the first is neighborhood size ($k$), which defines the neighborhood for the computation of local density; the second is contamination ($c$), which specifies the proportion of points to be labeled as anomalies. In other words, $k$ determines the score for ranking the training data, whereas $c$ determines the cutoff position for anomalies. Let $X \in \mathbb{R}^{n \times p}$ be the training data with a collection of $n$ data points, $x_i \in \mathbb{R}^p$. If $p$ is large, dimension-reduction methods should be used to preprocess the training data and project them onto a lower-dimensional subspace. In predictive maintenance, the anomaly proportion in the training data is usually low as opposed to the test data, which might contain unseen types of anomalies. If the anomaly proportion in the training data is known, we can use that as the value for $c$ and tune only the neighborhood size $k$; otherwise, both $k$ and $c$ would have to be tuned in LOF, which commonly is the case. We assume that anomalies have a lower local relative density as compared to normal points, so the top $\lfloor cn \rfloor$ points with the lowest local density (highest local outlier factor scores) are predicted as anomalies.

To jointly tune $k$ and $c$, we first define a grid of values for $k$ and $c$, and compute the local outlier factor score for each training data point under different settings of $k$ and $c$. For each pair of $c$ and $k$, let $M_{c,k,\mathrm{out}}$ and $V_{c,k,\mathrm{out}}$ denote the sample mean and variance, respectively, of the natural logarithm of local outlier factor scores for the $\lfloor cn \rfloor$ predicted anomalies (outliers). Accordingly, $M_{c,k,\mathrm{in}}$ and $V_{c,k,\mathrm{in}}$ denote the sample mean and variance, respectively, of the log local outlier factor scores for the top $\lfloor cn \rfloor$ predicted normal points (inliers), which have the highest local outlier factor scores. For each pair of $c$ and $k$, we define the standardized difference in the mean log local outlier factor scores between the predicted anomalies and normal points as

$$T_{c,k} = \frac{M_{c,k,\mathrm{out}} - M_{c,k,\mathrm{in}}}{\sqrt{\frac{1}{\lfloor cn \rfloor}\left(V_{c,k,\mathrm{out}} + V_{c,k,\mathrm{in}}\right)}}.$$

This formulation is similar to that of the classic two-sample $t$-test statistic. The optimal $k$ for each fixed $c$ is defined as $k_{c,\mathrm{opt}} = \arg\max_{k} T_{c,k}$. If $c$ is known a priori, we only need to find the $k_{c,\mathrm{opt}}$ that maximizes the standardized difference between outliers and inliers for that $c$. A logarithm transformation serves to symmetrize the distribution of local outlier factor scores and alleviate the influence of extreme values. Instead of focusing on all predicted normal points, we focus only on those $\lfloor cn \rfloor$ normal points that are most similar to the predicted anomalies in terms of their local outlier factor scores. The intuition behind our focus mimics the idea of support vector machine (Cortes and Vapnik, 1995) in that we want to maximize the difference between the predicted anomalies and the normal points that are close to the decision boundary.

We then consider the case when $c$ is not known a priori. Suppose that for each $c$, the log local outlier factor scores for outliers form a random sample of Gaussian distribution with mean $\mu_{c,\mathrm{out}}$ and variance $\sigma^2_{c,\mathrm{out}}$, and that the log local outlier factor scores for inliers form a random sample of Gaussian distribution with mean $\mu_{c,\mathrm{in}}$ and variance $\sigma^2_{c,\mathrm{in}}$. Then given $c$, $T_{c,k_{c,\mathrm{opt}}}$ approximately follows a noncentral $t$ distribution with $df_c = 2\lfloor cn \rfloor - 2$ degrees of freedom and noncentrality parameter

$$ncp_c = \frac{\mu_{c,\mathrm{out}} - \mu_{c,\mathrm{in}}}{\sqrt{\frac{1}{\lfloor cn \rfloor}\left(\sigma^2_{c,\mathrm{out}} + \sigma^2_{c,\mathrm{in}}\right)}}.$$

We cannot directly compare the largest standardized difference $T_{c,k_{c,\mathrm{opt}}}$ across different values of $c$ because $T_{c,k_{c,\mathrm{opt}}}$ follows different noncentral $t$ distributions depending on $c$. Instead, we can compare the quantiles that correspond to $T_{c,k_{c,\mathrm{opt}}}$ in each respective noncentral $t$ distribution so that the comparison is on the same scale. Define

$$c_{\mathrm{opt}} = \arg\max_{c} P\left(Z < T_{c,k_{c,\mathrm{opt}}};\ df_c,\ ncp_c\right),$$

where the random variable $Z$ follows a noncentral $t$ distribution with $df_c$ degrees of freedom and noncentrality parameter $ncp_c$. Thus, the optimal $c$ is the one where $T_{c,k_{c,\mathrm{opt}}}$ is the largest quantile in the corresponding $t$ distribution as compared to the others. Since we do not observe the noncentrality parameter, it will be estimated by plugging in sample means and variances for the true population counterparts. Figure 2 displays the flowchart of procedures for training a tuned LOF model.
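The tuning procedure can be sketched as follows, assuming that the local outlier factor scores of the training points have already been computed for every neighborhood size on the grid. The function names, the dictionary layout `scores_by_k`, and the numerical integration used for the noncentral t distribution are all illustrative choices rather than part of the original method description; a library routine such as `scipy.stats.nct.cdf` could be used in place of the hand-written `nct_cdf`. The sketch also assumes that at least two points are flagged as anomalies so that the sample variances exist.

```python
import math


def standardized_difference(lof, c):
    """Standardized difference for one vector of LOF scores and contamination c."""
    m = math.floor(c * len(lof))  # number of predicted anomalies
    if m < 2 or 2 * m > len(lof):
        raise ValueError("need 2 <= floor(c * n) <= n / 2")
    logs = sorted((math.log(s) for s in lof), reverse=True)
    out, inl = logs[:m], logs[m:2 * m]  # outliers, then inliers nearest the cutoff

    def mean_var(v):
        mu = sum(v) / len(v)
        return mu, sum((x - mu) ** 2 for x in v) / (len(v) - 1)

    (m_out, v_out), (m_in, v_in) = mean_var(out), mean_var(inl)
    t = (m_out - m_in) / math.sqrt((v_out + v_in) / m)
    # Plugging sample means and variances into the noncentrality parameter
    # gives the same number as t under this reading of the text.
    return t, 2 * m - 2, t


def nct_cdf(t, df, ncp, steps=4000):
    """P(Z < t) for a noncentral t variable, by midpoint integration over
    S = sqrt(chi2_df / df), using P(Z < t | S = s) = Phi(t * s - ncp)."""
    half_width = 12.0 / math.sqrt(2.0 * df)
    lo, hi = max(0.0, 1.0 - half_width), 1.0 + half_width
    h = (hi - lo) / steps
    log_const = math.log(2.0) + (df / 2.0) * math.log(df / 2.0) - math.lgamma(df / 2.0)
    total = 0.0
    for i in range(steps):
        s = lo + (i + 0.5) * h
        log_pdf = log_const + (df - 1.0) * math.log(s) - df * s * s / 2.0
        phi = 0.5 * math.erfc(-(t * s - ncp) / math.sqrt(2.0))
        total += phi * math.exp(log_pdf)
    return total * h


def tune(scores_by_k, contaminations):
    """scores_by_k maps each neighborhood size to the LOF scores of the
    training points. Returns the selected (contamination, neighborhood size)."""
    best = None
    for c in contaminations:
        # For this c, keep the neighborhood size with the largest statistic.
        k_opt, (t, df, ncp) = max(
            ((k, standardized_difference(s, c)) for k, s in scores_by_k.items()),
            key=lambda item: item[1][0],
        )
        prob = nct_cdf(t, df, ncp)
        if best is None or prob > best[0]:
            best = (prob, c, k_opt)
    return best[1], best[2]
```

Computing the statistic from sorted log scores keeps the cost of each grid cell small once the LOF scores are available, so most of the running time is spent in the LOF fits themselves.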
4 Experimental Results
4.1 Performance measures
We use both the area under the ROC curve (AUC) and the F1 score to evaluate the goodness of the optimal parameters that are tuned by the proposed metric. The F1 score is defined as

$$F_1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.$$

The F1 score is a measure of precision and recall at a particular threshold value on the ROC curve, and AUC is an average over all the threshold values.
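As a small worked illustration of the two measures, the sketch below computes the F1 score as the harmonic mean of precision and recall, and the AUC as the probability that a randomly chosen anomaly receives a higher score than a randomly chosen normal point. Treating anomalies as the positive class (label 1) is an assumption made here for illustration, and the labels and scores are made-up values, not results from the paper.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall, with label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparisons (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: two anomalies among six points.
y_true = [1, 1, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.2, 0.1]
y_pred = [1, 0, 1, 0, 0, 0]  # the two highest scores are flagged

f1 = f1_score(y_true, y_pred)   # precision = recall = 0.5, so F1 = 0.5
area = auc(y_true, scores)      # 7 of 8 anomaly/normal pairs are ordered correctly
```

The F1 score depends on where the cutoff is placed (here, the top two scores), whereas the AUC depends only on the ranking induced by the scores.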
4.2 Evaluations on small data sets
We first assess the performance of the proposed tuning metric on three small data sets by checking how the selected optimal neighborhood size and contamination perform in terms of the AUC and F1 score. Since the data dimension is low, no dimension reduction is needed in the data preprocessing.
Polygons data: This synthetic training set contains 1,600 points, which are uniformly sampled within a mixture of two randomly generated polygons as shown in Figure 3, where one polygon has a higher density than the other. Since no points are sampled outside the boundaries of the polygons, the anomaly proportion is 0 in the training set. The 10,000 data points in the synthetic validation set form a dense two-dimensional (2-D) mesh grid with both axes ranging from –10 to 10. The points inside the true boundaries are labeled as normal; the points outside are labeled anomalies.
Balls data: This synthetic training set contains 1,600 points, which are uniformly sampled within a mixture of two three-dimensional (3-D) balls as shown in Figure 4, where the ball centered at the origin has a smaller radius than the ball centered at (5,5,5). Since no points are sampled outside the boundary of the balls, the anomaly proportion is 0 in the training set. The 637 points in the synthetic validation set form two 3-D cubes, with each cube enveloping one of the training balls. The points inside the true boundaries are labeled as normal; the points outside are labeled anomalies.
Metal data: This engineering data set is used in Wise et al. (1999); it consists of the eight engineering variables from a LAM 9600 metal etcher over the course of etching 129 wafers (108 normal wafers and 21 wafers in which faults were intentionally induced during the same experiments). In the training set, we include 90% of the normal wafers data. The validation set is the entire data set.
For both the polygons data and the balls data, the grid of values for neighborhood size ranges from 10 to 50 in increments of 1, and the three contamination levels considered are 0.006, 0.008, and 0.01. In the metal data, the grid for neighborhood size ranges from 10 to 25 in increments of 1, and the three contamination levels considered are 0.08, 0.1, and 0.12. Table 2 shows the results on the three small data sets, where the proposed method produces a tuned LOF that has both F1 score and AUC very close to the optimal upper bound values on the prespecified grids.
4.3 Evaluations on large data sets
To evaluate the performance of the proposed tuning metric on large data sets, Gaussian random projection is implemented as a preprocessing step for dimension reduction. We do not discuss how to choose the dimension of the projected subspace, because dimension reduction is only for the purpose of computation feasibility in this paper. The computation cost of LOF is $np$ times the cost of a $k$-nearest-neighbor (KNN) query, which is needed in searching the neighborhood for each sample point. For low-dimensional data, a grid-based approach can be used to search for nearest neighbors so that the KNN query is constant in $n$. For high-dimensional data, the KNN query on average takes $O(\log n)$ time, with the worst case of $O(n)$, which would make the LOF algorithm extremely slow for large, high-dimensional data. In this paper, we use random projection for dimension reduction to make the computation feasible for the repetitive running of the LOF algorithm on large data sets. In practice, we recommend that the dimension of the data be reduced to the largest subspace that the computing resources can handle.
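A Gaussian random projection of the kind used for preprocessing can be sketched as below. The function name, the seed handling, and the choice to draw the matrix entries with variance equal to one over the target dimension (so that squared norms are preserved in expectation) are assumptions made for illustration; the paper does not specify these details.

```python
import math
import random


def gaussian_random_projection(rows, target_dim, seed=0):
    """Project each row (a sequence of p numbers) onto target_dim dimensions
    by multiplying with a p x target_dim matrix of Gaussian entries."""
    rng = random.Random(seed)
    p = len(rows[0])
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [[rng.gauss(0.0, scale) for _ in range(target_dim)] for _ in range(p)]
    return [
        [sum(x[j] * matrix[j][d] for j in range(p)) for d in range(target_dim)]
        for x in rows
    ]


# Example: reduce five 36-dimensional points to 3 dimensions.
data_rng = random.Random(1)
data = [[data_rng.random() for _ in range(36)] for _ in range(5)]
reduced = gaussian_random_projection(data, target_dim=3)
```

Because the matrix is determined by the seed, calling the function with the same seed on the training and validation sets applies the same projection to both, which is needed for the scores to be comparable.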
We assessed performance of the LOF method on the following data sets:
Spheres100: We generated 100 mixtures of 100-dimensional spheres data. In each mixture, the training set contains 100,000 points uniformly sampled from a random number (between 2 and 10) of spheres. Since no points are sampled outside the boundary of the spheres, the anomaly proportion is 0 in the training set. For the validation set in each mixture, 10,000 points are randomly sampled around each of the training spheres with 0.05 probability of being outside the boundaries (anomalies).
Cubes100: We generated 100 mixtures of 100-dimensional cubes data. In each mixture, the training set contains 100,000 points uniformly sampled from a random number (between 2 and 10) of cubes with dimension equal to 100. Since no points are sampled outside the boundary of the cubes, the anomaly proportion is 0 in the training set. For the validation set in each mixture, 10,000 points are randomly sampled around each of the training cubes with 0.05 probability of being outside the boundaries (anomalies).
Smtp: This data set is a subset from the original KDD Cup 1999 data set from the UCI Machine Learning Repository (Hettich and Bay, 1999), where the service attribute is smtp. The training set consists of 9,598 samples of normal internet connections and 36 continuous variables. The validation set contains 1,183 anomalies out of 96,554 samples (1.2%).
Http: This data set is also a subset from the original KDD Cup 1999 data set from UCI Machine Learning Repository (Hettich and Bay, 1999), where the service attribute is http. The training set consists of 61,886 samples of normal internet connections and 36 continuous variables. The validation set contains 4,045 anomalies out of 623,091 samples (0.6%).
Credit: This credit card fraud detection data set was collected during a research collaboration of Worldline and the Machine Learning Group of Université Libre de Bruxelles (Dal Pozzolo et al., 2015). It contains 284,807 records and 28 continuous variables. The training set consists of 142,157 normal credit card activity records. The validation set contains 492 fraudulent activity records out of 284,807 samples (0.2%).
Mnist: This data set is a subset from the publicly available MNIST database of handwritten digits (LeCun et al., 1998). The training set consists of 12,665 samples for digits “0” and “1”, which are defined as normal data in this specific application. The validation set consists of 10,000 samples for all 10 digits, where there are 7,885 (78.9%) anomalies.
Table 4 shows the performance of the tuning metric on the synthetic Cubes and Spheres data. After tuning, the mean F1 score and AUC are high and approach the best upper bound values in both cases, indicating good predictive performance of the tuned parameter settings. For the reduced subspace dimension of 3 with sample size 100,000, the average running time for LOF in both cases is less than 6 seconds, which shows the scalability of the tuning algorithm for a large sample size. Table 5
compares the tuned LOF versus other benchmark anomaly detection methods (one-class SVM and isolation forest) on large real data sets. For the first three data sets (Http, Smtp, and Credit), Gaussian random projection is used to reduce the dimension to 3. For the Mnist data, the reduced subspace dimension is 10 because the original data is high-dimensional. We repeat the random projection process 10 times and compare the mean (standard error) of the F1 score and the AUC between different methods. LOF is tuned using the proposed metric, whereas the hyperparameters in one-class SVM and isolation forest are chosen to be the configuration that has the highest F1 and AUC on the validation set. In the Http and Smtp data sets, the performance of the tuned LOF is comparable to the best result from one-class SVM; in Credit and Mnist, the tuned LOF has a higher mean F1 score and AUC than the other two benchmark methods. Note that the F1 scores from all methods are low on the Credit data, which might imply that the anomalies are not fully identifiable from the normal data in this case.
| Data | Mean F1 | Mean AUC | Mean computation |
| Data | Mean F1 | Mean AUC |
We propose a heuristic methodology for jointly tuning the hyperparameters of contamination and neighborhood size in the LOF algorithm, and we evaluate this methodology on both small and large data sets. In small data sets, the tuned hyperparameters correspond well to settings that have the highest F1 score and AUC. In large data sets, Gaussian random projection is used in the preprocessing step for dimension reduction, whose sole purpose is to improve computation efficiency. The predictive performance of the tuned LOF is comparable to the best results from one-class SVM on the Http and Smtp data, and the tuned LOF outperforms the other methods on the Credit and Mnist data.
Although the proposed tuning method works reasonably well in general, it is by no means guaranteed that the tuned parameters will maximize either the F1 score or the AUC. This is exactly the challenge in anomaly detection, where the test data differ from the training data in terms of the anomaly type and proportion. In order for the proposed tuning method to have good performance, we need to assume that the normal data are well sampled in the training data and that the anomalies can be identified from the normal data in terms of their relative local density. As long as those assumptions are not severely violated, the proposed metric (which is based on maximizing the standardized difference) will manage to arrive at a decent parameter configuration that differentiates the anomalies from the normal data. In future work, extending the tuning methodology to the setting of incremental LOF for streaming data is worth exploring.
The authors would like to thank Anne Baxter, Principal Technical Editor at SAS, for her assistance in creating this manuscript.
- Aggarwal and Yu (2001) Aggarwal, C. C. and P. S. Yu (2001). Outlier detection for high dimensional data. In ACM Sigmod Record, Volume 30, pp. 37–46. ACM.
- Alshawabkeh et al. (2010) Alshawabkeh, M., B. Jang, and D. Kaeli (2010). Accelerating the local outlier factor algorithm on a gpu for intrusion detection systems. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pp. 104–110. ACM.
- Bingham and Mannila (2001) Bingham, E. and H. Mannila (2001). Random projection in dimensionality reduction: applications to image and text data. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 245–250. ACM.
- Breunig et al. (2000) Breunig, M. M., H.-P. Kriegel, R. T. Ng, and J. Sander (2000). Lof: identifying density-based local outliers. In ACM sigmod record, Volume 29, pp. 93–104. ACM.
- Chen et al. (2007) Chen, M.-C., R.-J. Wang, and A.-P. Chen (2007). An empirical study for the detection of corporate financial anomaly using outlier mining techniques. In Convergence Information Technology, 2007. International Conference on, pp. 612–617. IEEE.
- Clifton et al. (2011) Clifton, L., D. A. Clifton, P. J. Watkinson, and L. Tarassenko (2011). Identification of patient deterioration in vital-sign data using one-class support vector machines. In Computer Science and Information Systems (FedCSIS), 2011 Federated Conference on, pp. 125–131. Citeseer.
- Cortes and Vapnik (1995) Cortes, C. and V. Vapnik (1995). Support-vector networks. Machine learning 20(3), 273–297.
- Dal Pozzolo et al. (2015) Dal Pozzolo, A., O. Caelen, R. A. Johnson, and G. Bontempi (2015). Calibrating probability with undersampling for unbalanced classification. In Computational Intelligence, 2015 IEEE Symposium Series on, pp. 159–166. IEEE.
- Dasgupta (2000) Dasgupta, S. (2000). Experiments with random projection. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pp. 143–151. Morgan Kaufmann Publishers Inc.
- De Vries et al. (2010) De Vries, T., S. Chawla, and M. E. Houle (2010). Finding local anomalies in very high dimensional space. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pp. 128–137. IEEE.
- Hettich and Bay (1999) Hettich, S. and S. Bay (1999). The UCI KDD archive [http://kdd.ics.uci.edu]. Irvine, CA: University of California, Department of Information and Computer Science 152.
- Jones et al. (2011) Jones, P. W., A. Osipov, and V. Rokhlin (2011). Randomized approximate nearest neighbors algorithm. Proceedings of the National Academy of Sciences.
- Jyothsna et al. (2011) Jyothsna, V., V. R. Prasad, and K. M. Prasad (2011). A review of anomaly based intrusion detection systems. International Journal of Computer Applications 28(7), 26–35.
- Kriegel et al. (2009) Kriegel, H.-P., P. Kröger, E. Schubert, and A. Zimek (2009). Loop: local outlier probabilities. In Proceedings of the 18th ACM conference on Information and knowledge management, pp. 1649–1652. ACM.
- Kriegel et al. (2011) Kriegel, H.-P., P. Kröger, E. Schubert, and A. Zimek (2011). Interpreting and unifying outlier scores. In Proceedings of the 2011 SIAM International Conference on Data Mining, pp. 13–24. SIAM.
- Lazarevic and Kumar (2005) Lazarevic, A. and V. Kumar (2005). Feature bagging for outlier detection. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pp. 157–166. ACM.
- LeCun et al. (1998) LeCun, Y., L. Bottou, Y. Bengio, and P. Haffner (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324.
- Liu et al. (2008) Liu, F. T., K. M. Ting, and Z.-H. Zhou (2008). Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422. IEEE.
- Liu et al. (2005) Liu, T., A. W. Moore, K. Yang, and A. G. Gray (2005). An investigation of practical approximate nearest neighbor algorithms. In Advances in neural information processing systems, pp. 825–832.
- Ning and Tsung (2012) Ning, X. and F. Tsung (2012). A density-based statistical process control scheme for high-dimensional and mixed-type observations. IIE transactions 44(4), 301–311.
- Patcha and Park (2007) Patcha, A. and J.-M. Park (2007). An overview of anomaly detection techniques: Existing solutions and latest technological trends. Computer networks 51(12), 3448–3470.
- Pokrajac et al. (2007) Pokrajac, D., A. Lazarevic, and L. J. Latecki (2007). Incremental local outlier detection for data streams. In Computational intelligence and data mining, 2007. CIDM 2007. IEEE symposium on, pp. 504–515. IEEE.
- Quinn and Williams (2007) Quinn, J. A. and C. K. Williams (2007). Known unknowns: Novelty detection in condition monitoring. In Iberian Conference on Pattern Recognition and Image Analysis, pp. 1–6. Springer.
- Salehi et al. (2016) Salehi, M., C. Leckie, J. C. Bezdek, T. Vaithianathan, and X. Zhang (2016). Fast memory efficient local outlier detection in data streams. IEEE Transactions on Knowledge and Data Engineering 28(12), 3246–3260.
- Schölkopf et al. (2001) Schölkopf, B., J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson (2001). Estimating the support of a high-dimensional distribution. Neural computation 13(7), 1443–1471.
- Tarassenko et al. (1995) Tarassenko, L., P. Hayton, N. Cerneaz, and M. Brady (1995). Novelty detection for the identification of masses in mammograms.
- Wise et al. (1999) Wise, B. M., N. B. Gallagher, S. W. Butler, D. D. White, and G. G. Barna (1999). A comparison of principal component analysis, multiway principal component analysis, trilinear decomposition and parallel factor analysis for fault detection in a semiconductor etch process. Journal of Chemometrics 13(3-4), 379–396.
- Zhao et al. (2017) Zhao, Y., P. Liu, Z. Wang, L. Zhang, and J. Hong (2017). Fault and defect diagnosis of battery for electric vehicles based on big data analysis methods. Applied Energy 207, 354–362.