While the majority of records in databases are normal or expected, some can be unusual or unexpected. Data records or instances that deviate significantly from, and do not conform to, normal data are called anomalies or outliers. Automatic detection of such anomalies is the task of Anomaly Detection (AD) in data mining [17, 6]. It has many applications, such as intrusion detection in computer networks, fraudulent transaction detection in banking and malignant tumor detection in healthcare.
In the data mining literature, the problem of anomaly detection is solved using three learning approaches:
Supervised: A classification model is learned using training instances from both normal and anomalous classes and the learned model is used to predict the class labels of test data. It requires sufficient labelled training samples from both classes. In many real-world applications, it might be very expensive or even impossible to obtain labelled training samples from the anomalous class.
Unsupervised: Instances in a database are ranked directly based on some outlier score. This approach does not require a training process and labelled samples. Assuming that anomalies are few and different, it uses distance/density to rank them. While this approach works quite well in scenarios where the assumption holds, it results in poor performance when this assumption does not hold, i.e., when there are far too many anomalies.
Semi-supervised: A profile of normal data is learned using labelled training data from the normal class only. In the testing phase, instances are ranked based on how well they comply with the learned profile of normal data. It neither makes any assumption about anomalies nor requires training samples from the anomalous class.
Many traditional unsupervised and semi-supervised approaches are based on $k$-Nearest Neighbours ($k$NN) [6, 5, 3], which require pairwise distance calculations. Being few and different, anomalies are expected to have larger distances to their $k$NNs than normal instances have. Because of the computational cost of pairwise distance calculations, these distance-based methods are limited to small databases; they cannot be used in databases with a large number of instances.
In this era of big data, databases are growing rapidly in terms of both the number of instances and dimensionality. Fast automatic detection of anomalies in these massive databases is a challenging task. There are some efficient unsupervised anomaly detectors [11, 16, 9, 1] which are applicable to large databases.
The univariate histogram-based method [6, 9, 1] is arguably the fastest anomaly detector. It creates a univariate histogram in each dimension and computes the anomaly score of an instance as the product of the probability masses of the histogram bins into which the instance falls across all dimensions. It assumes that attributes are independent of each other. Despite this strong assumption, it has been shown to produce competitive results compared to state-of-the-art contenders while running very fast [9, 1]. It can detect anomalies that show outlying characteristics in any single input feature (i.e., in a one-dimensional subspace). We refer to such anomalies as ‘Type I Anomalies’. However, because it does not capture the relationship between input features, it cannot detect anomalies that look normal in every one-dimensional subspace but show outlying characteristics in multidimensional subspaces (we refer to them as ‘Type II Anomalies’). In many applications, input features are correlated, and anomalies do not differ from normal instances in any single feature; they look different only when features are examined together.
In this paper, we attempt to overcome this limitation of the histogram-based method using principal components [8, 10]. Principal Components (PCs) are mappings of subspaces of possibly correlated features in the input space into a new space. We introduce a new variant of the histogram-based anomaly detection method called ‘SPAD+’, in which anomaly scores of instances are computed from univariate histograms of both the input features and the principal components. This way, histograms on input features contribute to detecting Type I anomalies and histograms on PCs contribute to detecting Type II anomalies. Our empirical results on 15 benchmark datasets show that SPAD+ significantly improves the detection accuracy of SPAD, a histogram-based anomaly detector using input features only, without compromising much in terms of runtime. It produces better results in terms of both detection accuracy and runtime than the traditional nearest neighbour based method and more recent, faster anomaly detectors.
The rest of the paper is structured as follows. Preliminaries related to this paper and a review of widely used state-of-the-art fast anomaly detection methods are provided in Section 2. The proposed method, SPAD+, is discussed in Section 3, followed by experimental results in Section 4 and conclusions in the last section.
2 Preliminaries and related work
In this paper, we focus on the continuous domain where each data instance $\mathbf{x}$ is represented by an $m$-dimensional vector $\mathbf{x} = \langle x_1, x_2, \ldots, x_m \rangle$ (where $x_i \in \mathbb{R}$, the real domain). Each $x_i$ is the value of the $i^{th}$ feature of $\mathbf{x}$. We use the semi-supervised approach for anomaly detection, where a profile of normal data is learned from training data and test data (a mixture of normal and anomalous data) are ranked according to their anomaly scores based on the learned profile of normal data. Let $D$ be a collection of $n$ training data (all normal) and $Q$ be a collection of test data (normal and anomalies). The task is to model the profile of normal data from $D$ and rank data in $Q$ based on their compliance with the profile of normal data.
In the rest of this section, we review the most widely used fast anomaly detection methods.
2.1 Local Outlier Factor (LOF)
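Local Outlier Factor (LOF) [5] is the most widely used and popular $k$NN-based anomaly detection method. It does not require any training. The anomaly score of $\mathbf{x}$ is calculated based on the distances to its $k$NNs in $D$. It uses the concept of local reachability density (lrd). Let $N_k(\mathbf{x})$ be the set of $k$NNs of $\mathbf{x}$ in $D$, $d(\mathbf{x}, \mathbf{y})$ be the Euclidean distance between $\mathbf{x}$ and $\mathbf{y}$, and $d_k(\mathbf{x})$ be the Euclidean distance between $\mathbf{x}$ and its $k^{th}$ NN in $D$. The anomaly score of $\mathbf{x}$ is calculated as:

$$s_{LOF}(\mathbf{x}) = \frac{\sum_{\mathbf{y} \in N_k(\mathbf{x})} lrd(\mathbf{y})}{|N_k(\mathbf{x})| \times lrd(\mathbf{x})} \tag{1}$$

where $lrd(\mathbf{x}) = \frac{|N_k(\mathbf{x})|}{\sum_{\mathbf{y} \in N_k(\mathbf{x})} \max\left(d_k(\mathbf{y}), d(\mathbf{x}, \mathbf{y})\right)}$. Note that $|N_k(\mathbf{x})|$ may not be exactly $k$ when many instances are equidistant to $\mathbf{x}$, i.e., $|N_k(\mathbf{x})| \geq k$.
The anomaly score of $\mathbf{x}$ is computed with respect to its local neighbourhood in $D$ defined by $N_k(\mathbf{x})$. It measures how different $\mathbf{x}$ is from its $k$NNs in terms of lrd. The anomaly score is based on the distances of $\mathbf{x}$ to its $k$NNs and their distances to their own $k$NNs. It can be computationally very expensive when $n$ is large, limiting its use to small datasets only.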
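To make the LOF computation concrete, the following is a minimal NumPy sketch of the semi-supervised use described above: neighbours of a test instance are searched in the training data only, and lrd values of training instances are computed once and reused. The function names (`pairwise_dist`, `lof_scores`), the choice `k=5` and the synthetic data are illustrative assumptions, and ties in distance are ignored so that exactly `k` neighbours are used.

```python
import numpy as np

def pairwise_dist(A, B):
    """Euclidean distance between every row of A and every row of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def lof_scores(D, Q, k=5):
    """LOF of each test row in Q relative to training data D.

    Simplification (assumption): exactly k neighbours are used, so ties are ignored.
    """
    # Training side: kNNs and k-distance of every y in D (a point is not its own neighbour)
    d_DD = pairwise_dist(D, D)
    np.fill_diagonal(d_DD, np.inf)
    nn_D = np.argsort(d_DD, axis=1)[:, :k]
    d_to_nn_D = np.take_along_axis(d_DD, nn_D, axis=1)
    kdist_D = d_to_nn_D[:, -1]
    # lrd(y) = k / sum over kNNs z of max(d_k(z), d(y, z))
    lrd_D = k / np.maximum(kdist_D[nn_D], d_to_nn_D).sum(axis=1)

    # Test side: neighbours of x are searched in D only
    d_QD = pairwise_dist(Q, D)
    nn_Q = np.argsort(d_QD, axis=1)[:, :k]
    d_to_nn_Q = np.take_along_axis(d_QD, nn_Q, axis=1)
    lrd_Q = k / np.maximum(kdist_D[nn_Q], d_to_nn_Q).sum(axis=1)

    # LOF(x) = average of lrd(y) / lrd(x) over the kNNs y of x
    return lrd_D[nn_Q].mean(axis=1) / lrd_Q

rng = np.random.default_rng(0)
D = rng.normal(size=(200, 2))            # synthetic normal training data
Q = np.array([[0.0, 0.0], [8.0, 8.0]])   # a central point and a far-away point
lof = lof_scores(D, Q, k=5)              # higher score means more anomalous
```

The two full distance matrices are what make this approach costly for large $n$; the sketch keeps them explicit for readability rather than efficiency.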
2.2 Isolation forest (iforest)
Isolation forest (iforest) [11] is an efficient anomaly detection method which does not require pairwise distance calculations. It uses a collection of $t$ unsupervised random trees, where each tree $T_i$ is constructed from a small subsample $\mathcal{D}_i \subset D$ of $\psi$ training instances. The idea is to isolate every sample in $\mathcal{D}_i$. At each internal node of $T_i$, the space is partitioned into two non-empty regions with a random split on a randomly selected attribute. The partitioning continues until the node has only one instance (which is isolated from the rest) or the maximum height of $T_i$ is reached. The anomaly score of $\mathbf{x}$ is estimated as the average path length over $t$ trees:

$$s_{iforest}(\mathbf{x}) = \frac{1}{t} \sum_{i=1}^{t} \ell_i(\mathbf{x}) \tag{2}$$

where $\ell_i(\mathbf{x})$ is the path length of $\mathbf{x}$ in tree $T_i$.
The simple intuition is that anomalies are more susceptible to isolation and therefore have shorter average path lengths than normal instances. It has been shown to produce detection accuracy competitive with LOF while running significantly faster, as it does not require distance calculations [11].
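The isolation idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference implementation: it scores an instance by its raw average path length, stops a branch when the chosen attribute has no spread, and omits the leaf-size adjustment used in the original iforest paper. The names `build_tree`, `path_length` and `iforest_scores` and the default values `t=100`, `psi=256` are assumptions made for this example.

```python
import numpy as np

def build_tree(X, rng, height, max_height):
    """Grow one isolation tree; returns None at a leaf, else (attr, split, left, right)."""
    if len(X) <= 1 or height >= max_height:
        return None
    attr = rng.integers(X.shape[1])          # randomly selected attribute
    lo, hi = X[:, attr].min(), X[:, attr].max()
    if lo == hi:                             # simplification: stop if this attribute cannot split
        return None
    split = rng.uniform(lo, hi)              # random split point
    go_left = X[:, attr] < split
    return (attr, split,
            build_tree(X[go_left], rng, height + 1, max_height),
            build_tree(X[~go_left], rng, height + 1, max_height))

def path_length(x, node):
    """Number of internal nodes x passes through before reaching a leaf."""
    length = 0
    while node is not None:
        attr, split, left, right = node
        node = left if x[attr] < split else right
        length += 1
    return length

def iforest_scores(D, Q, t=100, psi=256, seed=0):
    """Average path length over t trees; shorter means more anomalous."""
    rng = np.random.default_rng(seed)
    size = min(psi, len(D))
    max_height = int(np.ceil(np.log2(size)))
    trees = [build_tree(D[rng.choice(len(D), size=size, replace=False)], rng, 0, max_height)
             for _ in range(t)]
    return np.array([np.mean([path_length(x, tree) for tree in trees]) for x in Q])

rng = np.random.default_rng(0)
D = rng.normal(size=(1000, 2))
Q = np.array([[0.0, 0.0], [6.0, 6.0]])
avg_path = iforest_scores(D, Q)
```

No distance is ever computed: each tree only compares one attribute value against one split point per level, which is where the speed advantage over $k$NN-based methods comes from.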
2.3 Nearest neighbour distance in a small subsample (Sp)
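Instead of using $k$NNs of $\mathbf{x}$ in the entire training data $D$, Sugiyama and Borgwardt (2013) [16] argued that it is sufficient to use 1NN ($k = 1$) in a small subsample of data $\mathcal{D} \subset D$ of size $\psi$. They suggested using the distance of $\mathbf{x}$ to its nearest neighbour in $\mathcal{D}$ as the anomaly score:

$$s_{Sp}(\mathbf{x}) = \min_{\mathbf{y} \in \mathcal{D}} d(\mathbf{x}, \mathbf{y}) \tag{3}$$

It has been shown that a very small $\psi$ produces results competitive with LOF while running orders of magnitude faster on large datasets [16].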
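The subsample-based score is short enough to write out directly. The sketch below is a plain NumPy illustration; the function name `sp_scores`, the subsample size `psi=20` and the synthetic data are assumptions chosen for the example rather than values taken from this paper.

```python
import numpy as np

def sp_scores(D, Q, psi=20, seed=0):
    """Distance of each test row in Q to its nearest neighbour in a random subsample of D."""
    rng = np.random.default_rng(seed)
    sample = D[rng.choice(len(D), size=min(psi, len(D)), replace=False)]
    diff = Q[:, None, :] - sample[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=2))   # shape (len(Q), psi)
    return dists.min(axis=1)                   # larger means more anomalous

rng = np.random.default_rng(0)
D = rng.normal(size=(1000, 2))
Q = np.array([[0.0, 0.0], [6.0, 6.0]])
sp = sp_scores(D, Q)
```

Each test instance needs only `psi` distance computations regardless of the training set size, which explains the large speed-up over a full $k$NN search.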
2.4 Histogram based method
Another simple and efficient anomaly detection method is based on univariate histograms. In each dimension $i$, it creates a histogram with a fixed number $b$ of equal-width bins and records the number of training instances falling in each bin. Anomalies are expected to fall in bins with a small number of training samples.
Aryal et al. (2016) introduced a simple probabilistic anomaly detector (SPAD) [1] where the multivariate probability of $\mathbf{x}$, $P(\mathbf{x})$, is approximated as the product of univariate probabilities $P(x_i)$, assuming attributes are independent of each other. $P(x_i)$ is estimated using the probability mass obtained by discretising values in dimension $i$ with equal-width histograms. They use a modified version of equal-width discretisation which is more robust to skewed distributions of data in a dimension. Instead of dividing the entire data range $[min_i, max_i]$ ($min_i$ and $max_i$ are the minimum and maximum values in dimension $i$) into $b$ equal-width bins, they divide a range defined by $\mu_i$ and $\sigma_i$ ($\mu_i$ and $\sigma_i$ are the mean and standard deviation of values in dimension $i$) into $b$ equal-width bins. The bin width in each dimension therefore depends on the data variance in that dimension. The anomaly score of $\mathbf{x}$ is then estimated as:

$$s_{SPAD}(\mathbf{x}) = \sum_{i=1}^{m} \log \hat{P}\left(h_i(\mathbf{x})\right) \tag{4}$$

where $h_i(\mathbf{x})$ is the bin in dimension $i$ into which $\mathbf{x}$ falls and $\hat{P}(h_i(\mathbf{x}))$ is the probability mass of training data in that bin. Note that the RHS of Eqn 4 is equivalent to the logarithm of $P(\mathbf{x})$.
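A histogram-based score of this kind can be sketched as follows. Several details are assumptions made for illustration only, because this section does not pin them down: the binned range is taken to be $[\mu_i - 3\sigma_i, \mu_i + 3\sigma_i]$ (the `width=3.0` parameter), values outside that range are clipped into the two end bins, bin masses use add-one smoothing to avoid $\log 0$, and `b=10` bins are used. The names `fit_spad` and `spad_score` are hypothetical.

```python
import numpy as np

def fit_spad(D, b=10, width=3.0):
    """Learn one histogram per dimension from (normal) training data D."""
    n, m = D.shape
    std = D.std(axis=0)
    lo = D.mean(axis=0) - width * std                    # left edge of binned range (assumed)
    step = np.where(std > 0, 2 * width * std / b, 1.0)   # bin width scales with the spread
    idx = np.clip(np.floor((D - lo) / step).astype(int), 0, b - 1)
    counts = np.stack([np.bincount(idx[:, i], minlength=b) for i in range(m)])
    log_mass = np.log((counts + 1) / (n + b))            # smoothed mass (assumed), shape (m, b)
    return lo, step, log_mass

def spad_score(model, X):
    """Sum over dimensions of the log mass of the bin each value falls into."""
    lo, step, log_mass = model
    b = log_mass.shape[1]
    idx = np.clip(np.floor((X - lo) / step).astype(int), 0, b - 1)
    return log_mass[np.arange(X.shape[1]), idx].sum(axis=1)  # lower means more anomalous

rng = np.random.default_rng(0)
D = rng.normal(size=(1000, 2))
Q = np.array([[0.0, 0.0], [5.0, 0.0]])   # second point is extreme in the first feature
spad = spad_score(fit_spad(D), Q)
```

Training is a single pass over the data per dimension and scoring is one table lookup per dimension, so no pairwise distances are needed. The second test point is a Type I anomaly: it is outlying in one feature on its own, so its score is lower.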
Among the four anomaly detection methods reviewed above, LOF is a widely used baseline and the other three are fast anomaly detectors for large datasets. Their time and space complexities are provided in Table 1. In terms of time and space complexities, SPAD is clearly the most efficient.
For LOF, the training cost listed is not training per se; it is the cost of computing the lrd of all training instances in advance, to use later while computing anomaly scores for test instances.
3 The proposed method: SPAD+
Though SPAD can detect anomalous data exhibiting outlying characteristics in any single dimension (Type I Anomalies), it cannot detect anomalous data which look normal in each dimension but exhibit outlying characteristics only when multiple features are examined together (Type II Anomalies). For example, the data point shown in blue in Fig 1(a) looks perfectly normal when examined in each dimension individually (it is in the middle of the distribution in each dimension). SPAD cannot detect such an obvious anomaly because it does not capture the relationship between input features. In many real-world applications, features are related and anomalous data may not conform to that relationship.
We believe the above-mentioned issue of SPAD can be addressed to some extent with Principal Component Analysis (PCA) [8, 10]. PCA is a tool to transform potentially correlated features into new uncorrelated features. It learns the transformation matrix from the covariance matrix of the observed data. Fig 1(b) shows the data in Fig 1(a) transformed into the PC space. The anomalous instance in blue is clearly an outlier in the second principal component (the vertical axis in Fig 1(b)), which can be easily detected by SPAD in the new PC space.
In order to cater for both Type I and Type II anomalies, we propose to add principal components (PCs) as new features in addition to the input features. Input features contribute to detecting Type I anomalies and PCs contribute to detecting Type II anomalies. Using PCs only may not be a good idea because it may mask anomalies which are easily detectable in the input space (in our experiments, we observed that using PCs only produced worse results in many datasets).
For semi-supervised anomaly detection, PCA is applied to the training dataset $D$ to learn the transformation matrix, which is then used to transform data in both $D$ and $Q$. The feature size is increased from $m$ to $2m$ by adding all $m$ PCs. Histograms are constructed in each dimension of both the input space and the PC space. We refer to SPAD applied to input features and PCs together as SPAD+. Let $\mathbf{x}' = \langle x'_1, x'_2, \ldots, x'_m \rangle$ be the transformation of $\mathbf{x}$ in the PC space. The anomaly score of $\mathbf{x}$ is estimated as:

$$s_{SPAD+}(\mathbf{x}) = \sum_{i=1}^{m} \log \hat{P}\left(h_i(\mathbf{x})\right) + \sum_{j=1}^{m} \log \hat{P}\left(h'_j(\mathbf{x}')\right) \tag{5}$$

where $h_i(\mathbf{x})$ is the bin into which $\mathbf{x}$ falls in input dimension $i$, $h'_j(\mathbf{x}')$ is the bin into which $\mathbf{x}'$ falls in the $j^{th}$ PC, and $\hat{P}(\cdot)$ is the probability mass of training data in a bin.
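Since the figure may not be at hand, the effect can be reproduced numerically. The sketch below builds two strongly correlated synthetic features (an assumption standing in for the data of Fig 1), learns the PCA transformation from the covariance matrix via `np.linalg.eigh`, and compares how many standard deviations a hand-picked off-trend point lies from the mean in the input features and in the PCs.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=2000)
D = np.column_stack([x1, x1 + 0.1 * rng.normal(size=2000)])  # two correlated features
anomaly = np.array([0.5, -0.5])   # ordinary in each feature, but off the diagonal trend

# PCA from the covariance matrix: eigenvectors are the new axes
mean = D.mean(axis=0)
eigvals, W = np.linalg.eigh(np.cov(D, rowvar=False))  # eigenvalues in ascending order
W = W[:, ::-1]                                        # first column = highest-variance PC
D_pc = (D - mean) @ W                                 # training data in the PC space

# Standardised position of the anomaly in each space
z_input = (anomaly - mean) / D.std(axis=0)
z_pc = ((anomaly - mean) @ W) / D_pc.std(axis=0)
pc_cov = np.cov(D_pc, rowvar=False)                   # off-diagonal entries are ~0
```

The point sits within one standard deviation of the mean in both input features, yet it is many standard deviations out along the second PC, which is the situation a per-dimension histogram in the PC space can pick up.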
In the literature, PCA is used mainly to reduce the dimensionality of data, where the top PCs that capture the most variance in the data are selected and PCs with low variance are ignored. However, the purpose of using PCs in SPAD+ is not to reduce dimensionality; they are used to capture correlation between attributes in order to detect Type II anomalies. Low variance PCs can be very useful for detecting anomalies as they may contain a few values which are significantly different from the rest and thus more likely to be anomalies (e.g., in Fig 1(b), the blue point is an outlier in the second PC, where variance is lower than in the first PC). Ignoring low variance PCs can be counterproductive for anomaly detection (in our experiments, we observed that adding only the top PCs capturing 95% of the variance produced worse results than adding all PCs in many datasets). Thus, we add all PCs so that as many anomalies as possible are detected.
In terms of runtime and space complexities, SPAD+ has similar complexities to SPAD. In the training process, it requires additional time (worst case) of $O(nm^2)$ to compute the covariance matrix of the training data and $O(m^3)$ for its eigen decomposition. In the testing phase, the only difference is that $m$ is increased to $2m$. Because of the $m^3$ term, training can be computationally expensive in high dimensional datasets where $m$ is large. However, this is the worst case and the average case can be faster. It may not be an issue unless $m$ is very large, in the thousands or millions.
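The risk of discarding low variance PCs can be checked on a small synthetic case. In the sketch below (synthetic correlated data and a hand-picked point, both assumptions for illustration), the number of PCs needed to reach 95% of the variance is computed from the eigenvalues, and the standardised position of the point is compared between the PCs that would be kept and those that would be dropped.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=2000)
D = np.column_stack([x1, x1 + 0.1 * rng.normal(size=2000)])
anomaly = np.array([0.5, -0.5])

mean = D.mean(axis=0)
eigvals, W = np.linalg.eigh(np.cov(D, rowvar=False))
eigvals, W = eigvals[::-1], W[:, ::-1]               # sort PCs by descending variance

ratio = np.cumsum(eigvals) / eigvals.sum()           # cumulative explained variance
n_keep = int(np.searchsorted(ratio, 0.95) + 1)       # PCs needed to reach 95%

# Distance from the mean along each PC, in units of that PC's standard deviation
z_pc = np.abs((anomaly - mean) @ W) / np.sqrt(eigvals)
z_kept, z_dropped = z_pc[:n_keep], z_pc[n_keep:]
```

Here a single PC already explains more than 95% of the variance, so a variance-based cut-off keeps only the first PC, and the point's deviation shows up only along the PC that was dropped.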
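Putting the pieces of this section together, the procedure can be sketched as below. This is an illustrative sketch under stated assumptions rather than the authors' implementation: histograms span $[\mu - 3\sigma, \mu + 3\sigma]$ with out-of-range values clipped into the end bins, bin masses use add-one smoothing, `b=10` bins are used, and all function names are hypothetical. The data are synthetic.

```python
import numpy as np

def fit_hist(X, b=10, width=3.0):
    """Per-dimension histograms over [mean - width*std, mean + width*std] (assumed range)."""
    n, m = X.shape
    std = X.std(axis=0)
    lo = X.mean(axis=0) - width * std
    step = np.where(std > 0, 2 * width * std / b, 1.0)
    idx = np.clip(np.floor((X - lo) / step).astype(int), 0, b - 1)
    counts = np.stack([np.bincount(idx[:, i], minlength=b) for i in range(m)])
    return lo, step, np.log((counts + 1) / (n + b))   # smoothed log mass (assumed)

def hist_score(model, X):
    """Sum of log bin masses over dimensions; lower means more anomalous."""
    lo, step, log_mass = model
    b = log_mass.shape[1]
    idx = np.clip(np.floor((X - lo) / step).astype(int), 0, b - 1)
    return log_mass[np.arange(X.shape[1]), idx].sum(axis=1)

def fit_spad_plus(D, b=10):
    """Learn PCA on training data, then histograms on input features and on all PCs."""
    mean = D.mean(axis=0)
    _, W = np.linalg.eigh(np.cov(D, rowvar=False))    # columns of W are all m PCs
    return mean, W, fit_hist(D, b), fit_hist((D - mean) @ W, b)

def spad_plus_score(model, X):
    """Input-feature score plus PC score, i.e. a sum over 2m histograms."""
    mean, W, h_in, h_pc = model
    return hist_score(h_in, X) + hist_score(h_pc, (X - mean) @ W)

rng = np.random.default_rng(0)
x1 = rng.normal(size=2000)
D = np.column_stack([x1, x1 + 0.1 * rng.normal(size=2000)])   # correlated normal data
Q = np.array([[0.5, 0.5], [0.5, -0.5]])   # on-trend point, then a Type II anomaly

model = fit_spad_plus(D)
in_only = hist_score(model[2], Q)         # input features only
plus = spad_plus_score(model, Q)          # input features and PCs
```

With input features only, the two test points receive nearly the same score because each value is ordinary in its own dimension. Once the PC histograms are added, the off-trend point falls into a sparsely populated bin of the low variance PC and its score drops well below that of the on-trend point.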
4 Empirical evaluation
In this section, we provide details of our experimental setup to evaluate the performance of SPAD+ against the four methods (LOF, iforest, Sp and SPAD) discussed in Section 2, and discuss the results. We conducted experiments in the semi-supervised setting: half of the data belonging to the normal class were used as the training set, and the remaining half along with the data belonging to the anomaly class were used as the test set, as done in prior work. The anomaly detection model was learned from the training data and the test data were ranked based on their anomaly scores using the learned model. We used the area under the receiver operating characteristic curve (AUC) as the performance evaluation measure. For the randomised methods, iforest and Sp, each experiment was repeated 10 times and we report the average AUC over the 10 runs. For each dataset, the same $D$ and $Q$ pair was used for all experiments. Min-max normalisation was used to ensure that feature values in each dimension are in the same range.
All methods were implemented in Python using the Scikit-learn machine learning library [15]. We used the LOF implementation available in Scikit-learn. All experiments were conducted on a Linux machine. Parameters in all algorithms were set to the default values suggested by the respective papers: LOF ($k$); iforest ($t$ and $\psi$); Sp ($\psi$); and $b$ in SPAD and SPAD+.
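The evaluation protocol described above can be sketched as follows. The helper names are hypothetical, labels are assumed to be 0 for normal and 1 for anomalous, and fitting the min-max ranges on the training set only is an assumption, since the text does not say which data the ranges are taken from. AUC is computed directly from its rank interpretation.

```python
import numpy as np

def semi_supervised_split(X, y, seed=0):
    """Half of the normal data for training; the other half plus all anomalies for testing."""
    rng = np.random.default_rng(seed)
    normal = rng.permutation(np.flatnonzero(y == 0))
    half = len(normal) // 2
    test_idx = np.concatenate([normal[half:], np.flatnonzero(y == 1)])
    return X[normal[:half]], X[test_idx], y[test_idx]

def min_max(D, Q):
    """Rescale each feature using training minima and maxima (assumption)."""
    lo, hi = D.min(axis=0), D.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (D - lo) / span, (Q - lo) / span

def auc(anomaly_scores, y):
    """Probability that a random anomaly outranks a random normal instance (ties count half)."""
    pos, neg = anomaly_scores[y == 1], anomaly_scores[y == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Tiny worked example: 8 normal rows and 2 anomalous rows
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
D, Q, y_Q = semi_supervised_split(X, y)
D_norm, Q_norm = min_max(D, Q)
example_auc = auc(np.array([3.0, 2.0, 1.0, 2.0]), np.array([1, 1, 0, 0]))
```

`auc` expects scores where higher means more anomalous. For detectors whose score decreases with anomalousness, such as a log-probability or an average path length, the score would be negated before calling it.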
We used 15 widely used publicly available benchmark datasets from various application areas such as space physics, health and medicine, cyber security, pharmaceutical chemistry and geographical information system (GIS). The characteristics of datasets in terms of dimensionality, training data size, the numbers of normal data and anomalies in the test data and application area are provided in Table 2.
The AUC of all five contenders in the 15 datasets are provided in Table 3. SPAD+ produced the best results in seven datasets, while producing the second best results in six datasets and the third best results in the remaining two datasets. It was ranked among the top three best performing methods in all the 15 datasets. The baseline method of LOF is the closest contender with the best AUC in six datasets. It was ranked second in one dataset, third in three datasets and fourth in four datasets. It was the worst performing method in the remaining four datasets. Sp and iforest did not produce the best result in any dataset and each of them produced the worst results in two datasets. While original SPAD produced the best result in two datasets, it produced the worst results in seven datasets.
From these results, it is clear that by adding PCs, SPAD+ significantly improved the AUC of SPAD in 13 out of 15 datasets, turning the worst performing method into the top performing one. This is because of its ability to examine anomalies in both individual input features and multidimensional subspaces represented by PCs, which enables SPAD+ to detect both Type I and Type II anomalies. The results of existing fast anomaly detectors were not competitive with those of SPAD+: iforest did not produce a better AUC than SPAD+ in any dataset, whereas Sp produced a better AUC than SPAD+ in two datasets only. Compared to SPAD, the AUC of SPAD+ was slightly worse in Satellite and significantly worse in Annthyroid, but it was still better than that of all other contenders.
The total runtimes (training and testing) of the five contending methods in the 15 datasets are provided in Table 4. As expected, SPAD+ ran slower than SPAD. Compared to the other contenders, it was slower in small datasets but faster in large datasets. For example, SPAD+ was one order of magnitude faster than LOF in the largest dataset.
These results show that SPAD+ significantly improves anomaly detection performance without compromising much in terms of runtime, particularly in large datasets with a large number of data instances. It is a simple and intuitive method which is more appropriate for big data characterised by large data size.
The only difference between SPAD and SPAD+ is the addition of PCs as new features. One can argue that the same idea can be used with existing methods. We tested them using the original input features together with PCs, and the results are presented in Table 5. The results of LOF and Sp remained largely unchanged, whereas those of iforest improved in some datasets. iforest managed to produce a better AUC than SPAD+ in Ionosphere only; its results in the other 14 datasets were worse than those of SPAD+.
Adding PCs did not add any value to LOF and Sp. This is because their anomaly scores are based on nearest neighbour distances, which already examine anomalies using all dimensions even when only input features are used. Adding PCs results in additional redundant features and does not necessarily affect the outcome of the algorithm.
However, PCs can be useful in iforest. When only input features are used, it can examine anomalies using multiple dimensions to some extent, because leaf nodes in trees are constructed by partitioning the space using different attributes. However, the number of attributes used in the examination is restricted by the height of the leaf nodes. In the best case, it can use up to 8 dimensions ($\log_2 \psi$, as $\psi$ was set to 256 by default). Because attributes are selected at random and the tree building process terminates early when a sample instance is isolated, even 8 attributes may not be used. Therefore, adding PCs enables it to examine anomalies using many attributes at the same time and detect more Type II anomalies. Because iforest mostly examines anomalies using randomly selected multiple attributes, it forces correlation even though it may not exist in the dataset, which results in false positives, i.e., some normal instances may appear to be anomalies. This could be one of the reasons why iforest cannot perform as well as SPAD+ when input features and PCs are used, because PCs are by definition uncorrelated, i.e., they are orthogonal to each other.
In SPAD, PCs are very useful for examining potentially correlated attributes together to detect Type II anomalies. Because it examines each dimension individually and does not force any non-existent correlation, it can avoid false positives like those in iforest. In SPAD+, the original input features are useful for detecting Type I anomalies and PCs are useful for detecting Type II anomalies, complementing each other to produce a better outcome when applied together.
5 Concluding remarks
The idea of estimating multivariate probability as the product of univariate probabilities, assuming that variables are independent of each other, is widely used in other data mining and machine learning tasks, such as the naive Bayes classifier. It has not been explored enough in the anomaly detection task. Histogram-based methods such as SPAD are simple, intuitive and very fast for anomaly detection. They should serve as a simple baseline against which the performance of more complex and recent approaches is compared. However, only a few studies such as [9, 1] use them.
Even though SPAD produces results quite competitive with other efficient anomaly detectors, it is limited in detecting anomalies which rely on multiple attributes because it examines each attribute separately, assuming that they are independent of each other. In this paper, we show that this limitation can be addressed to some extent by using principal components (PCs) of the data as additional features. The idea is to double the feature size with input features and PCs, and then use SPAD in the new space. We call this new variant of SPAD using PCs SPAD+. Our empirical results show that SPAD+ significantly improves the performance of SPAD without compromising much in runtime and results in better performance than state-of-the-art methods. It runs faster than other existing fast anomaly detection methods.
Its simplicity, effectiveness and efficiency make SPAD+ an ideal anomaly detector to use in big data with hundreds of thousands to millions of data instances. We believe this simple method sets a new baseline for anomaly detection performance comparison.
PCs based on the covariance matrix only capture linear relationships between input features. They cannot capture non-linear relationships between features. Kernel-PCA, which uses a kernel matrix, can capture non-linear relationships, but it requires calculating pairwise kernel similarities of data instances. It is computationally very expensive, limiting its use to small datasets only. We look forward to investigating how non-linear relationships among different attributes in a dataset can be captured efficiently, thereby improving the current performance of the SPAD+ algorithm.
References
- [1] (2016) Revisiting attribute independence assumption in probabilistic unsupervised anomaly detection. In Proceedings of the 11th Pacific Asia Workshop on Intelligence and Security Informatics, pp. 73–86.
- [2] (2018) Anomaly detection technique robust to units and scales of measurement. In Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 589–601.
- [3] (2003) Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 29–38.
- [4] (2008) Similarity measures for categorical data: a comparative evaluation. In Proceedings of the Eighth SIAM International Conference on Data Mining, pp. 243–254.
- [5] (2000) LOF: Identifying density-based local outliers. In Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 93–104.
- [6] (2009) Anomaly detection: a survey. ACM Computing Surveys 41 (3), pp. 15:1–15:58.
- [7] (2017) UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml
- [8] (1901) LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2 (11), pp. 559–572.
- [9] (2012) Histogram-based outlier score (HBOS): a fast unsupervised anomaly detection algorithm. In Proceedings of the 35th German Conference on Artificial Intelligence, pp. 59–63.
- [10] (2005) Principal component analysis. Wiley Online Library.
- [11] (2008) Isolation forest. In Proceedings of the Eighth IEEE International Conference on Data Mining, pp. 413–422.
- [12] (2015) UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In 2015 Military Communications and Information Systems Conference (MilCIS), pp. 1–6.
- [13] (n.d.) Datasets. Kaggle. https://www.kaggle.com/datasets
- [14] (n.d.) Datasets. University of New Brunswick, Canadian Institute for Cybersecurity. https://www.unb.ca/cic/datasets/index.html
- [15] (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
- [16] (2013) Rapid distance-based outlier detection via sampling. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems, pp. 467–475.
- [17] (2006) Introduction to data mining. Addison-Wesley Longman Publishing Corporation, Boston, USA.