1 Introduction
Real-world datasets often contain anomalous data, a.k.a. outliers, which do not conform with the rest of the data. [3] formally defined an outlier as: “An observation (or a subset of observations) which appears to be inconsistent with the remainder of that set of data”. Outlier Detection (OD) is an important task in data mining that deals with detecting outliers in datasets automatically. A wide range of OD algorithms has been proposed to detect outliers in a dataset. While those algorithms are good at detecting outliers, they cannot explain why a data instance is considered an outlier, i.e., they cannot tell in which feature subset(s) the instance is significantly different from the rest of the data.
Recently, researchers have started working on the problem of Outlying Aspect Mining (OAM), where the task is to discover the feature subset(s) in which a given query significantly deviates from the rest of the data. Those feature subset(s) are called the outlying aspects of the query. It is worth noting that OAM and OD are different: the aim of the former is to find the aspects in which a given instance exhibits the most outlying characteristics, while the latter focuses on detecting all instances exhibiting outlying characteristics in the original input space.
Identifying outlying aspects of a query data object is useful in many real-world applications. For example, an insurance analyst may be interested in finding in which particular aspect(s) an insurance claim looks suspicious. Similarly, when evaluating job applications, a selection panel may want to know in which aspect(s) an applicant is extraordinary compared to other applicants.
In the literature, the task of OAM is also referred to as outlying subspace detection [14] and outlying aspect mining [6, 12]. OAM algorithms require a score to rank subspaces based on the outlying degree of the given query in each subspace. Existing OAM algorithms [14, 6, 12] use traditional distance/density-based outlier scores as the ranking measure. Because distance/density-based outlier scores depend on the dimensionality of subspaces, they cannot be compared directly to rank subspaces. [12] used Z-score normalisation to make them comparable. It requires computing the outlier scores of all instances in each subspace, which adds significant computational overhead to the already expensive density estimation, making OAM algorithms infeasible to run on large and/or high-dimensional datasets. Also, we discover an issue with Z-score normalisation that makes it inappropriate for OAM in some cases.
This paper makes the following contributions:

Identify an issue of using Z-score normalisation of density-based outlier scores to rank subspaces, and show that it has a bias towards subspaces having high variance.

Propose a new simple measure called Simple Isolation score using Nearest Neighbor Ensemble (SiNNE). It is independent of the dimensionality of subspaces and hence can be used directly to rank subspaces, without any additional normalisation.

Validate the effectiveness and efficiency of SiNNE in OAM. Our empirical results show that SiNNE can detect more interesting outlying aspects than three existing scoring measures, particularly in real-world datasets. In addition, it allows the OAM algorithm to run orders of magnitude faster than existing state-of-the-art scoring measures.
The rest of the paper is organized as follows. Section 2 provides a review of the previous work related to this paper. The limitation of Z-score normalisation in OAM is discussed in Section 3. The proposed new outlier score, SiNNE, is presented in Section 4. Empirical evaluation results are provided in Section 5, followed by our comment on Vinh et al. (2016)'s definition of dimensionality unbiasedness of a measure for OAM in Section 6. Finally, conclusions and future work are provided in Section 7. Some key symbols and notations used in this paper are provided in Table 1.
Symbol  Definition

$\mathcal{X}$  A set of $n$ data instances in an $m$-dimensional space
$x$  A data instance represented as an $m$-dimensional vector
$\mathcal{F}$  The set of $m$ input features
$\mathcal{S}$  The set of all possible subspaces (non-empty subsets) of $\mathcal{F}$
$\mathcal{D}$  A small random subsample of data, $\mathcal{D} \subset \mathcal{X}$, $|\mathcal{D}| = \psi$
$d_S(a, b)$  The Euclidean distance between $a$ and $b$ in subspace $S$
$\mathcal{N}_S^k(x)$  The set of $k$ nearest neighbours of $x$ in subspace $S$
2 Related work
To the best of our knowledge, [14] is the earliest work that defines the problem of OAM. They introduced a framework to detect outlying subspaces called High-dimensional Outlying Subspace Miner (HOS-Miner). They used a distance-based measure called ‘Outlying Degree’ (OD in short) to rank subspaces. The OD of a query $q$ in subspace $S$ is computed from the distances to its $k$ nearest neighbours:
$OD_S(q) = \sum_{x \in \mathcal{N}_S^k(q)} d_S(q, x)$    (1)
Distance is biased towards high-dimensional subspaces because distances increase as the number of dimensions increases.
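As a concrete illustration, the kNN-distance-based Outlying Degree can be sketched as follows (a minimal Python sketch; the function and parameter names are our own, not code from [14]):

```python
import numpy as np

def outlying_degree(query, data, subspace, k=10):
    """Outlying Degree sketch: total distance from `query` to its k nearest
    neighbours, measured only on the features in `subspace`.

    Higher values mean the query sits further from its neighbourhood,
    i.e., it is more outlying in that subspace.
    """
    q = np.asarray(query)[subspace]
    proj = data[:, subspace]                   # project data onto the subspace
    dists = np.linalg.norm(proj - q, axis=1)   # Euclidean distances in the subspace
    return np.sort(dists)[:k].sum()            # sum over the k nearest neighbours
```

Because the distance grows with the number of features summed, raw OD values from subspaces of different dimensionality are not directly comparable, which is exactly the bias noted above.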
Instead of using the kNN distances, Outlying Aspect Miner (OAMiner) [6] uses density based on the Kernel Density Estimator (KDE) [11] to measure the outlierness of the query in each subspace:

$\tilde{f}_S(q) = \frac{1}{n \, (2\pi)^{m'/2} \prod_{i \in S} h_i} \sum_{x \in \mathcal{X}} \exp\left(-\sum_{i \in S} \frac{(q_i - x_i)^2}{2 h_i^2}\right)$    (2)

where $\tilde{f}_S(q)$ is the kernel density estimate of $q$ in subspace $S$, $m'$ is the dimensionality of subspace $S$ ($m' = |S|$), and $h_i$ is the kernel bandwidth in dimension $i$.
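The Gaussian product-kernel estimate in a subspace can be sketched as below (illustrative Python; the names are our own, and the per-feature bandwidths are assumed to be given):

```python
import numpy as np

def kde_density(query, data, subspace, bandwidths):
    """Gaussian product-kernel density of `query` in `subspace`.

    `bandwidths[i]` is the kernel bandwidth h_i of feature i; only the
    features in `subspace` contribute to the estimate.
    """
    q = np.asarray(query)[subspace]
    X = data[:, subspace]
    h = np.asarray(bandwidths)[subspace]
    n, m_sub = X.shape
    z = (X - q) / h                               # per-feature standardised offsets
    kernel = np.exp(-0.5 * (z ** 2).sum(axis=1))  # product of 1-D Gaussian kernels
    norm = n * (2 * np.pi) ** (m_sub / 2) * np.prod(h)
    return kernel.sum() / norm
```

Note how the normalising constant shrinks the density as the subspace grows: raw values from subspaces of different dimensionality are therefore not directly comparable, which is the dimensionality bias discussed below.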
In [6], the authors reported that density is a biased measure because density decreases dramatically as the number of dimensions increases; densities of a query point in subspaces of different dimensionality cannot be compared directly. Therefore, to eliminate the effect of dimensionality, they proposed to use the density rank as the outlying measure. They used the same OAMiner algorithm, replacing the kernel density value with its rank. OAMiner searches all possible combinations of subspaces systematically by traversing them in a depth-first manner [10].
Recently, [12] discussed the issue of using density rank as an outlier score in OAM and provided examples where it can be counterproductive. Rather than using the density rank, they proposed to use the Z-score normalised density to make scores in subspaces of varying dimensionality comparable:
$Z(\tilde{f}_S(q)) = \frac{\tilde{f}_S(q) - \mu_{\tilde{f}_S}}{\sigma_{\tilde{f}_S}}$    (3)

where $\mu_{\tilde{f}_S}$ and $\sigma_{\tilde{f}_S}$ are the mean and standard deviation of the densities of all data instances in subspace $S$, respectively. They proposed a beam search strategy to search for subspaces. It searches breadth-first [10] up to a fixed maximum dimensionality, keeping only a fixed number of subspaces, called the beam width, at each level of the search.
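The Z-score normalised ranking can be sketched as follows (illustrative Python; `score_fn` stands in for any density estimator, and all names are our own). The sketch makes the key cost explicit: the score of every instance must be computed in every candidate subspace just to normalise the query's single score.

```python
import numpy as np

def rank_subspaces_zscore(query, data, subspaces, score_fn):
    """Rank candidate subspaces by the query's Z-score normalised score.

    `score_fn(x, data, S)` returns a density-like score of instance x in
    subspace S; lower (more negative) Z-scores mean more outlying.
    """
    ranked = []
    for S in subspaces:
        # normalisation needs the scores of ALL instances in this subspace
        scores = np.array([score_fn(x, data, S) for x in data])
        z = (score_fn(query, data, S) - scores.mean()) / scores.std()
        ranked.append((z, S))
    ranked.sort(key=lambda pair: pair[0])   # most outlying subspace first
    return ranked

# Toy density proxy: negative distance to the subspace mean.
def proxy_density(x, data, S):
    return -np.linalg.norm(np.asarray(x)[S] - data[:, S].mean(axis=0))
```

For example, `rank_subspaces_zscore(q, data, [[0], [1]], proxy_density)` puts first the subspace in which `q` deviates most from the rest of the data.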
In recent work, [13] proposed the sGrid density estimator, which is a smoothed variant of the traditional grid-based estimator (a.k.a. histogram). The authors replaced the kernel density estimator with sGrid in the Beam search OAM proposed by [12]. They also used Z-score normalisation to make the density values of a query point comparable across subspaces of varying dimensionality. Because sGrid density can be computed faster than KDE, it allows Beam search OAM to run orders of magnitude faster.
Both density rank and Z-score normalisation require computing the outlier scores of all instances in the given dataset in each subspace just to compute the score of the given query. This adds significant computational overhead, making the existing OAM algorithms infeasible to run on large and/or high-dimensional datasets. [12] discussed the issue of using density rank and proposed to use Z-score normalised density instead. In the next section, we discuss an issue of using Z-score normalised density for OAM that makes it counterproductive under some data conditions.
3 Issue of using Z-score normalised density
Because Z-score normalisation uses the mean and standard deviation of the density values of all data instances in a subspace ($\mu_{\tilde{f}_S}$ and $\sigma_{\tilde{f}_S}$), it can be biased towards subspaces having high variation of density values (i.e., high $\sigma_{\tilde{f}_S}$).
Let’s take a simple example to demonstrate this. Let $S_1$ and $S_2$, $S_1 \neq S_2$, be two different subspaces of the same dimensionality (i.e., $|S_1| = |S_2|$). Intuitively, because they have the same dimensionality, they can be ranked based on the raw (unnormalised) density values of a query $q$. Assuming $\mu_{\tilde{f}_{S_1}} = \mu_{\tilde{f}_{S_2}}$, we can have $Z(\tilde{f}_{S_1}(q)) > Z(\tilde{f}_{S_2}(q))$ even though $\tilde{f}_{S_1}(q) < \tilde{f}_{S_2}(q)$ if $\sigma_{\tilde{f}_{S_1}} > \sigma_{\tilde{f}_{S_2}}$ (i.e., $S_2$ is ranked higher than $S_1$ based on Z-score normalised density just because of the higher $\sigma_{\tilde{f}_{S_1}}$).
To show this effect on a real-world dataset, consider the Pendigits dataset ($n$ = 9868, $m$ = 16). Fig. 1 shows the distribution of data in two three-dimensional subspaces $S_1$ and $S_2$. Visually, the query represented by the red square appears to be more of an outlier in $S_1$ than in $S_2$. This is consistent with its raw density values in the two subspaces, $\tilde{f}_{S_1}(q) < \tilde{f}_{S_2}(q)$. However, the ranking is reversed after Z-score normalisation, $Z(\tilde{f}_{S_1}(q)) > Z(\tilde{f}_{S_2}(q))$. This is due to the higher $\sigma_{\tilde{f}_{S_1}}$.
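The effect can be reproduced with toy numbers (hypothetical density values of our own, not the actual Pendigits densities): two subspaces share the same mean density, the query's raw density is lower in the first, yet the first subspace's larger standard deviation shrinks its Z-score and the ranking flips.

```python
# Hypothetical density statistics in two subspaces of equal dimensionality.
mu = 0.5                   # same mean density in both subspaces
q1, sigma1 = 0.05, 0.30    # S1: query density is LOWER (more outlying), spread is wide
q2, sigma2 = 0.20, 0.05    # S2: query density is higher, spread is narrow

z1 = (q1 - mu) / sigma1    # = -1.5
z2 = (q2 - mu) / sigma2    # = -6.0

# Raw densities rank S1 as the more outlying aspect (q1 < q2), but the
# Z-scores rank S2 first (z2 < z1) purely because sigma1 is larger.
assert q1 < q2 and z2 < z1
```

The ranking thus depends on the spread of densities in each subspace, not just on how unusual the query itself is.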
From the above example, we can see that Z-score normalisation has a bias towards subspaces having high variance. To overcome this weakness, we propose in the next section a new scoring measure which has no such bias in its raw form and does not require any normalisation.
4 The proposed new efficient score
Density-based scores have two limitations in OAM: (i) they are dimensionality-biased, requiring normalisation for OAM; and (ii) they are expensive to compute in each subspace. Motivated by these limitations, we introduce a new measure for OAM which is dimensionality-unbiased in its raw form and can be computed efficiently.
Motivated by the isolation using Nearest Neighbor Ensembles (iNNE) method for anomaly detection [1, 2], we propose to use an ensemble of $t$ models, where each model $M_i$ ($i = 1, \dots, t$) is constructed from a small random subsample of the data, $\mathcal{D}_i \subset \mathcal{X}$, $|\mathcal{D}_i| = \psi$. Each model defines the normal region as the area covered by a set of hyperspheres centred at each $c \in \mathcal{D}_i$, where the radius of each hypersphere is the Euclidean distance from $c$ to its nearest neighbour in $\mathcal{D}_i$. The rest of the space, outside the hyperspheres, is treated as the anomaly region. An example of constructing $M_i$ from $\mathcal{D}_i$ in a two-dimensional space is shown in Fig. 2. In $M_i$, a query $q$ is considered a normal instance if it falls in the normal region (in at least one hypersphere); otherwise it is considered an anomaly. The anomaly score of $q$ in $M_i$ is 1 if it falls outside all hyperspheres and 0 otherwise.
Using an ensemble of $t$ models, the final anomaly score of $q$ is defined as the fraction of models in which $q$ falls in the anomaly region:

$s(q) = \frac{1}{t} \sum_{i=1}^{t} s_i(q)$

where $s_i(q)$ is the anomaly score of $q$ in model $M_i$.
It is interesting to note that iNNE uses a different definition of the score, based on the radii of the hyperspheres centred at the nearest neighbour of $q$ and their nearest neighbours in the subsample. Our definition is a lot simpler and more intuitive: anomalies are expected to fall in the anomaly regions of more models than normal instances do. Because it is a simpler version of iNNE, we call the proposed measure SiNNE, where ‘S’ stands for “Simple”.
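A minimal sketch of SiNNE as just described (illustrative Python; the class and parameter names are ours, and this is not the authors' Java implementation):

```python
import numpy as np

class SiNNE:
    """t models; each model is psi hyperspheres built on a random subsample.
    A query scores 1 in a model if it falls outside every hypersphere
    (the anomaly region), 0 otherwise; the final score averages over models."""

    def __init__(self, t=100, psi=8, seed=None):
        self.t, self.psi = t, psi
        self.rng = np.random.default_rng(seed)
        self.models = []          # list of (centres, radii) pairs

    def fit(self, data):
        n = len(data)
        for _ in range(self.t):
            idx = self.rng.choice(n, size=self.psi, replace=False)
            centres = data[idx]
            # radius of each hypersphere = distance from its centre to the
            # centre's nearest neighbour within the same subsample
            pd = np.linalg.norm(centres[:, None, :] - centres[None, :, :], axis=2)
            np.fill_diagonal(pd, np.inf)
            self.models.append((centres, pd.min(axis=1)))
        return self

    def score(self, query):
        q = np.asarray(query)
        outside = sum(
            np.all(np.linalg.norm(centres - q, axis=1) > radii)
            for centres, radii in self.models
        )
        return outside / self.t   # fraction of models that isolate the query
```

For example, `SiNNE(t=100, psi=8, seed=0).fit(X).score(q)` returns a value in [0, 1]; since the score is a fraction of models rather than a distance or density, values obtained in subspaces of different dimensionality can be compared directly.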
The area covered by each hypersphere decreases as the dimensionality of the space increases, and so does the actual data space covered by normal instances. Therefore, SiNNE is independent of the dimensionality of the space in its raw form, without any normalisation, making it ideal for OAM. It adapts to the local data density because the sizes of the hyperspheres depend on the local density. It can be computed a lot faster than the kNN distance or density. Also, it does not require computing the outlier scores of all instances in each subspace (which the existing scores require for Z-score normalisation), giving it a significant advantage in terms of runtime.
4.1 Time Complexity
SiNNE is a two-stage process: (i) a training stage and (ii) an evaluation stage. In the training stage, a nearest neighbour search within each small subsample is required to build the hyperspheres; this is done $t$ times. The time complexity of the training stage is $O(t \psi^2 m')$ (where $m'$ is the dimensionality of the subspace). In the evaluation stage, the distance between $q$ and each subsample instance needs to be computed to check whether $q$ falls in the normal region; this has to be done in all $t$ models. The time complexity of the evaluation stage is $O(t \psi m')$. The computational costs of SiNNE and the existing scoring measures are presented in Table 2.
Scoring Measure  Time Complexity 

SiNNE  
Density  
Density Rank  
Density Score  
sGrid Score 
5 Experiments
In this section, we present the results of our empirical evaluation of the proposed measure SiNNE against the state-of-the-art OAM measures of kernel density rank, Z-score normalised kernel density, and Z-score normalised sGrid density, using both synthetic and real-world datasets, in terms of effectiveness and efficiency.
Implementation.
All measures and the experimental setup were implemented in Java using the WEKA platform [7]. We made the required changes to the Java implementation of iNNE provided by the authors to implement SiNNE. We implemented the density rank and Z-score normalised density measures based on the KDE implementation available in WEKA [7]. We used the Java implementation of sGrid made available by the authors [13].
Parameters.
We used the default parameters as suggested in the respective papers unless specified otherwise. For SiNNE, we set $\psi$ = 8 and $t$ = 100. The density rank and Z-score normalised density measures employ the Kernel Density Estimator (KDE) with the Gaussian kernel and the default bandwidth suggested by [8]. The block size parameter for the bit set operation in sGrid was set to the default of 64 as suggested by [13]. The beam width and maximum dimensionality of subspaces in the beam search procedure were set to 100 and 3, respectively, as done in [12] and [13].
Data sets.
We used both synthetic and real-world datasets to ascertain the efficiency and effectiveness of the contending scoring measures. The real-world datasets are from the ELKI outlier data repository [5] (https://elki-project.github.io/datasets/outlier/). The synthetic datasets are from [9] (https://www.ipd.kit.edu/~muellere/HiCS/). The characteristics of the datasets in terms of data size and the dimensionality of the original input space are provided in Table 3. All datasets were normalised using min-max normalisation to ensure all attributes are in the range [0, 1] in all experiments.
Data Set  Data Size ($n$)  Dimensionality ($m$)

Synthetic dataset  1000  10 
Pendigits  6870  16 
Shuttle  49097  9 
ALOI  50000  33 
KDDCup99  60632  38 
All experiments were conducted on a machine with an AMD 16-core CPU and 64GB main memory, running Ubuntu 18.04. Each job was allowed to run for up to 10 days; incomplete jobs were killed and marked as such.
5.1 Evaluation I: Quality of discovered subspaces
In this subsection, we focus on the quality of the discovered subspaces. We discuss results on synthetic and real-world datasets separately.
5.1.1 Performance on synthetic datasets
Keller et al. (2012) [9] provided several synthetic datasets, which have been used in previous studies [6, 12, 13]. Each dataset has a fixed number of outliers for which the outlying subspaces (ground truth) are known. The top-ranked subspace found by each measure was compared with the ground truth. We used the 10-dimensional synthetic dataset provided by [9], which has 19 outliers. We passed the outliers one at a time as queries and performed Beam search OAM using the different scoring measures. Table 4 shows the subspaces found by SiNNE and the three contending measures, together with the ground truths, for all queries.
In terms of exact matches, SiNNE is the best performing measure: it detected the ground truth as the top outlying aspect for every query. Two of the contending measures produced exact matches for 18 queries, while the worst performing measure produced exact matches for only five queries.
id  Ground Truth  SiNNE  Density Rank  Density Z-score  sGrid Z-score

172  {8, 9}  {8, 9}  {1, 8, 9}  {8, 9}  {8, 9} 
183  {0, 1}  {0, 1}  {0, 1}  {0, 1}  {0, 1} 
184  {6, 7}  {6, 7}  {4, 6, 7}  {6, 7}  {6, 7} 
207  {0, 1}  {0, 1}  {0, 1, 7}  {0, 1}  {0, 1} 
220  {2, 3, 4, 5}  {2, 3, 4, 5}  {2, 3, 4, 5, 7}  {2, 3, 4, 5}  {2, 3, 4, 5} 
245  {2, 3, 4, 5}  {2, 3, 4, 5}  {2, 3, 4, 5}  {2, 3, 4, 5}  {3, 4, 5} 
315  {0, 1}  {0, 1}  {0, 1, 9}  {0, 1}  {0, 1} 
315  {6, 7}  {6, 7}  {0, 6, 7}  {6, 7}  {6, 7}
323  {8, 9}  {8, 9}  {2, 8, 9}  {8, 9}  {8, 9} 
477  {0, 1}  {0, 1}  {0, 1, 2}  {0, 1}  {0, 1} 
510  {0, 1}  {0, 1}  {0, 1, 5}  {0, 1}  {0, 1} 
577  {2, 3, 4, 5}  {2, 3, 4, 5}  {0, 3, 7}  {6, 7}  {2, 3, 4, 5} 
654  {2, 3, 4, 5}  {2, 3, 4, 5}  {1, 2, 3, 4, 5}  {2, 3, 4, 5}  {2, 3, 4, 5} 
704  {8, 9}  {8, 9}  {0, 8, 9}  {8, 9}  {8, 9} 
723  {2, 3, 4, 5}  {2, 3, 4, 5}  {0, 2, 3, 4, 5}  {2, 3, 4, 5}  {2, 3, 4, 5} 
754  {6, 7}  {6, 7}  {6, 7}  {6, 7}  {6, 7} 
765  {6, 7}  {6, 7}  {1, 6, 7}  {6, 7}  {6, 7} 
781  {6, 7}  {6, 7}  {6, 7}  {6, 7}  {6, 7} 
824  {8, 9}  {8, 9}  {6, 8, 9}  {8, 9}  {8, 9} 
975  {8, 9}  {8, 9}  {8, 9}  {8, 9}  {8, 9} 

Query 315 has two outlying subspaces.
5.1.2 Performance on realworld data sets
It is worth noting that we do not have ground truth for the real-world datasets to verify the quality of the discovered subspaces, and there is no accepted quality measure for discovered subspaces. Thus, we compare the results of the contending measures visually, where the dimensionality of the discovered subspaces is up to 3. We used the state-of-the-art outlier detector LOF [4] (using the implementation available in WEKA [7] with the neighbourhood size set to 50) to find the top three outliers and used them as queries.
Tables 5-8 show the subspaces discovered by each scoring measure for the Pendigits, Shuttle, ALOI, and KDDCup99 datasets, respectively. Note that all one-dimensional subspaces are plotted as histograms with the number of bins set to 10. Visually, we can confirm that SiNNE identified better or at least similar outlying subspaces compared to the existing OAM measures.
[Tables 5-8: plots of the subspaces discovered by each measure for the query ids 293, 1086, 4539 (Pendigits); 35368, 38116, 44445 (Shuttle); 407, 408, 1156 (ALOI); and 43883, 44812, 46673 (KDDCup99). Plots omitted.]
5.2 Evaluation II: Efficiency
The average runtimes of the contending measures for randomly chosen queries on the four real-world datasets are provided in Table 9. SiNNE and the sGrid-based measure were able to finish on all four datasets; the two KDE-based measures were unable to complete within ten days on the two largest datasets, ALOI and KDDCup99. These results show that SiNNE enables the existing OAM approach (i.e., Beam search) to run orders of magnitude faster on large datasets. SiNNE was at least four orders of magnitude faster than the KDE-based measures where they could run within 10 days, and an order of magnitude faster than the sGrid-based measure on the two largest datasets.
Dataset  SiNNE  Density Rank  Density Z-score  sGrid Z-score

Pendigits  1  10536  12450  9
Shuttle  1  124781  125225  34
ALOI  25  ✗  ✗  365
KDDCup99  33  ✗  ✗  524

Expected to take more than 10 days.
We also conducted scale-up tests of the contending measures w.r.t. (i) increasing data size ($n$) and (ii) increasing dimensionality ($m$), using synthetic datasets. We generated synthetic datasets with different $n$ and $m$, where the data distribution is a mixture of five equal-sized Gaussians with random means and unit variance in each dimension. The datasets were normalised to the range [0, 1]. For each dataset, we randomly chose five points as queries and report the average runtime.
5.2.1 Scaleup test with the increase in data size
The first scale-up test, with increasing data size, was conducted on a fixed-dimensionality dataset, where the data sizes were varied in the range of k, 500k, 1m, 5m, and 10m. The runtimes are presented in Fig. 3(a). Both the dataset size and runtime are plotted on a logarithmic scale. Again, all jobs were run for up to 10 days, and incomplete jobs were killed. SiNNE was the only measure to complete the task on the dataset containing 10m instances. The two KDE-based measures could complete within 10 days only on datasets having up to k instances, whereas the sGrid-based measure could complete on the dataset with 5m instances but not on the one with 10m instances. The result confirms that SiNNE runs at least two orders of magnitude faster than the existing state-of-the-art measures.
The runtime of SiNNE on the dataset with 10m instances was 44 seconds, whereas the KDE-based measures were projected to take more than 30 days and the sGrid-based measure more than 15 days.
5.2.2 Scaleup test with the increase in dimensionality
In this scale-up test, we examined the efficiency of the scoring measures w.r.t. the number of dimensions ($m$). A wide range of $m$ values, {2, 5, 10, 50, 100, 200, 300, 500, 750, 1000}, was used with a fixed data size. Figure 3(b) shows the average runtimes of the contending measures. Note that the runtime is plotted on a logarithmic scale. Again, all jobs were run for up to 10 days, and incomplete jobs were killed. SiNNE was the only measure to complete the task on datasets with up to 1000 dimensions. The sGrid-based measure could not complete at the highest dimensionality, while the KDE-based measures could complete only up to 5 dimensions.
The runtimes for the 1000-dimensional dataset were as follows: SiNNE took 1 hr 8 min, the two KDE-based measures were each projected to take more than 100 days, and the sGrid-based measure was projected to take more than 10 days.
6 A comment on the definition of dimensionality unbiasedness
Duan et al. (2015) discussed the need for a dimensionality-unbiased score in the OAM problem [6] and suggested using ranks of densities instead of raw densities. Vinh et al. (2016) provided a formal definition of dimensionality unbiasedness:
Definition 1 (Dimensionality unbiasedness [12])
A dimensionality-unbiased outlyingness measure $\rho$ is a measure whose baseline value, i.e., its average value for any data sample $\mathcal{X}$ drawn from a uniform distribution, is a quantity independent of the dimensionality of the subspace $S$, i.e., $E_{x \in \mathcal{X}}[\rho_S(x)] = \text{const. w.r.t. } |S|$.
Both density rank and Z-score normalisation of density satisfy the above condition. Vinh et al. (2016) highlighted that density rank may not be appropriate in OAM and suggested using Z-score normalisation of density values [12]. However, we discovered that Z-score normalisation of density may also be inappropriate in some cases (discussed in Section 3).
Because the expectation in Definition 1 is over the scores of all data instances, derivatives of density such as density rank, Z-score normalisation, or even simple mean normalisation satisfy the condition. Though they make the scores of subspaces with different dimensionality comparable, our results show that they are biased under some data conditions.
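This point can be checked empirically (illustrative Python with an average-kNN-distance score standing in for density; all names are our own): on uniform data, the raw score's baseline shifts with dimension, while the Z-scored baseline is exactly 0 in every dimension, so Definition 1 is satisfied trivially by any centred derivative of the score.

```python
import numpy as np

def knn_scores(data, k=5):
    """Average k-NN distance of every instance (a distance-based score)."""
    d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return np.sort(d, axis=1)[:, :k].mean(axis=1)

rng = np.random.default_rng(0)
baselines = {}
for m in (2, 5, 10):
    data = rng.uniform(size=(200, m))          # uniform sample, as in Definition 1
    scores = knn_scores(data)
    z = (scores - scores.mean()) / scores.std()
    baselines[m] = (scores.mean(), z.mean())   # raw baseline drifts; Z baseline is ~0
```

The raw baseline grows with $m$, whereas the Z-scored baseline is 0 by construction in every dimension, regardless of any bias the normalised score may still carry.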
We argue that the given definition of dimensionality unbiasedness is not sufficient for outlying aspect mining. A better definition is required; this remains an open question.
7 Conclusions and Future work
In this study, we identified an issue with using Z-score normalisation of density to rank subspaces for OAM. Moreover, Z-score normalisation requires computing the densities of all instances in every subspace, making an OAM algorithm impossible to run on datasets with large sizes or dimensionalities. We introduced an efficient and effective scoring measure for OAM called Simple Isolation score using Nearest Neighbor Ensemble (SiNNE). SiNNE uses an isolation-based mechanism to compute the outlierness of the query in each subspace, and it is dimensionality-unbiased. Therefore, SiNNE does not require any normalisation to compare the scores of subspaces with different dimensionalities: its raw scores can be compared directly. As a result, it runs significantly faster than the existing measures because it does not require computing the scores of all instances, as rank and Z-score normalisation do. By replacing the existing scoring measures with the proposed one, the existing OAM algorithm can now easily run on datasets with millions of instances and thousands of dimensions. Our results show that SiNNE identifies more convincing outlying subspaces for queries than the existing measures.
Our future work aims to investigate the theoretical properties of SiNNE and a better definition of dimensionality unbiasedness in the context of OAM.
Acknowledgments
This work is supported by Federation University Research Priority Area (RPA) scholarship, awarded to Durgesh Samariya.
References
 [1] (201412) Efficient anomaly detection by isolation using nearest neighbour ensemble. In 2014 IEEE International Conference on Data Mining Workshop, Vol. , pp. 698–705. External Links: Document, ISSN 23759232 Cited by: §4.
 [2] (2017) Isolationbased anomaly detection using nearestneighbor ensembles. Computational Intelligence, pp. 1–31. External Links: Document, Link, https://onlinelibrary.wiley.com/doi/pdf/10.1111/coin.12156 Cited by: §4.
 [3] (1984) Outliers in statistical data. 3rd edition, John Wiley and Sons, New York. External Links: ISBN 0471930946 Cited by: §1.
 [4] (2000) LOF: identifying densitybased local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD ’00, New York, NY, USA, pp. 93–104. External Links: ISBN 1581132174, Link, Document Cited by: §5.1.2.
 [5] (20160701) On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Mining and Knowledge Discovery 30 (4), pp. 891–927. External Links: ISSN 1573756X, Document, Link Cited by: §5.
 [6] (20150901) Mining outlying aspects on numeric data. Data Mining and Knowledge Discovery 29 (5), pp. 1116–1151. External Links: ISSN 1573756X, Document, Link Cited by: §1, §2, §2, §5.1.1, §6.
 [7] (200911) The weka data mining software: an update. SIGKDD Explor. Newsl. 11 (1), pp. 10–18. External Links: ISSN 19310145, Link, Document Cited by: §5, footnote 3.
 [8] (2012) Smoothing techniques: with implementation in s. Springer Science & Business Media, New York. Cited by: §5.
 [9] (2012) HiCS: high contrast subspaces for densitybased outlier ranking. In Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, ICDE ’12, Washington, DC, USA, pp. 1037–1048. External Links: ISBN 9780769547473, Link, Document Cited by: §5, §5.1.1.
 [10] (2009) Artificial intelligence: a modern approach. 3rd edition, Prentice Hall Press, Upper Saddle River, NJ, USA. External Links: ISBN 0136042597, 9780136042594 Cited by: §2, §2.
 [11] (1986) Density estimation for statistics and data analysis. Chapman & Hall, London. Cited by: §2.
 [12] (20161101) Discovering outlying aspects in large datasets. Data Mining and Knowledge Discovery 30 (6), pp. 1520–1555. External Links: ISSN 1573756X, Document, Link Cited by: §1, §2, §2, §2, §5, §5, §5.1.1, §6, Definition 1.
 [13] (2019) A new simple and efficient density estimator that enables fast systematic search. Pattern Recognition Letters 122, pp. 92 – 98. External Links: ISSN 01678655, Document, Link Cited by: §2, §5, §5, §5, §5.1.1.

 [14] (2004) HOS-Miner: a system for detecting outlying subspaces of high-dimensional data. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, VLDB ’04, pp. 1265–1268. External Links: ISBN 0120884690, Link Cited by: §1, §2.