A new effective and efficient measure for outlying aspect mining

04/28/2020 ∙ by Durgesh Samariya, et al. ∙ Nanjing University, Deakin University, Federation University Australia

Outlying Aspect Mining (OAM) aims to find the subspaces (a.k.a. aspects) in which a given query is an outlier with respect to a given dataset. Existing OAM algorithms use traditional distance/density-based outlier scores to rank subspaces. Because these distance/density-based scores depend on the dimensionality of subspaces, they cannot be compared directly between subspaces of different dimensionality. Z-score normalisation has been used to make them comparable, but it requires computing the outlier scores of all instances in each subspace. This adds significant computational overhead on top of already expensive density estimation, making OAM algorithms infeasible to run on large and/or high-dimensional datasets. We also discover that Z-score normalisation is inappropriate for OAM in some cases. In this paper, we introduce a new score called SiNNE, which is independent of the dimensionality of subspaces. This enables scores in subspaces with different dimensionalities to be compared directly, without any additional normalisation. Our experimental results reveal that SiNNE produces better or at least the same results as existing scores, and that it significantly improves the runtime of an existing OAM algorithm based on beam search.


1 Introduction

Real-world datasets often have some anomalous data, a.k.a. outliers, which do not conform with the rest of the data. Barnett and Lewis [3] formally defined an outlier as: “An observation (or a subset of observations) which appears to be inconsistent with the remainder of that set of data”. Outlier Detection (OD) is an important task in data mining that deals with detecting outliers in datasets automatically. A wide range of OD algorithms have been proposed to detect outliers in a dataset. While these algorithms are good at detecting outliers, they cannot explain why a data instance is considered an outlier, i.e., they cannot tell in which feature subset(s) the data instance is significantly different from the rest of the data.

Recently, researchers have started working on the problem of Outlying Aspect Mining (OAM), where the task is to discover the feature subset(s) in which a given query significantly deviates from the rest of the data. These feature subset(s) are called the outlying aspects of the query. It is worth noting that OAM and OD are different: the main aim of the former is to find the aspects in which a given instance exhibits the most outlying characteristics, while the latter focuses on detecting all instances exhibiting outlying characteristics in the given original input space.

Identifying outlying aspects for a query data object is useful in many real-world applications. For example, an insurance analyst may be interested in finding out in which particular aspect(s) an insurance claim looks suspicious. Similarly, when evaluating job applications, a selection panel may want to know in which aspect(s) an applicant is extraordinary compared to the other applicants.

In the literature, the task of OAM is also referred to as outlying subspace detection [14] and outlying aspect mining [6, 12]. OAM algorithms require a score to rank subspaces based on the outlying degree of the given query in each subspace. Existing OAM algorithms [14, 6, 12] use traditional distance/density-based outlier scores as the ranking measure. Because distance/density-based outlier scores depend on the dimensionality of subspaces, they cannot be compared directly to rank subspaces. [12] used Z-score normalisation to make them comparable, which requires computing the outlier scores of all instances in each subspace. This adds significant computational overhead on top of already expensive density estimation, making OAM algorithms infeasible to run on large and/or high-dimensional datasets. Also, we discover an issue with Z-score normalisation that makes it inappropriate for OAM in some cases.

This paper makes the following contributions:

  1. We identify an issue with using Z-score normalisation of density-based outlier scores to rank subspaces, and show that it has a bias towards subspaces having high variance of density values.

  2. We propose a new simple measure called Simple Isolation score using Nearest Neighbor Ensemble (SiNNE). It is independent of the dimensionality of subspaces and hence can be used directly to rank subspaces, without any additional normalisation.

  3. We validate the effectiveness and efficiency of SiNNE in OAM. Our empirical results show that SiNNE detects more interesting outlying aspects than three existing scoring measures, particularly in real-world datasets. In addition, it allows the OAM algorithm to run orders of magnitude faster than with existing state-of-the-art scoring measures.

The rest of the paper is organized as follows. Section 2 provides a review of previous work related to this paper. The limitation of Z-score normalisation in OAM is discussed in Section 3. The proposed new outlier score, SiNNE, is presented in Section 4. Empirical evaluation results are provided in Section 5, followed by our comment on Vinh et al. (2016)’s definition of the dimensionality unbiasedness of a measure for OAM in Section 6. Finally, conclusions and future work are provided in Section 7. Key symbols and notations used in this paper are provided in Table 1.

Symbol Definition
D  A set of n data instances in a d-dimensional space
x  A data instance, represented as a d-dimensional vector
F  The set of input features, i.e., F = {F_1, F_2, …, F_d}
S  A subspace (non-empty subset) of F; the set of all possible subspaces of F is denoted 𝒮
𝒟  A small subsample of data, 𝒟 ⊂ D, |𝒟| = ψ
dist_S(x, y)  The Euclidean distance between x and y in subspace S
N_k(q, S)  The set of k-nearest neighbours of q in subspace S
Table 1: Key symbols and notations used.

2 Related work

To the best of our knowledge, [14] is the earliest work that defines the problem of OAM. They introduced a framework to detect outlying subspaces called High-dimensional Outlying Subspace Miner (HOS-Miner). They used a distance-based measure called the ‘Outlying Degree’ (OD in short) to rank subspaces. The OD of a query q in subspace S is computed as:

OD_S(q) = \sum_{x \in N_k(q, S)} dist_S(q, x)    (1)

This distance-based score is biased towards high-dimensional subspaces, because distances increase as the number of dimensions increases.
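As a concrete illustration of the OD score in Eq. (1), here is a minimal sketch; the function and variable names are our own, not HOS-Miner's:

```python
import math

def od_score(query, data, subspace, k=3):
    """Outlying Degree: the sum of distances from `query` to its k nearest
    neighbours in `data`, measured only on the features in `subspace`."""
    def dist(a, b):
        return math.sqrt(sum((a[i] - b[i]) ** 2 for i in subspace))
    # Sort all distances and sum the k smallest (the k-NN distances).
    return sum(sorted(dist(query, x) for x in data)[:k])
```

A query far from the data in one attribute gets a large OD in subspaces containing that attribute; note also that the score can only grow as attributes are added, which is the dimensionality bias discussed above.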

Instead of using the k-NN distances, Outlying Aspect Miner (OAMiner) [6] uses density, based on the Kernel Density Estimator (KDE) [11], to measure the outlierness of the query in each subspace:

f_S(q) = \frac{1}{n(2\pi)^{m/2}\prod_{i \in S} h_i} \sum_{x \in D} \exp\left(-\sum_{i \in S} \frac{(q_i - x_i)^2}{2h_i^2}\right)    (2)

where f_S(q) is the kernel density estimate of q in subspace S, m is the dimensionality of subspace S (m = |S|), and h_i is the kernel bandwidth in dimension i.
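For illustration, the KDE score can be evaluated directly with a Gaussian product kernel; this is a sketch with our own function names, not OAMiner's implementation:

```python
import math

def kde(query, data, subspace, bandwidths):
    """Gaussian product-kernel density estimate of `query` in `subspace`;
    `bandwidths[i]` plays the role of the per-dimension bandwidth h_i."""
    n, m = len(data), len(subspace)
    norm = n * (2 * math.pi) ** (m / 2) * math.prod(bandwidths[i] for i in subspace)
    total = sum(
        math.exp(-sum((query[i] - x[i]) ** 2 / (2 * bandwidths[i] ** 2)
                      for i in subspace))
        for x in data
    )
    return total / norm
```

Evaluating the density of a single query costs O(nm); the shrinkage of density values with growing m is what makes raw densities incomparable across dimensionalities.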

In [6], the authors reported that density is a biased measure because density decreases dramatically as the number of dimensions increases; densities of a query point in subspaces with different dimensionalities cannot be compared directly. Therefore, to eliminate the effect of dimensionality, they proposed to use the density rank as the outlying measure. They used the same OAMiner algorithm, replacing the kernel density value with its rank. OAMiner searches all possible subspaces systematically by traversing them in a depth-first manner [10].

Recently, [12] discussed the issue of using the density rank as an outlier score in OAM and provided some examples where it can be counter-productive. Rather than using the density rank, they proposed to use the Z-score normalised density to make scores in subspaces with varying dimensionality comparable:

Z(f_S(q)) = \frac{f_S(q) - \mu_{f_S}}{\sigma_{f_S}}    (3)

where \mu_{f_S} and \sigma_{f_S} are the mean and standard deviation of the densities of all data instances in subspace S, respectively.

They proposed a beam search strategy to search for subspaces. It searches breadth-first [10], keeping only a fixed number of top-scoring subspaces, called the beam width, at each level of the search, up to a fixed maximum dimensionality.
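The beam search strategy can be sketched as follows, assuming a `score(query, data, subspace)` function whose values are comparable across dimensionalities; the helper names and tie-breaking details are ours, not those of the original implementation:

```python
def beam_search(query, data, n_features, score, beam_width=100, max_dim=3):
    """Level-wise beam search over subspaces: keep the `beam_width`
    best-scoring subspaces at each level and extend each of them by one
    unused attribute, up to `max_dim` dimensions."""
    level = [(i,) for i in range(n_features)]
    best = max(level, key=lambda s: score(query, data, s))
    for _ in range(max_dim - 1):
        # Prune to the beam, then extend each survivor by one attribute.
        level = sorted(level, key=lambda s: score(query, data, s),
                       reverse=True)[:beam_width]
        level = [s + (j,) for s in level for j in range(n_features) if j not in s]
        # Deduplicate subspaces that differ only in attribute order.
        level = list({tuple(sorted(s)) for s in level})
        cand = max(level, key=lambda s: score(query, data, s))
        if score(query, data, cand) > score(query, data, best):
            best = cand
    return best
```

At each level the candidate pool is capped at `beam_width` subspaces, so the number of score evaluations grows linearly rather than exponentially with the maximum dimensionality.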

In recent work, [13] proposed the sGrid density estimator, a smoothed variant of the traditional grid-based estimator (a.k.a. histogram). The authors replaced the kernel density estimator with sGrid in the Beam search OAM proposed by [12]. They also used Z-score normalisation to make the density values of a query point in subspaces with varying dimensionality comparable. Because sGrid density can be computed faster than KDE, it allows Beam search OAM to run orders of magnitude faster.

Both the density rank and Z-score normalisation require computing the outlier scores of all instances in the given dataset in each subspace in order to compute the score of the given query. This adds significant computational overhead, making the existing OAM algorithms infeasible to run on large and/or high-dimensional datasets. [12] discussed the issue of using the density rank and proposed to use the Z-score normalised density instead. In the next section, we discuss an issue of using the Z-score normalised density for OAM that makes it counter-productive under some data conditions.

3 Issue of using Z-score normalised density

Because Z-score normalisation uses the mean and standard deviation of the density values of all data instances in a subspace (\mu_{f_S} and \sigma_{f_S}), it can be biased towards a subspace having a high variation of density values (i.e., a high \sigma_{f_S}).

Let’s take a simple example to demonstrate this. Let S1 and S2, S1 ≠ S2, be two different subspaces of the same dimensionality (i.e., |S1| = |S2|). Intuitively, because they have the same dimensionality, they can be ranked based on the raw (unnormalised) density values of a query q. Assuming \mu_{f_{S1}} = \mu_{f_{S2}} (and noting that f_S(q) < \mu_{f_S} for an outlying query), we can have Z(f_{S1}(q)) > Z(f_{S2}(q)) even though f_{S1}(q) = f_{S2}(q), if \sigma_{f_{S1}} > \sigma_{f_{S2}} (i.e., S2 is ranked higher than S1 based on the density Z-score just because of the higher \sigma_{f_{S1}}).

To show this effect in a real-world dataset, let’s take an example from the Pendigits dataset (n = 9868, d = 16). Fig. 1 shows the distribution of data in two three-dimensional subspaces S1 and S2. Visually, the query represented by the red square appears to be a more obvious outlier in S1 than in S2. This is consistent with its raw density values in the two subspaces, f_{S1}(q) < f_{S2}(q). However, the ranking is reversed after the Z-score normalisation, Z(f_{S1}(q)) > Z(f_{S2}(q)). This is due to the higher \sigma_{f_{S1}}.

Figure 1: Data distribution in two three-dimensional subspaces, (a) S1 and (b) S2, of the Pendigits dataset.

From the above example, we can see that Z-score normalisation has a bias towards subspaces with high variance of density values. To overcome this weakness of Z-score normalisation, we propose in the next section a new scoring measure which has no such bias in its raw form and does not require any normalisation.
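The arithmetic behind this bias can be reproduced with a toy calculation; the density values below are hypothetical, chosen only to mimic two subspaces with equal mean densities but different spreads:

```python
def z_score(value, values):
    """Z-score of `value` with respect to the population `values`."""
    mu = sum(values) / len(values)
    sigma = (sum((v - mu) ** 2 for v in values) / len(values)) ** 0.5
    return (value - mu) / sigma

# Densities of all instances in two same-dimensionality subspaces:
# equal means, but S1 has a much larger spread of density values.
dens_s1 = [0.2, 1.0, 1.4, 1.8, 2.6]   # high sigma
dens_s2 = [1.2, 1.3, 1.4, 1.5, 1.6]   # low sigma
f_q = 0.2  # the query's raw density is identical in both subspaces

# Despite identical raw densities, the query's Z-score is far less
# extreme in S1, purely because sigma is larger there.
print(z_score(f_q, dens_s1), z_score(f_q, dens_s2))
```

With equal raw densities and equal means, the subspace with the larger sigma yields a Z-score closer to zero, so the ranking between the two subspaces is decided by the spread of the density values alone.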

4 The proposed new efficient score

Density-based scores have two limitations in OAM: (i) they are dimensionality-biased, which requires some normalisation for OAM; and (ii) they are expensive to compute in each subspace. Motivated by these limitations, we introduce a new measure for OAM which is dimensionality-unbiased in its raw form and can be computed efficiently.

Motivated by the isolation using Nearest Neighbour Ensembles (iNNE) method for anomaly detection [1, 2], we propose to use an ensemble of t models, where each model M_i (i = 1, 2, …, t) is constructed from a small random subsample of data, 𝒟_i ⊂ D, |𝒟_i| = ψ. Each model defines the normal region as the area covered by a set of hyperspheres centred at each x ∈ 𝒟_i, where the radius of the hypersphere is the Euclidean distance from x to its nearest neighbour in 𝒟_i. The rest of the space outside the hyperspheres is treated as the anomaly region. An example of constructing M_i from a subsample 𝒟_i (ψ = 8) of a two-dimensional dataset is shown in Fig. 2.

Figure 2: (a) An example dataset (the samples in dark black are selected to be in 𝒟_i to construct M_i); and (b) the normal region defined by the area covered by the hyperspheres in M_i.

In M_i, a query q is considered a normal instance if it falls in the normal region (i.e., inside at least one hypersphere); otherwise, it is considered an anomaly. The anomaly score of q in M_i is s_i(q) = 1 if q falls outside all of the hyperspheres, and s_i(q) = 0 otherwise.

Using an ensemble of t models, the final anomaly score of q is defined as:

SiNNE(q) = \frac{1}{t} \sum_{i=1}^{t} s_i(q)

It is interesting to note that iNNE uses a different definition of s_i(q), which uses the radii of the hyperspheres centred at the nearest neighbour of q and at that neighbour's nearest neighbour in 𝒟_i. Our definition is a lot simpler and more intuitive, as anomalies are expected to fall in the anomaly regions of more models than normal instances do. Because it is a simpler version of iNNE, we call the proposed measure SiNNE, where ‘S’ stands for “Simple”.

Because the area covered by each hypersphere decreases as the dimensionality of the space increases, so does the actual data space covered by normal instances. Therefore, SiNNE is independent of the dimensionality of the space in its raw form, without any normalisation, making it ideal for OAM. It adapts to the local data density because the sizes of the hyperspheres depend on the local density. It can be computed a lot faster than the k-NN distance or density. Also, it does not require computing the outlier scores of all instances in each subspace (which is required by the existing scores for Z-score normalisation), which gives it a significant advantage in terms of runtime.
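Putting the pieces together, here is a minimal Python sketch of SiNNE (psi and t follow the paper's parameter names ψ and t; the code is our own illustrative reconstruction, not the authors' released implementation):

```python
import math
import random

def _dist(a, b, subspace):
    # Euclidean distance restricted to the features in `subspace`.
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in subspace))

def build_model(data, subspace, psi, rng):
    """One SiNNE model: a set of (centre, radius) hyperspheres, one per
    subsampled point, with radius equal to the distance from the centre
    to its nearest neighbour in the subsample."""
    sample = rng.sample(data, psi)
    return [
        (c, min(_dist(c, o, subspace) for o in sample if o is not c))
        for c in sample
    ]

def sinne_score(query, data, subspace, psi=8, t=100, seed=0):
    """SiNNE score of `query` in `subspace`: the fraction of the t models
    in which the query falls outside every hypersphere (higher = more
    anomalous)."""
    rng = random.Random(seed)
    outside = 0
    for _ in range(t):
        spheres = build_model(data, subspace, psi, rng)
        if all(_dist(query, c, subspace) > r for c, r in spheres):
            outside += 1
    return outside / t
```

Each subspace only requires building t models over psi-point subsamples, so no pass over the scores of all n instances is needed; this is where the runtime advantage over Z-score normalisation comes from.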

4.1 Time Complexity

SiNNE is a two-stage process: (i) a training stage and (ii) an evaluation stage. In the training stage, a nearest-neighbour search within the small subsample is required to build the hyperspheres; this is done t times. The time complexity of the training stage is O(tψ²m) (where m is the dimensionality of the subspace). In the evaluation stage, the distance between q and each hypersphere centre needs to be computed to check whether q falls in the normal region; this has to be done in all t models. The time complexity of the evaluation stage is O(tψm). The computational costs of SiNNE and the existing scoring measures are presented in Table 2.

Scoring Measure Time Complexity
SiNNE O(tψ²m)
Density O(nm)
Density Rank O(n²m)
Density Z-Score O(n²m)
sGrid Z-Score O(nm⌈n/w⌉)
Table 2: The time complexity to compute the score of one query in a subspace using different measures. Note that n is the data size; m is the dimensionality of the subspace; ψ is the subsample size and t is the ensemble size in SiNNE; and w is the block size in the bitset operation, a parameter used in sGrid.

5 Experiments

In this section, we present the results of our empirical evaluation of the proposed measure, SiNNE, against the state-of-the-art OAM measures: kernel density rank (Density Rank), Z-score normalised kernel density (Density Z-Score) and Z-score normalised sGrid density (sGrid Z-Score), using both synthetic and real-world datasets, in terms of effectiveness and efficiency.

Implementation.

All measures and the experimental setup were implemented in Java using the WEKA platform [7]. We made the required changes to the Java implementation of iNNE provided by the authors to implement SiNNE. We implemented Density Rank and Density Z-Score based on the KDE implementation available in WEKA [7]. We used the Java implementation of sGrid made available by the authors [13].

We used the same Beam search strategy for the subspace search as done in [12] and [13].

Parameters.

We used the default parameters as suggested in the respective papers unless specified otherwise. For SiNNE, we set ψ = 8 and t = 100. Density Rank and Density Z-Score employ the Kernel Density Estimator (KDE) to estimate density. KDE uses the Gaussian kernel with the default bandwidth suggested by [8]. The block size parameter (w) for the bitset operation in sGrid was set to its default of 64, as suggested by [13]. The beam width and the maximum dimensionality of subspaces in the Beam search procedure were set to 100 and 3, respectively, as done in [12] and [13].

Data sets.

We used both synthetic and real-world datasets to ascertain the efficiency and effectiveness of the contending scoring measures. The real-world datasets are from [5]; all of them were downloaded from the ELKI outlier data repository (https://elki-project.github.io/datasets/outlier/). The synthetic datasets are from [9] (https://www.ipd.kit.edu/~muellere/HiCS/). The characteristics of the datasets in terms of data size and the dimensionality of the original input space are provided in Table 3. All datasets were normalised using min-max normalisation so that all attributes are in the range [0, 1] in all experiments.

Data Set  Data Size (n)  #Dimensions (d)
Synthetic dataset 1000 10
Pendigits 6870 16
Shuttle 49097 9
ALOI 50000 33
KDDCup99 60632 38
Table 3: Data set statistics.

All experiments were conducted on a machine with an AMD 16-core CPU and 64GB main memory, running Ubuntu 18.03. All jobs were run for up to 10 days; incomplete jobs were killed and marked as ‘✗’.

5.1 Evaluation I: Quality of discovered subspaces

In this subsection, we focus on the quality of the discovered subspaces. We discuss the results on synthetic and real-world datasets separately.

5.1.1 Performance on synthetic datasets

Keller et al. (2012) [9] provided several synthetic datasets, which have been used in previous studies [6, 12, 13]. Each dataset has a fixed number of outliers for which the outlying subspaces (ground truth) are known. The top-ranked subspace found by each measure was compared with the ground truth. We used the 10-dimensional synthetic dataset provided by [9], which has 19 outliers. We passed all outliers one at a time as the query and performed the Beam search OAM using the different scoring measures. Table 4 shows the subspaces found by SiNNE, Density Rank, Density Z-Score and sGrid Z-Score, along with the ground truths, for all queries.

In terms of exact matches, SiNNE is the best performing measure: it detected the ground truth as the top outlying aspect for every query. Density Z-Score and sGrid Z-Score produced exact matches for 18 queries. Density Rank is the worst performing measure, producing exact matches in only five queries.

q-id  Ground Truth  SiNNE  Density Rank  Density Z-Score  sGrid Z-Score
172 {8, 9} {8, 9} {1, 8, 9} {8, 9} {8, 9}
183 {0, 1} {0, 1} {0, 1} {0, 1} {0, 1}
184 {6, 7} {6, 7} {4, 6, 7} {6, 7} {6, 7}
207 {0, 1} {0, 1} {0, 1, 7} {0, 1} {0, 1}
220 {2, 3, 4, 5} {2, 3, 4, 5} {2, 3, 4, 5, 7} {2, 3, 4, 5} {2, 3, 4, 5}
245 {2, 3, 4, 5} {2, 3, 4, 5} {2, 3, 4, 5} {2, 3, 4, 5} {3, 4, 5}
315 {0, 1} {0, 1} {0, 1, 9} {0, 1} {0, 1}
{6, 7} {6, 7} {0, 6, 7} {6, 7} {6, 7}
323 {8, 9} {8, 9} {2, 8, 9} {8, 9} {8, 9}
477 {0, 1} {0, 1} {0, 1, 2} {0, 1} {0, 1}
510 {0, 1} {0, 1} {0, 1, 5} {0, 1} {0, 1}
577 {2, 3, 4, 5} {2, 3, 4, 5} {0, 3, 7} {6, 7} {2, 3, 4, 5}
654 {2, 3, 4, 5} {2, 3, 4, 5} {1, 2, 3, 4, 5} {2, 3, 4, 5} {2, 3, 4, 5}
704 {8, 9} {8, 9} {0, 8, 9} {8, 9} {8, 9}
723 {2, 3, 4, 5} {2, 3, 4, 5} {0, 2, 3, 4, 5} {2, 3, 4, 5} {2, 3, 4, 5}
754 {6, 7} {6, 7} {6, 7} {6, 7} {6, 7}
765 {6, 7} {6, 7} {1, 6, 7} {6, 7} {6, 7}
781 {6, 7} {6, 7} {6, 7} {6, 7} {6, 7}
824 {8, 9} {8, 9} {6, 8, 9} {8, 9} {8, 9}
975 {8, 9} {8, 9} {8, 9} {8, 9} {8, 9}
  • Query 315 has two outlying subspaces.

Table 4: Comparison of SiNNE, Density Rank, Density Z-Score and sGrid Z-Score on the synthetic dataset. Discovered subspaces that exactly match the ground truths are bold-faced. q-id represents the query point index; the numbers in the brackets (subspaces) are attribute indices.

5.1.2 Performance on real-world data sets

It is worth noting that we do not have ground truths for the real-world datasets to verify the quality of the discovered subspaces. Also, there is no quality assessment measure/criterion for discovered subspaces. Thus, we compare the results of the contending measures visually, where the dimensionality of the discovered subspaces is at most 3. We used the state-of-the-art outlier detector LOF [4] (we used the implementation of LOF available in WEKA [7] with k = 50) to find the top three outliers in each dataset and used them as queries.

Tables 5–8 show the subspaces discovered by each scoring measure in the Pendigits, Shuttle, ALOI and KDDCup99 datasets, respectively. Note that we plotted all one-dimensional subspaces using histograms, where the number of bins was set to 10. Visually, we can confirm that SiNNE identified better or at least similar outlying subspaces compared to the existing OAM measures.

q-id  SiNNE  Density Rank  Density Z-Score  sGrid Z-Score

293

1086

4539

Table 5: Visualization of the subspaces discovered by SiNNE, Density Rank, Density Z-Score and sGrid Z-Score in the Pendigits data set.
q-id  SiNNE  Density Rank  Density Z-Score  sGrid Z-Score

35368

38116

44445

Table 6: Visualization of the subspaces discovered by SiNNE, Density Rank, Density Z-Score and sGrid Z-Score in the Shuttle data set.
q-id  SiNNE  Density Rank  Density Z-Score  sGrid Z-Score

407

408

1156

Table 7: Visualization of the subspaces discovered by SiNNE, Density Rank, Density Z-Score and sGrid Z-Score in the ALOI data set.
q-id  SiNNE  Density Rank  Density Z-Score  sGrid Z-Score

43883

44812

46673

Table 8: Visualization of the subspaces discovered by SiNNE, Density Rank, Density Z-Score and sGrid Z-Score in the KDDCup99 data set.

5.2 Evaluation II: Efficiency

The average runtimes over randomly chosen queries for the contending measures on the four real-world datasets are provided in Table 9. SiNNE and sGrid Z-Score were able to finish on all four datasets. Density Rank and Density Z-Score were unable to complete within ten days on the two largest datasets, ALOI and KDDCup99. These results show that SiNNE enables the existing OAM approach (i.e., Beam search) to run orders of magnitude faster on large datasets. SiNNE was at least four orders of magnitude faster than Density Rank and Density Z-Score on the datasets where they could run within 10 days, and an order of magnitude faster than sGrid Z-Score on the two largest datasets.

Dataset  SiNNE  Density Rank  Density Z-Score  sGrid Z-Score
Pendigits  1  10536  12450  9
Shuttle  1  124781  125225  34
ALOI  25  ✗  ✗  365
KDDCup99  33  ✗  ✗  524
  • ✗: expected to take more than 10 days.

Table 9: Average runtime (in seconds) of queries of SiNNE, Density Rank, Density Z-Score and sGrid Z-Score on real-world datasets.

We also conducted scale-up tests of the contending measures w.r.t. (i) increasing data size (n) and (ii) increasing dimensionality (d), using synthetic datasets. We generated synthetic datasets with different n and d, where the data distribution is a mixture of five equal-sized Gaussians with random means and unit variance in each dimension. The datasets were normalised to the range [0, 1]. For each dataset, we randomly chose five points as queries and report the average runtime.

(a) Data size (n).
(b) Dimensionality (d).
Figure 3: Scale-up test.

5.2.1 Scale-up test with the increase in data size

The first scale-up test, with increasing data sizes, was conducted with data sizes varied over 100k, 500k, 1m, 5m and 10m at a fixed dimensionality. The runtimes are presented in Fig. 3(a); the data size and runtime are plotted on logarithmic scales. Again, all jobs were run for up to 10 days, and incomplete jobs were killed. SiNNE was the only measure to complete the task on the dataset containing 10m instances. Density Rank and Density Z-Score could complete within 10 days only on the smallest datasets, whereas sGrid Z-Score could complete on the dataset with 5m instances but not on the one with 10m instances. The result confirms that SiNNE runs at least two orders of magnitude faster than the existing state-of-the-art measures.

The runtime of SiNNE on the dataset with 10m instances was 44 seconds, whereas Density Rank and Density Z-Score were projected to take more than 30 days, and sGrid Z-Score more than 15 days.

5.2.2 Scale-up test with the increase in dimensionality

In this scale-up test, we examined the efficiency of the scoring measures w.r.t. the number of dimensions (d). A wide range of d values, {2, 5, 10, 50, 100, 200, 300, 500, 750, 1000}, was used with a fixed data size. Figure 3(b) shows the average runtimes of the contending measures; note that the runtime is plotted on a logarithmic scale. Again, all jobs were run for up to 10 days, and incomplete jobs were killed. SiNNE was the only measure to complete the task on the dataset with 1000 dimensions. sGrid Z-Score could not complete the full range of dimensions, while Density Rank and Density Z-Score could complete only up to 5 dimensions.

The runtimes for the 1000-dimensional dataset were as follows: SiNNE: 1 hr 8 min; Density Rank: 100 days (projected runtime); Density Z-Score: 100 days (projected runtime); and sGrid Z-Score: 10 days (projected runtime).

6 A comment on the definition of dimensionality unbiasedness

Duan et al. (2015) discussed the need for a dimensionality-unbiased score for the OAM problem [6] and suggested using the ranks of densities instead of the raw densities. Vinh et al. (2016) provided a formal definition of dimensionality unbiasedness:

Definition 1 (Dimensionality unbiasedness [12])

A dimensionality-unbiased outlyingness measure ρ is a measure whose baseline value, i.e., its average value for any data sample D drawn from a uniform distribution, is a quantity independent of the dimensionality of the subspace S, i.e., E_{x \in D}[ρ(x, S)] = constant w.r.t. |S|.

Both the density rank and Z-score normalisation of density satisfy the above condition. Vinh et al. (2016) highlighted that the density rank may not be appropriate for OAM and suggested using Z-score normalisation of density values [12]. However, we discovered that Z-score normalisation of density may also be inappropriate in some cases (as discussed in Section 3).

Because the expectation in Definition 1 is over the densities of all data instances, derivatives of density such as the density rank, Z-score normalisation, or even simple mean normalisation satisfy the condition. Though they make the scores of subspaces with different dimensionalities comparable, our results show that they are biased under some data conditions.

We argue that the given definition of dimensionality unbiasedness is not sufficient in outlying aspect mining. A better definition is required and it is still an open question.

7 Conclusions and Future work

In this study, we identified an issue with using Z-score normalisation of density to rank subspaces for OAM. In addition, Z-score normalisation requires computing the densities of all instances in every subspace, making an OAM algorithm infeasible to run on datasets with large sizes or dimensionalities. We introduced an efficient and effective scoring measure for OAM called Simple Isolation score using Nearest Neighbor Ensemble (SiNNE). SiNNE uses an isolation-based mechanism to compute the outlierness of the query in each subspace, and it is dimensionality-unbiased. Therefore, SiNNE does not require any normalisation to compare the scores of subspaces with different dimensionalities: its raw scores can be compared directly. It also runs significantly faster than existing measures because, unlike the density rank or Z-score normalisation, it does not require computing the scores of all instances. By replacing the existing scoring measures with the proposed one, the existing OAM algorithm can now easily run on datasets with millions of data instances and thousands of dimensions. Our results show that SiNNE identifies more convincing outlying subspaces for queries than existing measures.

Our future work aims to investigate the theoretical properties of SiNNE and a better definition of dimensionality unbiasedness in the context of OAM.

Acknowledgments

This work is supported by Federation University Research Priority Area (RPA) scholarship, awarded to Durgesh Samariya.

References

  • [1] T. R. Bandaragoda, K. M. Ting, D. Albrecht, F. T. Liu, and J. R. Wells (2014) Efficient anomaly detection by isolation using nearest neighbour ensemble. In 2014 IEEE International Conference on Data Mining Workshop, pp. 698–705.
  • [2] T. R. Bandaragoda, K. M. Ting, D. Albrecht, F. T. Liu, Y. Zhu, and J. R. Wells (2017) Isolation-based anomaly detection using nearest-neighbor ensembles. Computational Intelligence, pp. 1–31.
  • [3] V. Barnett and T. Lewis (1984) Outliers in Statistical Data. 3rd edition, John Wiley and Sons, New York.
  • [4] M. M. Breunig, H. Kriegel, R. T. Ng, and J. Sander (2000) LOF: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD ’00, pp. 93–104.
  • [5] G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenková, E. Schubert, I. Assent, and M. E. Houle (2016) On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Mining and Knowledge Discovery 30 (4), pp. 891–927.
  • [6] L. Duan, G. Tang, J. Pei, J. Bailey, A. Campbell, and C. Tang (2015) Mining outlying aspects on numeric data. Data Mining and Knowledge Discovery 29 (5), pp. 1116–1151.
  • [7] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten (2009) The WEKA data mining software: an update. SIGKDD Explorations Newsletter 11 (1), pp. 10–18.
  • [8] W. Härdle (2012) Smoothing Techniques: With Implementation in S. Springer Science & Business Media, New York.
  • [9] F. Keller, E. Müller, and K. Böhm (2012) HiCS: high contrast subspaces for density-based outlier ranking. In Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, ICDE ’12, pp. 1037–1048.
  • [10] S. Russell and P. Norvig (2009) Artificial Intelligence: A Modern Approach. 3rd edition, Prentice Hall Press, Upper Saddle River, NJ, USA.
  • [11] B. W. Silverman (1986) Density Estimation for Statistics and Data Analysis. Chapman & Hall, London.
  • [12] N. X. Vinh, J. Chan, S. Romano, J. Bailey, C. Leckie, K. Ramamohanarao, and J. Pei (2016) Discovering outlying aspects in large datasets. Data Mining and Knowledge Discovery 30 (6), pp. 1520–1555.
  • [13] J. R. Wells and K. M. Ting (2019) A new simple and efficient density estimator that enables fast systematic search. Pattern Recognition Letters 122, pp. 92–98.
  • [14] J. Zhang, M. Lou, T. W. Ling, and H. Wang (2004) HOS-Miner: a system for detecting outlying subspaces of high-dimensional data. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, VLDB ’04, pp. 1265–1268.