A Comprehensive Survey on Outlying Aspect Mining Methods

Durgesh Samariya et al. ∙ Nanjing University, Deakin University, Federation University Australia ∙ 05/06/2020

In recent years, researchers have become increasingly interested in outlying aspect mining: the task of finding the set of features (subspace) in which a given data object differs most from the rest of the data. Remarkably few studies have addressed this problem, so little is known about the existing approaches and their strengths and weaknesses. In this work, we group existing outlying aspect mining approaches into three categories. For each category, we survey the work that falls in it and then discuss its strengths and weaknesses. We also compare the time complexity of the existing techniques, since efficiency is a crucial issue in real-world scenarios. The aim of this paper is to give a better understanding of existing outlying aspect mining techniques and how they have been developed.


1 Introduction

The concept of outliers has been studied extensively in the statistics community since the 19th century [8]. In real-world application scenarios, data usually contain outliers, a.k.a. anomalies, which differ from the rest of the data. The word outlier denotes "a statistical observation that is markedly different in value from the others of the sample" (https://www.merriam-webster.com/dictionary/outlier, accessed 06 April 2020).

Barnett and Lewis (1984) [4] formally defined an outlier as: "An observation (or a subset of observations) which appears to be inconsistent with the remainder of that set of data". Outlier Detection (OD) is an essential task in data mining that deals with detecting outliers in data sets automatically. Over the years, an enormous amount of research has been carried out to detect outliers in a data set. Although these algorithms detect outliers well, they are not able to explain why a point is considered an outlier, i.e., they cannot tell in which feature subset(s) the data object significantly deviates from the rest of the data.

The need to explain outliers has led to renewed interest in Outlying Aspect Mining (OAM). Outlying aspect mining is formally defined as the task of identifying the feature subset(s) in which a given data object is inconsistent with the remainder of the data set. The given data object is called the query, and those feature subset(s) are called the outlying aspects of the query.

The following are some of the definitions found in the literature:

  • Outlying aspect mining discovers feature subsets (or subspaces) that describe how a query stands out from a given data set [17].

  • [16] define outlying aspect mining as the "problem of investigating, for a particular query object, the sets of features (a.k.a. attributes, dimensions) that make it most unusual compared to the rest of the data".

Previous studies have referred to this problem as outlier explanation [10], outlier interpretation [6], outlying subspace detection [18], and outlying aspect mining [7, 16]. A recent line of research has settled on the name outlying aspect mining [7, 16, 17, 13].

Past studies have hinted at a link between OAM and OD. However, it is worth noting that the two are different: the aim of OAM is to find the aspects of a given data object in which it exhibits the most outlying characteristics, while OD focuses on detecting all instances exhibiting outlying characteristics in the given original input space.

Outlying aspect mining has many practical applications. For example, an insurance analyst may want to find out in which particular aspect an insurance claim looks suspicious. Likewise, a selection panel evaluating job applications may want to investigate in which specific aspect an applicant is most different from the others, e.g., among candidates with similar qualifications and experience, John has the highest number of successfully completed projects.

Outlying aspect mining is a new and interesting research topic. To the best of our knowledge, no survey of it has been conducted so far, which motivates us to write this survey. In this paper, we provide a structured and in-depth review of research on OAM techniques. The work on OAM is categorized into three categories: 1) score-and-search based approaches, 2) feature selection based approaches, and 3) hybrid approaches.

This paper is organized into seven sections. Section 2 provides an overview of OAM approaches. Outlying aspect mining techniques are then covered by category: score-and-search based approaches (Section 3), feature selection based approaches (Section 4) and hybrid approaches (Section 5). We discuss open challenges in Section 6, and concluding remarks are provided in Section 7.

2 Overview of OAM approaches

Symbol | Definition
$\mathcal{D}$ | A set of $N$ data instances in a $d$-dimensional space, $\mathcal{D} = \{x_1, x_2, \dots, x_N\}$
$x$ | A data instance represented as a vector, $x = (x^{(1)}, x^{(2)}, \dots, x^{(d)})$
$\mathcal{F}$ | The set of input features, i.e., $\mathcal{F} = \{F_1, F_2, \dots, F_d\}$
$\mathcal{S}$ | The set of all possible subspaces (non-empty subsets) of $\mathcal{F}$
$dist_S(x, y)$ | The Euclidean distance between $x$ and $y$ in subspace $S$
$\mathcal{N}_k(q \mid S)$ | The set of $k$-nearest neighbors of $q$ in subspace $S$
Table 1: Key symbols and notations used in this paper.

To start with, we fix the notation used in the rest of the paper and introduce a few preliminary definitions. The primary symbols and notations used are provided in Table 1. Let $\mathcal{D}$ be a collection of $N$ data objects in a $d$-dimensional space, where each data object is represented as a $d$-dimensional vector $x = (x^{(1)}, x^{(2)}, \dots, x^{(d)})$.

As mentioned above, OAM approaches fall into three categories:

  1. Score-and-Search: In the score-and-search based approach, an OAM algorithm computes the outlying degree of the query in each possible subspace in order to identify the subspace(s) where the query exhibits the highest degree of outlyingness w.r.t. the rest of the data.

  2. Feature Selection: In this approach, the OAM problem is treated as a traditional feature selection problem for classification.

  3. Hybrid Approach: In the hybrid approach, the OAM problem is solved using a combination of the score-and-search and feature selection based approaches.

3 Score-and-Search based approach

To date, most of the studies addressing the OAM problem belong to this category. A score-and-search approach requires a scoring function to measure the outlying degree of the given query; the outlyingness of the query is then compared across all possible subspaces to detect its most outlying aspects.

As far as we know, [18] is the earliest work addressing the problem of outlying aspect mining. Therein, the authors introduced a framework that detects the outlying subspaces of a given query, termed HOS-Miner (High-dimensional Outlying Subspace Miner). They formulated the problem as follows: for a given data object, identify the subspaces in which this object is considerably dissimilar or inconsistent w.r.t. the rest of the data. Mathematically, for a given data object $q$, find the set of subspaces $S \in \mathcal{S}$ such that $OD_S(q) \geq \delta$, where $OD$ is the distance-based scoring function (Equation 1) and $\delta$ is a distance threshold. They refer to the subspaces so reported as "outlying subspaces" of $q$.

In their work, they employed a distance-based scoring measure called Outlying Degree ($OD$ in short) to measure the outlyingness of the given query, which is the sum of the distances between the query and its $k$-nearest neighbors. The $OD$ of a query point $q$ in subspace $S$ is calculated as:

$OD_S(q) = \sum_{x \in \mathcal{N}_k(q \mid S)} dist_S(q, x) \qquad (1)$
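
As a concrete illustration, the following is a minimal Python sketch of the $OD$ score under the definition above, assuming Euclidean distance and a brute-force $k$-NN search; all function and variable names are ours, not HOS-Miner's.

```python
import numpy as np

def od_score(X, q, subspace, k=5):
    """Outlying Degree (Equation 1): sum of distances from the query to
    its k nearest neighbors, computed in the given subspace."""
    Xs, qs = X[:, subspace], q[subspace]
    d = np.linalg.norm(Xs - qs, axis=1)   # Euclidean distances in the subspace
    return np.sort(d)[:k].sum()           # k smallest distances
                                          # (assumes q itself is not a row of X)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
q = np.array([3.0, 0.0, 0.0, 0.0, 0.0])   # outlying only in feature 0
print(od_score(X, q, subspace=[0]))       # large OD: outlying in {0}
print(od_score(X, q, subspace=[2]))       # small OD: unremarkable in {2}
```
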
Figure 1: The overview of HOS-Miner [18].

The process of HOS-Miner is shown in Fig. 1. The framework is divided into four steps. In the first step, the X-tree indexing module builds an X-Tree [5] index on the data set to enable faster $k$-nearest neighbor ($k$-NN) search in a subspace $S$. In the second step, the random sampling module randomly selects samples from the data set and then performs a dynamic subspace search to examine downward and upward subspace pruning possibilities from low- to high-dimensional subspaces. In the subsequent step, the subspace outlier detection module calculates the outlier score of the query and performs a dynamic subspace search to find subspaces where the query object deviates from the rest of the data. The last module is a filtering module, which filters out the most outlying subspaces and returns them to the user.

Duan et al. (2015) [7] introduced Outlying Aspect Miner (OAMiner in short), which uses a Kernel Density Estimation (KDE) [14] based scoring measure to compute the outlyingness of query $q$ in subspace $S$:

$\tilde{f}_S(q) = \frac{1}{N (2\pi)^{|S|/2} \prod_{i \in S} h_i} \sum_{x \in \mathcal{D}} \exp\!\left(-\sum_{i \in S} \frac{(q_i - x_i)^2}{2 h_i^2}\right) \qquad (2)$

where $\tilde{f}_S(q)$ is the kernel density estimate of $q$ in subspace $S$, $|S|$ is the dimensionality of subspace $S$, and $h_i$ is the kernel bandwidth in dimension $i$.
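
The following sketch computes this density for a query in a given subspace, assuming a Gaussian kernel and user-supplied per-dimension bandwidths; the names are ours:

```python
import numpy as np

def kde_density(X, q, subspace, bandwidths):
    """Gaussian-kernel density estimate of q in a subspace (Equation 2).
    `subspace` is a list of feature indices; `bandwidths` maps each
    feature index i to its kernel bandwidth h_i."""
    Xs, qs = X[:, subspace], q[subspace]
    h = np.array([bandwidths[i] for i in subspace])
    # exponent: sum over subspace dims of (q_i - x_i)^2 / (2 h_i^2)
    expo = ((qs - Xs) ** 2 / (2 * h ** 2)).sum(axis=1)
    norm = len(X) * (2 * np.pi) ** (len(subspace) / 2) * np.prod(h)
    return np.exp(-expo).sum() / norm
```
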

The study carried out by Duan et al. (2015) [7] noted that density is biased towards high-dimensional subspaces: density tends to decrease as dimensionality increases. Thus, to remove this dimensionality bias, they proposed using the density rank of the query as the measure of outlyingness. To find the most outlying subspace of a query, the density of every data point must be computed in each subspace, and the subspace in which the query achieves the best rank is selected as the outlying aspect of the given query.

OAMiner systematically enumerates all possible subspaces. To do so, the authors used the set enumeration tree approach [12], which is widely used in the data mining research community, and searched for subspaces by traversing the tree in a depth-first manner [11]. OAMiner also uses an anti-monotonicity property to prune subspaces: given a data set $\mathcal{D}$, a query object $q$ and a subspace $S$, if $\mathrm{rank}_S(q) = 1$, then no superset of $S$ can be a minimal outlying subspace, and all supersets of $S$ can therefore be pruned.
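
The rank-and-prune idea can be sketched as follows. This simplified version enumerates subspaces level by level rather than via OAMiner's depth-first set-enumeration tree, and it reuses the `kde_density` function from the sketch above:

```python
from itertools import combinations

def density_rank(X, q, subspace, bandwidths):
    """Rank of q's density among all points in the subspace
    (rank 1 = lowest density, i.e. most outlying)."""
    dens_q = kde_density(X, q, subspace, bandwidths)
    dens_all = [kde_density(X, x, subspace, bandwidths) for x in X]
    return 1 + sum(d < dens_q for d in dens_all)

def oaminer_search(X, q, bandwidths, max_dim=3):
    """Level-wise enumeration with rank-1 pruning; a sketch of the idea
    only, not OAMiner's actual depth-first traversal."""
    d = X.shape[1]
    best, best_rank = [], len(X) + 1
    pruned = []                                 # rank-1 subspaces found so far
    for m in range(1, max_dim + 1):
        for S in combinations(range(d), m):
            if any(set(p) <= set(S) for p in pruned):
                continue                        # superset of a rank-1 subspace
            r = density_rank(X, q, list(S), bandwidths)
            if r == 1:
                pruned.append(S)                # supersets cannot be minimal
            if r < best_rank:
                best, best_rank = [S], r
            elif r == best_rank:
                best.append(S)
    return best, best_rank
```
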

OAMiner has two fundamental challenges:

  1. OAMiner uses a density-based scoring function. Computing the density of each data point in each subspace is computationally expensive, which makes OAMiner infeasible on large and high-dimensional data sets: finding the rank of $q$ in subspace $S$ takes $O(N^2 \cdot |S|)$ time.

  2. OAMiner employs depth-first search and uses the anti-monotonicity property to prune subspaces; even so, an expensive search is required to find the outlying aspects of the given query.

The work of Vinh et al. (2016) [16] formalizes the concept of dimensionality unbiasedness and investigates scoring functions that are dimensionally unbiased. Dimensionality unbiasedness is an essential property for outlyingness measures because the query object is compared across subspaces with different numbers of dimensions. They proposed two novel outlyingness scoring metrics: (1) the density $Z$-score and (2) the isolation path score (iPath in short), and showed that both are dimensionally unbiased.

Therein, the density $Z$-score is defined as follows:

$Z(\tilde{f}_S(q)) = \frac{\tilde{f}_S(q) - \mu_{\tilde{f}_S}}{\sigma_{\tilde{f}_S}} \qquad (3)$

where $\mu_{\tilde{f}_S}$ and $\sigma_{\tilde{f}_S}$ are the mean and standard deviation of the density of all data instances in subspace $S$, respectively.
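
Assuming the `kde_density` sketch from above, the $Z$-score can be computed as:

```python
import numpy as np

def density_zscore(X, q, subspace, bandwidths):
    """Density Z-score of q in a subspace (Equation 3): the number of
    standard deviations q's density lies from the mean density."""
    dens = np.array([kde_density(X, x, subspace, bandwidths) for x in X])
    dq = kde_density(X, q, subspace, bandwidths)
    return (dq - dens.mean()) / dens.std()
```
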

The iPath score is motivated by the isolation Forest (iForest) anomaly detection approach [9]. The intuition behind iForest is that anomalies are few and more susceptible to isolation. iForest constructs $t$ trees, each built from a randomly selected sub-sample of $\psi$ points, which it divides using axis-parallel random splits. Since in the outlying aspect mining context the main focus is the path length of the query, the authors ignore the other parts of each tree. In outlying aspect mining, the intuition behind the iPath score is that, in the most outlying subspace, the given query is easier to isolate than the rest of the data.

The iPath of query $q$ w.r.t. $t$ sub-samples of the data is computed as

$iPath_S(q) = \frac{1}{t} \sum_{i=1}^{t} l_i(q) \qquad (4)$

where $l_i(q)$ is the path length of $q$ in tree $i$ and subspace $S$.
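
A minimal sketch of the iPath computation follows; it grows only the branch containing the query, as described above, with a depth cap added for safety (the parameter defaults are ours):

```python
import numpy as np

def ipath_length(sample, q, rng, depth=0, max_depth=10):
    """Number of random axis-parallel splits needed to isolate q from a
    sub-sample (one tree); only the branch containing q is ever grown."""
    if len(sample) <= 1 or depth >= max_depth:
        return depth
    dim = rng.integers(sample.shape[1])
    lo, hi = sample[:, dim].min(), sample[:, dim].max()
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # keep only the points on the same side of the split as q
    side = sample[:, dim] < split if q[dim] < split else sample[:, dim] >= split
    return ipath_length(sample[side], q, rng, depth + 1, max_depth)

def ipath_score(X, q, subspace, t=50, psi=32, seed=0):
    """Average isolation path length of q over t random sub-samples
    (Equation 4); shorter paths mean q is easier to isolate, i.e. more
    outlying in that subspace."""
    rng = np.random.default_rng(seed)
    Xs, qs = X[:, subspace], q[subspace]
    paths = []
    for _ in range(t):
        idx = rng.choice(len(Xs), size=min(psi, len(Xs)), replace=False)
        paths.append(ipath_length(Xs[idx], qs, rng))
    return np.mean(paths)
```
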

An illustration of iPath is presented in Fig. 2, in which the red square is a query point in a 2-dimensional space and each numbered horizontal or vertical line represents a split. In Fig. 2(a), 3 splits are required to isolate the query, whereas 7 splits are required in Fig. 2(b).

Figure 2: An illustrative example of iPath. The query is shown as a red square. (a) A random isolation path of a query point where it is an outlier. (b) A random isolation path of a query point where it is an inlier [16].

Vinh et al. (2016) [16] were the first to coin the term dimensionality unbiasedness.

Definition 1 (Dimensionality unbiased [16])

A dimensionality unbiased outlyingness measure $\rho$ is a measure whose baseline value, i.e., its average value for any data sample $\mathcal{D}$ drawn from a uniform distribution, is a quantity independent of the dimensionality of the subspace $S$, i.e.,

$E[\rho_S(x) \mid x \in \mathcal{D}] = \text{const. w.r.t. } |S|$

In [16, Theorem 3], it is proven that the rank transformation and $Z$-score normalization result in a constant average value under any data distribution. It is worth noting that for the $Z$-score not only is the baseline of the normalized measure constant, but its variance is also constant w.r.t. dimensionality.
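
As an informal illustration of Definition 1 (our own sanity check, not an experiment from [16]), one can compare the baseline of the raw density against that of the $Z$-score on uniformly distributed data, reusing the `kde_density` sketch from above:

```python
import numpy as np

# On uniform data, the mean raw density shrinks as dimensionality grows
# (dimensionality bias), while the Z-score's baseline is 0 at every
# dimensionality by construction, which is what Definition 1 requires.
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 6))
bw = {i: 0.2 for i in range(6)}           # fixed bandwidths, for illustration
for m in (1, 2, 4):
    S = list(range(m))
    dens = np.array([kde_density(X, x, S, bw) for x in X])
    z = (dens - dens.mean()) / dens.std()
    print(m, round(dens.mean(), 4), round(z.mean(), 4))
```
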

To avoid enumerating all subspaces exhaustively, Vinh et al. (2016) [16] employ a beam search divided into three stages. In the first stage, all 1-D subspaces are inspected to identify trivial outlying features. In the second stage, an exhaustive search is performed over all possible 2-D subspaces. In the third stage, beam search is applied at levels $\ell \geq 3$: the algorithm keeps only the top $W$ subspaces (where $W$ is called the beam width) at each level of the search. The total number of subspaces considered by the beam algorithm is in the order of $O(d^2 + W \cdot d \cdot \ell_{max})$, where $\ell_{max}$ is the maximum subspace dimension and $W$ is the beam width. A condensed sketch of this staged search is given below.
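
In the sketch, `score` is any per-subspace outlyingness function (e.g. the density $Z$-score above, under the convention that lower means more outlying), and the staging details are simplified relative to [16]:

```python
from itertools import combinations

def beam_search(d, score, max_dim=4, beam_width=10):
    """Staged search sketch: exhaustive at dimensions 1 and 2, beam search
    from dimension 3 upward. `score(S)` returns the outlyingness of the
    query in subspace S; lower = more outlying under this convention."""
    results = {}
    for m in (1, 2):                               # stages 1 and 2: exhaustive
        for S in combinations(range(d), m):
            results[S] = score(list(S))
    beam = sorted((S for S in results if len(S) == 2),
                  key=results.get)[:beam_width]
    for level in range(3, max_dim + 1):            # stage 3: beam search
        candidates = {tuple(sorted(set(S) | {f}))
                      for S in beam for f in range(d) if f not in S}
        for S in candidates:
            results[S] = score(list(S))
        beam = sorted(candidates, key=results.get)[:beam_width]
    return min(results, key=results.get), results
```

For example, with the earlier sketches in scope one could call `beam_search(X.shape[1], lambda S: density_zscore(X, q, S, bw))`.
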

[17] introduced a simple grid-based density estimator called sGrid, a smoothed variant of the grid-based density estimator of [14]. Let $\mathcal{D}$ be a collection of data objects in a $d$-dimensional space and $x^{(S)}$ be the projection of a data object $x$ onto subspace $S$. The sGrid density of a point $x^{(S)}$ is computed as the number of points falling in the bin that covers $x^{(S)}$ and its surrounding neighboring bins. Fig. 3 shows an illustrative example of sGrid.

Figure 3: An illustrative example of the sGrid [17].

In their work, they showed that the proposed density estimator has advantages over the existing kernel density estimator in outlying aspect mining: with KDE replaced by sGrid, OAMiner [7] and Beam [16] run two orders of magnitude faster than their original implementations. However, sGrid is not a dimensionally unbiased measure, and hence it requires $Z$-score normalization, which again makes it computationally less efficient.
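
The idea of sGrid can be sketched as follows. This plain version counts points in the bin covering the query plus the immediately neighboring bins, whereas the actual sGrid implementation [17] gains its speed from bitset operations, which we omit:

```python
import numpy as np

def sgrid_density(X, q, subspace, bins=8):
    """Smoothed grid density sketch: count points whose bin is within one
    bin of q's bin along every dimension of the subspace."""
    Xs, qs = X[:, subspace], q[subspace]
    lo, hi = Xs.min(axis=0), Xs.max(axis=0)
    width = (hi - lo) / bins + 1e-12
    cell_X = np.floor((Xs - lo) / width).astype(int).clip(0, bins - 1)
    cell_q = np.floor((qs - lo) / width).astype(int).clip(0, bins - 1)
    # a point contributes if every coordinate is within one bin of q's bin
    mask = (np.abs(cell_X - cell_q) <= 1).all(axis=1)
    return mask.sum() / len(X)
```
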

Very recently, [13] proposed the Simple Isolation score using Nearest Neighbor Ensemble (SiNNE in short), a measure motivated by the Isolation using Nearest Neighbor Ensembles (iNNE in short) method for outlier detection [3]. SiNNE constructs an ensemble of $t$ models $M_1, \dots, M_t$. Each model $M_i$ is built from a sub-sample $\mathcal{D}_i$ of $\psi$ randomly chosen data objects and consists of $\psi$ hyperspheres, where the radius of the hypersphere centered at $x \in \mathcal{D}_i$ is the Euclidean distance from $x$ to its nearest neighbor in $\mathcal{D}_i$. A working example of a SiNNE model, constructed on a 2-dimensional data set of 20 data objects, is presented in Fig. 4. The outlying score of $q$ in model $M_i$ is $s(q \mid M_i) = 0$ if $q$ falls in any of the hyperspheres and 1 otherwise. The final outlying score of $q$ using $t$ models is:

$SiNNE(q) = \frac{1}{t} \sum_{i=1}^{t} s(q \mid M_i) \qquad (5)$

In their work, they argue that $Z$-score normalization is biased towards subspaces with high density variance, and that the existing definition of dimensionality unbiasedness is not sufficient. SiNNE is computationally faster than density- and distance-based measures.
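
A minimal sketch of the SiNNE score, with parameter defaults chosen for illustration (it requires at least $\psi$ points in the data):

```python
import numpy as np

def sinne_score(X, q, subspace, t=50, psi=8, seed=0):
    """SiNNE sketch (Equation 5): build t models, each a set of psi
    hyperspheres centred on a random sub-sample, with radius equal to the
    centre's nearest-neighbour distance within the sub-sample; q scores 0
    in a model if it falls inside any hypersphere, 1 otherwise."""
    rng = np.random.default_rng(seed)
    Xs, qs = X[:, subspace], q[subspace]
    scores = []
    for _ in range(t):
        idx = rng.choice(len(Xs), size=psi, replace=False)
        C = Xs[idx]                                   # hypersphere centres
        D = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=2)
        np.fill_diagonal(D, np.inf)
        radii = D.min(axis=1)                         # NN distance per centre
        inside = (np.linalg.norm(C - qs, axis=1) <= radii).any()
        scores.append(0.0 if inside else 1.0)
    return np.mean(scores)                            # high = more outlying
```
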

Figure 4: (a) An example data set (samples in dark black are selected as the sub-sample $\mathcal{D}_i$ used to construct model $M_i$); and (b) the normal region, defined as the area covered by the hyperspheres in $M_i$ [13].

Strengths and Weaknesses.

The existing OAM score-and-search techniques show good performance. However, distance- and density-based measures are computationally expensive; as a result, they are only applicable to very small data sets. The iPath score is the computationally fastest measure because it does not require any distance computation, but it is unable to detect local outliers. The sGrid density estimator is a good replacement for the KDE estimator because it is computationally more efficient; however, sGrid is biased towards high-dimensional subspaces and thus requires $Z$-score normalization, which adds significant computational overhead. SiNNE is the second-fastest measure after iPath, and unlike iPath it can detect local outliers. In addition, it is a dimensionally unbiased measure, so it needs no normalization. The time complexity of each scoring measure is summarized in Table 2.

Scoring Measure | Time Complexity
Density | $O(N \cdot m)$
Density Rank | $O(N^2 \cdot m)$
Density $Z$-Score | $O(N^2 \cdot m)$
iPath | $O(t \cdot \psi)$
sGrid $Z$-Score | $O(N \cdot m \cdot \lceil N / w \rceil)$
SiNNE | $O(t \cdot \psi)$
Table 2: The time complexity of computing the score of one query in a subspace using different measures. Note that $N$ is the data size; $m$ is the dimensionality of the subspace; $w$ is the block size in bitset operations, a parameter used in sGrid; $\psi$ is the sub-sample size and $t$ is the number of models (trees), parameters used in iPath and SiNNE.

4 Feature Selection

Compared to the above-mentioned approach, relatively few studies take the feature selection route. In the feature selection approach [10, 6], the outlying aspect mining problem is first transformed into a classification problem, and classical feature selection techniques are then applied to find an explanatory subspace for a given outlier.

In this line of work, [10] is the earliest study, performing outlier explanation on numeric data sets. The authors termed the outlying aspect mining problem outlier explanation and formulated it as follows: for a given outlier, detected by any outlier detection algorithm, find a possible explanation for that outlier. The outlier is given as the query (input) data object, and the aim is to find an explanatory subspace.

Outlier explanation converts the OAM problem into a two-class (inlier and outlier) classification problem. For each outlier $x$, an artificial outlier class is generated by sampling from a Gaussian $\mathcal{N}(x, \lambda^2 I)$, where $I$ is the $d \times d$ identity matrix and $\lambda$ is proportional to the distance between $x$ and its $k$-th nearest neighbor. The negative (inlier) class is constructed from the $k$-nearest neighbors of the outlier in the full feature space together with points sub-sampled from the rest of the data set.
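
The transformation can be sketched as follows; the class-separation statistic at the end is a placeholder of ours, whereas [10] applies proper feature selection to the two constructed classes, and the exact sampling constants also differ:

```python
import numpy as np

def explanation_features(X, outlier_idx, k=10, n_synth=50, seed=0):
    """Two-class reformulation sketch: synthetic outlier class sampled
    around the outlier, inlier class from its neighborhood plus a random
    sub-sample, then a crude per-feature separation score."""
    rng = np.random.default_rng(seed)
    x = X[outlier_idx]
    d = np.linalg.norm(X - x, axis=1)
    lam = np.sort(d)[k]                    # scale from the k-th NN distance
    pos = rng.normal(x, lam / 3, size=(n_synth, X.shape[1]))  # outlier class
    order = np.argsort(d)
    neg = np.vstack([X[order[1:k + 1]],    # k nearest inliers (skip x itself)
                     X[rng.choice(len(X), size=k, replace=False)]])
    # per-feature separability; [10] uses proper feature selection instead
    gap = np.abs(pos.mean(0) - neg.mean(0)) / (pos.std(0) + neg.std(0) + 1e-12)
    return np.argsort(gap)[::-1]           # features, most explanatory first
```
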

[2] studied the problem of outlying property detection and introduced an outlying property detection technique for categorical data sets: the goal is to find the top set(s) of attributes in which the query point receives the highest outlier score. [1] proposed a version for numeric data sets: for a given data set in a $d$-dimensional space and a query object $q$, [1] finds pairs $(E, p)$ satisfying $E \subset \mathcal{F}$ and $p \in \mathcal{F} \setminus E$, where $E$ is referred to as the explanation and $p$ as the property (an outlying dimension). In 2014, [6] introduced LOGP, which stands for Local Outliers with Graph Projection. LOGP is a novel technique that offers a solution to two problems: (1) outlier detection and (2) outlier interpretation.

Strengths and Weaknesses.

The advantage of these methods is that they do not require any subspace search, so they are faster than score-and-search methods. However, feature selection based methods depend on nearest-neighbor techniques, and, as pointed out by Vinh et al. (2016) [16], the $k$-nearest neighbors in the full-dimensional space can be dramatically different from the $k$-NN in a subspace.

5 Hybrid Approach

To the best of our knowledge, OARank (which stands for Outlying Aspect Mining via Feature Ranking) [15] is the only work that solves the OAM problem using a hybrid approach, combining the strengths of the feature selection and score-and-search based approaches. The OARank framework is a two-stage process. In the first stage, OARank ranks the features according to their potential to make the query outlying. In the second stage, score-and-search is performed on the set of top-$k$ ranked features, where $k \ll d$. The second stage is optional: the top-ranked features can either be inspected manually by the user, or the user can run score-and-search on them.

Features are ranked according to the query's per-feature kernel density:

$\tilde{f}_i(q) = \frac{1}{Z} \sum_{x \in \mathcal{D}} K_{h_i}(q_i - x_i) \qquad (6)$

where $K_{h_i}$ is the one-dimensional Gaussian kernel with bandwidth $h_i$ and center $x_i$, and $Z$ is a normalization constant.
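
The first-stage ranking can be illustrated with the following sketch, which ranks features by the query's one-dimensional kernel density; this captures only the intuition, as the actual OARank criterion in [15] is more involved:

```python
import numpy as np

def rank_features_by_1d_density(X, q, bandwidth=0.3):
    """Rank features by the query's 1-D Gaussian kernel density
    (Equation 6 flavour): the lower the density in a feature, the more
    outlying the query is in it."""
    n, d = X.shape
    dens = np.empty(d)
    for i in range(d):
        u = (q[i] - X[:, i]) / bandwidth
        dens[i] = np.exp(-0.5 * u ** 2).sum() / (n * bandwidth * np.sqrt(2 * np.pi))
    return np.argsort(dens)                # feature indices, most outlying first
```
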

Strengths and Weaknesses.

Hybrid systems build on the connection between the score-and-search and feature selection based approaches. OARank uses a kernel density estimator to determine the features in which the query's density is minimized, which is again computationally prohibitive on large and high-dimensional data sets.

6 Open Challenges

Outlying aspect mining has so far received only modest attention from researchers, and many challenges need attention in the future. The first and foremost challenge is that traditional score-and-search based approaches use distance- or density-based scoring measures. These measures are easy to implement, but their time complexity is quadratic in the data size, i.e., $O(N^2)$ per subspace, which makes them infeasible on high-dimensional and huge data sets. The most computationally expensive part of OAM is the computation of the score, which is repeated for every data object in each subspace.

Another issue that still needs attention is that there is no globally accepted evaluation measure for outlying aspect mining systems. Vinh et al. (2016) [16] proposed an entropy-based evaluation measure called the consensus index. However, Wells and Ting (2019) [17] pointed out that the consensus index is more suitable for evaluating clustering outcomes than for assessing the outlierness of a query in a subspace. Therefore, one of the open research challenges is the development of an evaluation metric for the outlying aspects of a given query detected by OAM systems.

An important part of OAM is searching the subspaces in which a given data object differs from the rest of the data. Using systematic search methods, OAM has to compute the outlierness of the given query in every subspace, which makes OAM methods computationally expensive. An appropriate search technique is therefore needed to mitigate the effect of the curse of dimensionality.

7 Conclusion

Outlying aspect mining is a young field, and little is known about it in the research community, which motivated this survey. We have summarized the various ways in which the outlying aspect mining problem has been solved in the past and discussed existing work, organized by approach, along with the strengths and weaknesses of each approach in its respective category. We are particularly interested in problems related to efficiency and effectiveness on high-dimensional and large data sets. We believe there is still ample room for improvement in outlying aspect mining, which offers many research opportunities for the future.

Acknowledgments

This work is supported by Federation University Research Priority Area (RPA) scholarship, awarded to Durgesh Samariya.

References

  • [1] F. Angiulli, F. Fassetti, G. Manco, and L. Palopoli (2017) Outlying property detection with numerical attributes. Data Mining and Knowledge Discovery 31(1), pp. 134–163.
  • [2] F. Angiulli, F. Fassetti, and L. Palopoli (2009) Detecting outlying properties of exceptional objects. ACM Transactions on Database Systems 34(1), pp. 7:1–7:62.
  • [3] T. R. Bandaragoda, K. M. Ting, D. Albrecht, F. T. Liu, Y. Zhu, and J. R. Wells (2017) Isolation-based anomaly detection using nearest-neighbor ensembles. Computational Intelligence, pp. 1–31.
  • [4] V. Barnett and T. Lewis (1984) Outliers in Statistical Data. 3rd edition, John Wiley and Sons, New York.
  • [5] S. Berchtold, D. A. Keim, and H. Kriegel (1996) The X-tree: an index structure for high-dimensional data. In Proceedings of the 22nd International Conference on Very Large Data Bases, VLDB '96, San Francisco, CA, USA, pp. 28–39.
  • [6] X. H. Dang, I. Assent, R. T. Ng, A. Zimek, and E. Schubert (2014) Discriminative features for identifying and interpreting outliers. In 2014 IEEE 30th International Conference on Data Engineering, pp. 88–99.
  • [7] L. Duan, G. Tang, J. Pei, J. Bailey, A. Campbell, and C. Tang (2015) Mining outlying aspects on numeric data. Data Mining and Knowledge Discovery 29(5), pp. 1116–1151.
  • [8] F. Y. Edgeworth (1887) XLI. On discordant observations. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 23(143), pp. 364–375.
  • [9] F. T. Liu, K. M. Ting, and Z. Zhou (2008) Isolation Forest. In 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422.
  • [10] B. Micenková, R. T. Ng, X. Dang, and I. Assent (2013) Explaining outliers by subspace separability. In 2013 IEEE 13th International Conference on Data Mining, pp. 518–527.
  • [11] S. Russell and P. Norvig (2009) Artificial Intelligence: A Modern Approach. 3rd edition, Prentice Hall Press, Upper Saddle River, NJ, USA.
  • [12] R. Rymon (1992) Search through systematic set enumeration. In Proceedings of the Third International Conference on Principles of Knowledge Representation and Reasoning, KR'92, Cambridge, MA, pp. 539–550.
  • [13] D. Samariya, K. M. Ting, and S. Aryal (2020) A new effective and efficient measure for outlying aspect mining. arXiv preprint arXiv:2004.13550.
  • [14] B. W. Silverman (1986) Density Estimation for Statistics and Data Analysis. Chapman & Hall, London.
  • [15] N. X. Vinh, J. Chan, J. Bailey, C. Leckie, K. Ramamohanarao, and J. Pei (2015) Scalable outlying-inlying aspects discovery via feature ranking. In Advances in Knowledge Discovery and Data Mining, Cham, pp. 422–434.
  • [16] N. X. Vinh, J. Chan, S. Romano, J. Bailey, C. Leckie, K. Ramamohanarao, and J. Pei (2016) Discovering outlying aspects in large datasets. Data Mining and Knowledge Discovery 30(6), pp. 1520–1555.
  • [17] J. R. Wells and K. M. Ting (2019) A new simple and efficient density estimator that enables fast systematic search. Pattern Recognition Letters 122, pp. 92–98.
  • [18] J. Zhang, M. Lou, T. W. Ling, and H. Wang (2004) HOS-Miner: a system for detecting outlying subspaces of high-dimensional data. In Proceedings of the Thirtieth International Conference on Very Large Data Bases, Volume 30, VLDB '04, pp. 1265–1268.