Outlier detection is one of the main data mining and machine learning tasks, whose goal is to single out anomalous observations, also called outliers (Aggarwal, 2013). Other data analysis approaches, such as classification, clustering or dependency detection, consider outliers as noise that must be eliminated. However, as pointed out in (Han and Kamber, 2001), “one person’s noise could be another person’s signal”; thus, outliers themselves are of great interest in different settings, e.g. fraud detection, ecosystem disturbances, intrusion detection, cybersecurity, and medical analysis, to name a few.
Data mining approaches to outlier detection can be classified as supervised, semi-supervised, and unsupervised (Hodge and Austin, 2004; Chandola et al., 2009). Supervised methods take as input data labeled as normal and abnormal and build a classifier. The challenge there is posed by the fact that abnormal data form a rare class. Semi-supervised methods, also called one-class classifiers or domain description techniques, take as input only normal examples and use them to identify anomalies. Unsupervised methods detect outliers in an input dataset by assigning a score or anomaly degree to each object.
A commonly accepted definition fitting the unsupervised setting is the following: “Given a set of data points or objects, find those objects that are considerably dissimilar, exceptional or inconsistent with respect to the remaining data” (Han and Kamber, 2001). Unsupervised outlier detection methods can be categorized into several approaches, each of which assumes a specific concept of outlier. The most popular families include statistical-based (Davies and Gather, 1993; Barnett and Lewis, 1994), deviation-based (Arning et al., 1996), distance-based (Knorr and Ng, 1998; Ramaswamy et al., 2000; Angiulli and Pizzuti, 2005; Angiulli and Fassetti, 2009), density-based (Breunig et al., 2000; Jin et al., 2001; Papadimitriou et al., 2003; Jin et al., 2006), reverse nearest neighbor-based (Hautamäki et al., 2004; Radovanović et al., 2015), angle-based (Kriegel et al., 2008), isolation-based (Liu et al., 2012), subspace-based (Aggarwal and Yu, 2001; Angiulli et al., 2009; Keller et al., 2012), ensemble-based (Lazarevic and Kumar, 2005; Aggarwal and Sathe, 2017) and others (Chandola et al., 2012; Zimek et al., 2012; Aggarwal, 2013; Akoglu et al., 2015).
This work focuses on unsupervised outlier detection in the full feature space. In particular, we present a novel notion of outlier, the Concentration Free Outlier Factor (CFOF), whose peculiarity is to resist the distance concentration phenomenon, which is part of the so-called curse of dimensionality problem (Bellman, 1961; Demartines, 1994; Beyer et al., 1999; Chávez et al., 2001; Francois et al., 2007; Angiulli, 2018). Specifically, the term distance concentration refers to the tendency of distances to become almost indiscernible as dimensionality increases. This phenomenon may greatly affect the quality and performance of data mining, machine learning, and information retrieval techniques, since all these techniques rely on the concept of distance, or dissimilarity, among data items in order to retrieve or analyze information. Whereas low-dimensional spaces show good agreement between geometric proximity and the notion of similarity, as dimensionality increases, counterintuitive phenomena like distance concentration and hubness may be harmful to traditional techniques. In fact, the concentration problem also affects outlier scores of different families due to the specific role played by distances in their formulation.
This characteristic of high dimensional data has generated, in data analysis and data management applications, the need for dimensionality-resistant notions of similarity, that is, similarities not affected by the poor separation between the furthest and the nearest neighbor in high dimensional space. Among the desiderata that a good distance resistant to dimensionality should possess are to be contrasting and statistically sensitive, that is meaningfully refractory to concentration, and to be compact, or efficiently computable in terms of time and space (Aggarwal, 2001).
In the context of unsupervised outlier detection, (Zimek et al., 2012) identified different issues related to the treatment of high-dimensional data, among which the concentration of scores, in that derived outlier scores become numerically similar, the interpretability of scores, that is the fact that the scores often no longer convey a semantic meaning, and hubness, the fact that certain points occur more frequently in neighbor lists than others (Aucouturier and Pachet, 2008; Radovanovic et al., 2009; Angiulli, 2018).
Specifically, consider the number $N_k(x)$ of observed points that have $x$ among their $k$ nearest neighbors, also called $k$-occurrences or reverse $k$-nearest neighbor count, or RNNc for short, in the following. It is known that in low dimensional spaces the distribution of $N_k$ can be modeled by the node in-degrees of the $k$-nearest neighbor graph, which follows the Erdős-Rényi graph model (Erdős and Rényi, 1959). However, it has been observed that, as the dimensionality increases, the distribution of $N_k$ becomes skewed to the right, resulting in the emergence of hubs, which are points whose reverse nearest neighbor counts tend to be meaningfully larger than those associated with any other point.
Thus, the circumstance that the outlier scores tend to be similar poses some challenges in terms of their intelligibility, absence of a clear separation between outliers and inliers, and loss of efficiency of pruning rules aiming at reducing the computational effort.
The CFOF score is a reverse nearest neighbor-based score. Loosely speaking, it measures how many nearest neighbors have to be taken into account in order for a point to be close to a sufficient fraction of the data population. We notice that this way of perceiving the abnormality of an observation is completely different from any other notion introduced so far. In the literature, there are other outlier detection approaches resorting to reverse nearest neighbor counts (Hautamäki et al., 2004; Jin et al., 2006; Lin et al., 2008; Radovanović et al., 2015). Methods such as INFLO (Jin et al., 2006) are density-based techniques considering both direct and reverse nearest neighbors when estimating the outlierness of a point. Early reverse nearest neighbor-based approaches, namely ODIN (Hautamäki et al., 2004), which uses $N_k(x)$ as the outlier score of $x$, and the one proposed in (Lin et al., 2008), which returns as outliers those points having , are prone to the hubness phenomenon, that is the concentration of the scores towards the values associated with outliers, due to the direct use of the function $N_k$. Hence, to mitigate the hubness effect, (Radovanović et al., 2015) proposed a simple heuristic method, namely AntiHub, which refines the scores produced by the ODIN method by returning the weighted mean of the sum of the scores of the neighbors of the point and of the score of the point itself.
In this work we show, both empirically and theoretically, that the CFOF outlier score introduced here complies with all of the above mentioned desiderata. As a main contribution, we formalize the notion of concentration of outlier scores, and theoretically prove that CFOF does not concentrate in the Euclidean space for any arbitrarily large dimensionality $d$. To the best of our knowledge, there are no other proposals of outlier detection measures, and probably also of other data analysis measures related to the Euclidean distance, for which theoretical evidence has been provided that they are immune to the concentration effect.
We recognize that the kurtosis $\kappa$ of the data population, a well-known measure of tailedness of a probability distribution originating with Karl Pearson (Pearson, 1905; Fiori and Zenga, 2009; Westfall, 2014), is a key parameter for characterizing, from the outlier detection perspective, the unknown distribution underlying the data, a fact that has been neglected at least within the data mining literature. The kurtosis may range from $\kappa = 1$, for platykurtic distributions such as the Bernoulli distribution with $0.5$ success probability, to $\kappa = \infty$, for extreme leptokurtic or heavy-tailed distributions. Each outlier score must concentrate for $\kappa = 1$ due to the absolute absence of outliers. We prove that CFOF does not concentrate for any $\kappa > 1$.
We determine the closed form of the distribution of the CFOF scores for arbitrarily large dimensionalities and show that the CFOF score of a point $x$ depends, other than on the parameter $\varrho$ employed, on its squared norm standard score $z_x$ and on the kurtosis $\kappa$ of the data distribution. The squared norm standard score of a data point is the standardized squared norm of the point under the assumption that the origin of the feature space coincides with the mean of the distribution generating the points. We point out that the knowledge of the theoretical distribution of an outlier score is a rare, if not unique, peculiarity. We prove that the probability of observing larger scores increases with the kurtosis.
As for the hubness phenomenon, by exploiting the closed form of the CFOF score distribution, we provide evidence that CFOF does not suffer from the hubness problem, since points associated with the largest scores always correspond to a small fraction of the data. Moreover, while previously known RNNc scores present large false positive rates for values of the parameter $k$ which are not comparable with $n$, CFOF is able to establish a clear separation between outliers and inliers for any value of the parameter $\varrho$.
We theoretically prove that the CFOF score is both translation and scale-invariant. This allows us to establish that CFOF has connections with local scores. Indeed, if we consider a dataset consisting of multiple translated and scaled copies of the same seed cluster, the set of the CFOF outliers consists of the same points from each cluster. More generally, in the presence of clusters having different generating distributions, the number of outliers coming from each cluster is directly proportional to its size and to its kurtosis, a property that we call semi–locality.
As an equally important contribution, the design of the novel outlier score and the study of its theoretical properties allowed us to shed light also on different properties of well-known outlier detection scores.
First, we determine that the semi–locality is a peculiarity of reverse nearest neighbor counts. This discovery clarifies the exact nature of the reverse nearest neighbor family of outlier scores: while in the literature this family of scores has been observed to be adaptive to different density levels, the exact behavior of this adaptivity was unclear till now.
Second, we identify the property that each outlier score which is monotone increasing with respect to the squared norm standard score must possess in order to avoid concentration. We leverage this property to formally show that classic distance-based and density-based outlier scores are subject to concentration both for bounded and unbounded dataset sizes $n$, and both for fixed and variable values of the parameter $k$. Moreover, the convergence rate towards concentration of these scores is inversely proportional to the kurtosis of the data.
Third, as a theoretical confirmation of the proneness of $N_k$ to false positives, we show that the ratio between the amount of variability of the CFOF outlier scores and that of the RNNc outlier scores amounts to several orders of magnitude and, moreover, that this ratio increases with the kurtosis.
Local outlier detection methods, showing adaptivity to different density levels, are usually identified in the literature with those methods that compute the output scores by comparing the neighborhood of each point with the neighborhood of its neighbors. We point out that, as far as CFOF is concerned, its local behavior is obtained without the need to explicitly perform such a comparison. Rather, since from a conceptual point of view computing CFOF scores can be assimilated to estimating a probability, we show that CFOF scores can be reliably computed by exploiting sampling techniques. The reliability of this approach descends from the fact that CFOF outliers are the points less prone to bad estimations.
Specifically, to deal with very large and high-dimensional datasets, we introduce the fast-CFOF technique, which exploits sampling to avoid the computation of exact nearest neighbors and, hence, from the computational point of view does not suffer from the dimensionality curse affecting (reverse) nearest neighbor search techniques. The cost of fast-CFOF is linear both in the dataset size and dimensionality. The fast-CFOF algorithm is efficiently parallelizable, and we provide a multi-core (MIMD) vectorized (SIMD) implementation.
The algorithm has a unique parameter $\varrho$, representing a fraction of the data population. The fast-CFOF algorithm supports multi-resolution analysis regarding the dataset at different scales, since different $\varrho$ values can be managed simultaneously by the algorithm, with no additional computational effort.
Experimental results highlight that fast-CFOF is able to achieve very good accuracy with reduced sample sizes and, hence, to efficiently process huge datasets. Moreover, since its asymptotic cost does not depend on the actual value of the parameter $\varrho$, CFOF can efficiently manage even large values of this parameter, a property which is considered a challenge for different existing outlier methods. Furthermore, experiments involving the CFOF score witness the absence of concentration on real data, show that CFOF achieves excellent accuracy on distribution data, and that CFOF is likely to admit configurations ranking the outliers better than other approaches on labelled data.
The study of theoretical properties is conducted by considering the Euclidean distance as dissimilarity measure, but from (Angiulli, 2018) it is expected that they are also valid for any Minkowski metric. Moreover, it is worth noting that the applicability of the technique is not confined to the Euclidean space or to vector spaces: it can be applied both in metric and non-metric spaces equipped with a distance function. Finally, while effectiveness and efficiency of the method do not deteriorate with the dimensionality, its application is perfectly reasonable even in low dimensions.
We believe the CFOF technique and the properties presented in this work provide insights within the scenario of outlier detection and, more generally, of high-dimensional data analysis.
The rest of the work is organized as follows. Section 2 introduces the CFOF score and provides empirical evidence of its behavior. Section 3 studies theoretical properties of the CFOF outlier score. Section 4 presents the fast-CFOF algorithm for detecting outliers in large high-dimensional datasets. Section 5 describes experiments involving the proposed approach. Finally, Section 6 draws conclusions and depicts future work.
2. The Concentration Free Outlier Factor
In this section, we introduce the Concentration Free Outlier Factor (CFOF), a novel outlier detection measure.
After presenting the definition of the CFOF score (see Section 2.1), we provide empirical evidence of its behavior by discussing its relationship with the distance concentration phenomenon (see Section 2.2) and with the hubness phenomenon (see Section 2.3). Theoretical properties of CFOF will be taken into account in the subsequent Section 3.
Let $DS$ denote a dataset of $n$ points, also said objects, belonging to an object space $\mathbb{U}$ equipped with a distance function $\mathrm{dist}$. In the following, we assume that $\mathbb{U}$ is a vector space of the form $\mathbb{U} = \mathbb{D}^d$, where $d$, the dimensionality of the space, is a positive natural number and $\mathbb{D}$ is usually the set $\mathbb{R}$ of real numbers. However, we point out that the method can be applied in any object space equipped with a distance function (not necessarily a metric).
Given a dataset object $x$ and a positive integer $k$, the $k$-th nearest neighbor of $x$ is the dataset object $nn_k(x)$ such that there exist exactly $k-1$ dataset objects lying at distance smaller than $\mathrm{dist}(x, nn_k(x))$ from $x$. It always holds that $x = nn_1(x)$. We assume that ties are non-deterministically ordered.
The $k$ nearest neighbors set $NN_k(x)$ of $x$, where $k$ is also said the neighborhood width, is the set of objects $\{ nn_i(x) \mid 1 \le i \le k \}$.
By $N_k(x)$ we denote the number of objects having $x$ among their $k$ nearest neighbors:
$$N_k(x) = |\{ y \in DS : x \in NN_k(y) \}|,$$
also referred to as $k$-occurrences function or reverse neighborhood size or reverse $k$-nearest neighbor count or RNNc, for short.
Definition 2.1 (CFOF outlier score).
Given a parameter $\varrho \in (0,1)$, the Concentration Free Outlier Factor, also referred to as CFOF (or $\varrho$–CFOF if the value of the parameter $\varrho$ is not clear from the context), is defined as:
$$\mathrm{CFOF}(x) = \min_{1 \le k' \le n} \left\{ \frac{k'}{n} \;:\; N_{k'}(x) \ge n\varrho \right\}.$$
Thus, the CFOF score of $x$ represents the smallest neighborhood width, normalized with respect to $n$, for which $x$ exhibits a reverse neighborhood of size at least $n\varrho$.
The CFOF score belongs to the interval $[0,1]$. In some cases, we will use absolute CFOF score values, ranging from $1$ to $n$.
For complying with existing outlier detection measures that employ the neighborhood width $k$ as an input parameter, when we refer to the input parameter $k$, we assume that, as far as CFOF is concerned, $k$ represents a shorthand for the parameter $\varrho = k/n$.
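As an illustration of Definition 2.1, the following is a minimal brute-force sketch, not the fast-CFOF algorithm of Section 4. It assumes the Euclidean distance, that each point is its own first nearest neighbor, and that ties are broken by index order; the function name `cfof_scores` and the toy dataset are hypothetical. It exploits the fact that $x \in NN_k(y)$ exactly when the rank of $x$ in the neighbor list of $y$ is at most $k$, so the smallest width with $N_k(x) \ge n\varrho$ is the $\lceil n\varrho \rceil$-th smallest rank that $x$ receives.

```python
import numpy as np

def cfof_scores(data: np.ndarray, rho: float) -> np.ndarray:
    """Brute-force CFOF scores (quadratic time and memory; illustrative only)."""
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    # Pairwise squared Euclidean distances (same ordering as Euclidean distances).
    sq = np.sum(data ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * (data @ data.T)
    np.fill_diagonal(dist, -1.0)  # assumption: each point is its own 1st neighbor
    # rank[y, x] = position of x in the neighbor list of y (1 = nearest).
    order = np.argsort(dist, axis=1, kind="stable")
    rank = np.empty((n, n), dtype=np.int64)
    rank[np.arange(n)[:, None], order] = np.arange(1, n + 1)[None, :]
    # x is in NN_k(y) iff rank[y, x] <= k, so the smallest k with
    # N_k(x) >= n * rho is the m-th smallest entry of column x, m = ceil(n * rho).
    m = max(1, int(np.ceil(n * rho - 1e-9)))  # small tolerance against float noise
    return np.sort(rank, axis=0)[m - 1, :] / n

# Toy one-dimensional dataset: three nearby points and one isolated point.
points = np.array([[0.0], [1.0], [3.0], [10.0]])
scores = cfof_scores(points, rho=0.5)
```

On this toy dataset the isolated point at 10 receives score 1.0, while the other three points receive score 0.5, matching the intuition that an outlier needs a large neighborhood width before enough objects count it among their neighbors.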
Figure 1 illustrates the computation of the CFOF score on a two-dimensional example dataset.
Intuitively, the CFOF score measures how many neighbors have to be taken into account in order for the object to be considered close by an appreciable fraction of the dataset objects. We point out that this kind of notion of perceiving the abnormality of an observation is completely different from any other notion so far introduced in the literature.
In particular, the point of view here is in some sense reversed with respect to distance-based outliers, since we are interested in determining the smallest neighborhood width for which the object is a neighbor of at least $n\varrho$ other objects, while distance-based outliers (and, specifically, the definition considering the distance from the $k$th nearest neighbor) determine the smallest radius of a region centered in the object which contains at least $k$ other objects.
2.2. Relationship with the distance concentration phenomenon
One of the main peculiarities of the CFOF definition is its resistance to the distance concentration phenomenon, which is part of the so-called curse of dimensionality problem (Demartines, 1994; Beyer et al., 1999; Francois et al., 2007; Angiulli, 2018). As already recalled, the term curse of dimensionality is used to refer to difficulties arising when high-dimensional data must be taken into account, and one of the main aspects of this curse is distance concentration, that is the tendency of distances to become almost indiscernible as dimensionality increases.
In this scenario, (Demartines, 1994) has shown that the expectation of the Euclidean distance of i.i.d. random vectors increases as the square root of the dimension, whereas its variance tends toward a constant. This implies that high-dimensional points appear to be distributed around the surface of a sphere and distances between pairs of points tend to be similar: according to (Angiulli, 2018), the expected squared Euclidean distance of the points from their mean is $d\sigma^2$ and the expected squared Euclidean inter-point distance is $2d\sigma^2$, where $\sigma$ is the standard deviation of the random variable used to generate the points.
The distance concentration phenomenon is usually characterized in the literature by means of a ratio between some measure related to the spread and some measure related to the magnitude of the distances. In particular, the conclusion is that there is concentration when the above ratio converges to $0$ as the dimensionality tends to infinity. The relative variance (Francois et al., 2007) is a measure of concentration for distributions, corresponding to the ratio between the standard deviation and the expected value of the distance between pairs of points. In the case of the Euclidean distance, or of any other Minkowski metric, the relative variance of data points generated by i.i.d. random vectors tends to zero as the dimensionality tends to infinity, independently of the sample size. As a consequence, the separation between the nearest neighbor and the farthest neighbor of a given point tends to become increasingly indistinct as the dimensionality increases.
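The behavior of the relative variance can be checked numerically. The sketch below is only an illustration under stated assumptions: points are drawn with i.i.d. uniform coordinates, the sample size, seed, and dimensionalities are arbitrary choices, and the helper name `relative_variance` is hypothetical.

```python
import numpy as np

def relative_variance(d: int, n: int = 500, seed: int = 0) -> float:
    """Ratio between standard deviation and mean of Euclidean inter-point distances."""
    rng = np.random.default_rng(seed)
    # n independent pairs of points with i.i.d. uniform coordinates in [0, 1).
    x = rng.random((n, d))
    y = rng.random((n, d))
    dist = np.linalg.norm(x - y, axis=1)
    return float(dist.std() / dist.mean())

# The ratio shrinks as the dimensionality grows, i.e. distances concentrate.
ratios = {d: relative_variance(d) for d in (1, 10, 100, 1000)}
```

The mean distance grows roughly as the square root of $d$ while the spread stays of the same order, so the ratio decreases steadily across the four dimensionalities.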
Due to the specific role played by distances in their formulation, the concentration problem also affects outlier scores.
To illustrate, Figure 2 shows the ratio between the standard deviation and the mean of different families of outlier scores, namely the distance-based method aKNN (Angiulli and Pizzuti, 2005), the density-based method LOF (Breunig et al., 2000), the angle-based method ABOF (Kriegel et al., 2008), and CFOF, associated with a family of uniformly distributed (Figure 2a) and normally distributed (Figure 2b) datasets having fixed size and increasing dimensionality , for .
Results for each dimensionality value are obtained by (i) considering ten randomly generated different datasets, (ii) computing outlier scores associated with each dataset, (iii) sorting scores of each dataset, and (iv) taking the average value for each rank position. The figure highlights that, except for CFOF, the other three scores, belonging to three different families of techniques, exhibit a concentration effect.
Figure 3 reports the sorted scores of the uniform datasets discussed above. For aKNN (Figure 3a) the mean score value rises while the spread stays limited. For LOF (Figure 3b) all the values tend to $1$ as the dimensionality increases. For ABOF (Figure 3c) both the mean and the standard deviation decrease by several orders of magnitude, with the latter varying at a faster rate than the former. As for CFOF (Figure 3d), the score distributions for are very close and exhibit only small differences.
2.3. Relationship with the hubness phenomenon
It descends from its definition that CFOF has connections with the reverse neighborhood size, a tool which has also been used for characterizing outliers. The ODIN method (Hautamäki et al., 2004) uses the reverse neighborhood size $N_k(\cdot)$ as an outlier score, which we also refer to as RNN count, or RNNc for short. Outliers are those objects associated with the smallest RNN counts. However, it is well-known that the function $N_k(\cdot)$ suffers from a peculiar problem known as hubness (Aucouturier and Pachet, 2008; Radovanovic et al., 2009; Angiulli, 2018). As the dimensionality of the space increases, the distribution of $N_k(\cdot)$ becomes skewed to the right with increasing variance, leading to a very large number of objects showing very small RNN counts. Thus, the number of antihubs, that are objects appearing in a much smaller number of nearest neighbor sets (possibly they are neighbors only of themselves), overcomes the number of hubs, that are objects that appear in many more nearest neighbor sets than other points, and, according to the RNNc score, the vast majority of the dataset objects become outliers with identical scores.
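The skew of the reverse neighbor counts can be observed with a small experiment. The following sketch is an illustration only: it assumes i.i.d. standard normal data, the Euclidean distance, and each point counted as its own first neighbor; the dataset size, the value of $k$, the two dimensionalities, and the helper names `reverse_nn_counts` and `skewness` are arbitrary or hypothetical choices.

```python
import numpy as np

def reverse_nn_counts(data: np.ndarray, k: int) -> np.ndarray:
    """N_k(x) for every point x: how many points list x among their k nearest neighbors."""
    n = data.shape[0]
    sq = np.sum(data ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * (data @ data.T)
    np.fill_diagonal(dist, -1.0)  # assumption: each point is its own 1st neighbor
    # Indices of the k nearest neighbors of each point (order within the k is irrelevant).
    nn = np.argpartition(dist, k - 1, axis=1)[:, :k]
    return np.bincount(nn.ravel(), minlength=n)

def skewness(values: np.ndarray) -> float:
    """Third standardized moment of a sample."""
    c = np.asarray(values, dtype=float)
    c = c - c.mean()
    return float(np.mean(c ** 3) / np.mean(c ** 2) ** 1.5)

rng = np.random.default_rng(0)
k = 10
counts_low = reverse_nn_counts(rng.standard_normal((1000, 3)), k)
counts_high = reverse_nn_counts(rng.standard_normal((1000, 100)), k)
skew_low, skew_high = skewness(counts_low), skewness(counts_high)
```

Since every point contributes exactly $k$ neighbors, the counts always sum to $nk$; what changes with the dimensionality is how unevenly this total is shared, which the larger skewness of the high-dimensional counts reflects.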
Here we provide empirical evidence that CFOF does not suffer from the hubness problem, while we refer the reader to Section 3 for the formal demonstration of this property.
Figures 4a and 4b report the distribution of the $N_k$ value and of the CFOF absolute score for a ten thousand dimensional uniform dataset. Notice that CFOF outliers are associated with the largest score values, hence with the tails of the distribution, while RNNc outliers are associated with the smallest score values, hence with the largely populated region of the score distribution, a completely opposite behavior.
To illustrate the impact of the hubness problem with the dimensionality, Figures 4c and 4d show the cumulated frequency associated with the normalized, between $0$ and $1$, increasing scores. The normalization has been implemented to ease comparison. As for CFOF, values have been obtained as . As for RNNc, values have been obtained as .
The curves clarify the deep difference between the two approaches. Here both and are held fixed, while is increasing (, the curve for is omitted for readability, since it is very close to ). As for RNNc, the hubness problem is already evident for , where objects with a normalized score correspond to about the of the dataset, while the curve for closely resembles that for , where the vast majority of the dataset objects have a normalized score close to . As for CFOF, the number of points associated with large score values always corresponds to a very small fraction of the dataset population.
3. Concentration free property of CFOF
In this section we theoretically ascertain properties of the CFOF outlier score.
The rest of the section is organized as follows. Section 3.1 introduces the notation exploited throughout the section and some basic definitions. Section 3.2 recalls the concept of kurtosis. Section 3.3 derives the theoretical cumulative distribution function of the CFOF score. Section 3.4 provides the definition of concentration of outlier scores together with the proof that the CFOF score does not concentrate, and Section 3.5 discusses the effect of the data kurtosis on the distribution of CFOF outlier scores. Section 3.6 studies the behavior of CFOF in the presence of different distributions and establishes its semi–locality property. Finally, Section 3.7 studies the concentration properties of distance-based and density-based outlier scores, and Section 3.8 studies the concentration properties of reverse nearest neighbor-based outlier scores.
The concept of intrinsic dimensionality is related to the analysis of independent and identically distributed (i.i.d.) data. Although the variables used to identify each datum may not be statistically independent, ultimately, the intrinsic dimensionality of the data is identified as the minimum number of variables needed to represent the data itself (van der Maaten et al., 2009). This corresponds in linear spaces to the number of linearly independent vectors needed to describe each point. Indeed, if random vector components are not independent, the concentration phenomenon is still present provided that the actual number of “degrees of freedom” is sufficiently large (Demartines, 1994). Thus, results derived for i.i.d. data continue to be valid provided that the dimensionality is replaced with the actual number of degrees of freedom.
In the following we use $d$-dimensional i.i.d. random vectors as a model of intrinsically high dimensional space. Specifically, boldface uppercase letters with the symbol “$(d)$” as a superscript, such as $X^{(d)}$, $Y^{(d)}$, $Z^{(d)}$, denote $d$-dimensional random vectors taking values in $\mathbb{R}^d$. The components $X_i$ ($1 \le i \le d$) of a random vector $X^{(d)} = (X_1, \ldots, X_d)$ are random variables having pdfs $f_{X_i}$ (cdfs $F_{X_i}$). A random vector is said to be independent and identically distributed (i.i.d.) if its components are independent random variables having common pdf $f_X$ (cdf $F_X$). The generic component $X_i$ of $X^{(d)}$ is also referred to as $X$ when its position does not matter.
Lowercase letters with “$(d)$” as a superscript, such as $x^{(d)}$, $y^{(d)}$, $z^{(d)}$, denote a specific $d$-dimensional vector taking value in $\mathbb{R}^d$.
Given a random variable $X$, $\mu_X$ and $\sigma_X$ denote the mean and standard deviation of $X$. The symbol $\mu_k(X)$, or simply $\mu_k$ when $X$ is clear from the context, denotes the $k$th central moment $E[(X - \mu_X)^k]$ of $X$ ($k \in \mathbb{N}$). When we use moments, we assume that they exist and are finite.
A sequence of independent non-identically distributed random variables having non-null variances and finite central moments () is said to have comparable central moments if there exist positive constants and . Intuitively, this guarantees that the ratio between the greatest and the smallest non-null moment remains limited.
An independent non-identically distributed random vector whose components have comparable central moments, can be treated as an i.i.d. random vector whose generic component is such that the -th degree () of its th central moment (), that is , is given by the average of the -th degree of the central moments of its components (Angiulli, 2018), defined as follows
This means that all the results given for i.i.d. $d$-dimensional random vectors can be immediately extended to $d$-dimensional independent non-identically distributed random vectors having comparable central moments and, more generally, to the wider class of real-life data having degrees of freedom with comparable central moments.
$\Phi(\cdot)$ ($\phi(\cdot)$, resp.) denotes the cdf (pdf, resp.) of the standard normal distribution.
Results given for distributions in the following can be applied to finite sets of points by assuming large samples.
Definition 3.1 (Squared norm standard score).
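Let $x^{(d)}$ be a realization of a random vector $X^{(d)}$. Then, $z_{x,X^{(d)}}$ denotes the squared norm standard score of $x^{(d)}$, that is
$$z_{x,X^{(d)}} = \frac{\| x^{(d)} - \mu_X^{(d)} \|^2 - \mu_{\| X^{(d)} - \mu_X^{(d)} \|^2}}{\sigma_{\| X^{(d)} - \mu_X^{(d)} \|^2}},$$
where $\mu_X^{(d)} = (\mu_{X_1}, \ldots, \mu_{X_d})$ denotes the mean vector of $X^{(d)}$. Clearly, if all the components of the mean vector $\mu_X^{(d)}$ assume the same value $\mu_X$, e.g. as in the case of i.i.d. vectors, then $\mu_X^{(d)}$ can be replaced by $\mu_X$. The notation $z_x$ is used as a shorthand for $z_{x,X^{(d)}}$ whenever $X^{(d)}$ is clear from the context.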
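For a finite dataset, the squared norm standard score of Definition 3.1 can be approximated by replacing the distribution quantities with sample estimates. The sketch below makes exactly this assumption: the mean vector and the mean and standard deviation of the squared norm are estimated from the data unless a mean is supplied. The function name `squared_norm_standard_scores` and the synthetic dataset are hypothetical.

```python
import numpy as np

def squared_norm_standard_scores(data, mean=None) -> np.ndarray:
    """Standardized squared distance of each point from the (estimated) mean vector."""
    x = np.asarray(data, dtype=float)
    # Assumption: the population mean is unknown and replaced by the sample mean.
    mu = x.mean(axis=0) if mean is None else np.asarray(mean, dtype=float)
    sq_norm = np.sum((x - mu) ** 2, axis=1)
    # Assumption: mean and standard deviation of the squared norm are sample estimates.
    return (sq_norm - sq_norm.mean()) / sq_norm.std()

rng = np.random.default_rng(0)
data = rng.standard_normal((2000, 500))  # i.i.d. components, high dimensionality
z = squared_norm_standard_scores(data)
```

By construction the scores have zero sample mean and unit sample standard deviation, and the point farthest from the mean vector receives the largest score.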
An outlier score function , or outlier score for simplicity, according to outlier definition , is a function that, given a set of objects, or dataset, (or , where denotes the power set of ), and an object , returns a real number in the interval , also said the outlier score (value) of . The notation is used to denote the outlier score of in a dataset whose elements are realizations of the random vector . The notation is used when or are clear from the context.
Definition 3.2 (Outliers).
Given parameter , the top- outliers, or simply outliers whenever is clear from the context, in a dataset of points according to outlier definition , are the points of associated with the largest values of score . (Footnote 1: Some definitions associate outliers with the smallest values of score . In these cases we assume to replace with having outlier score .)
The kurtosis is a measure of the tailedness of the probability distribution of a real-valued random variable, originating with Karl Pearson (Pearson, 1905; Fiori and Zenga, 2009; Westfall, 2014). Specifically, given random variable $X$, the kurtosis $\kappa_X$, or simply $\kappa$ whenever $X$ is clear from the context, is the fourth standardized moment of $X$:
$$\kappa_X = E\left[ \left( \frac{X - \mu_X}{\sigma_X} \right)^4 \right] = \frac{\mu_4(X)}{\mu_2(X)^2}.$$
Higher kurtosis is the result of infrequent extreme deviations or outliers, as opposed to frequent modestly sized deviations. Indeed, since kurtosis is the expected value of the standardized data raised to the fourth power, data within one standard deviation of the mean contribute practically nothing to kurtosis (note that raising a number that is less than $1$ to the fourth power makes it closer to zero), while the data values that almost totally contribute to kurtosis are those outside the above region, that is the outliers.
The lower bound is realized by the Bernoulli distribution with $0.5$ success probability, having kurtosis $\kappa = 1$. Note that extreme platykurtic distributions, that is those having kurtosis $\kappa = 1$, have no outliers. However, there is no upper limit to the kurtosis of a general probability distribution, and it may be infinite.
The kurtosis of any univariate normal distribution is $3$, regardless of the values of its parameters. It is common to compare the kurtosis of a distribution to this value. Distributions with kurtosis equal to $3$ are called mesokurtic, or mesokurtotic. Distributions with kurtosis less than $3$ are said to be platykurtic. These distributions produce fewer and less extreme outliers than does the normal distribution. An example of a platykurtic distribution is the uniform distribution, which has kurtosis $1.8$. Distributions with kurtosis greater than $3$ are said to be leptokurtic. An example of a leptokurtic distribution is the Laplace distribution, which has tails that asymptotically approach zero more slowly than a normal distribution, and therefore produces more outliers than the normal distribution.
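The kurtosis values recalled above can be checked on samples. The sketch below estimates the fourth standardized moment from synthetic data; the sample size, the seed, and the helper name `kurtosis` are arbitrary or hypothetical choices, and the estimates are only expected to approach the theoretical values ($1$ for the Bernoulli distribution with $0.5$ success probability, $1.8$ for the uniform, $3$ for the normal, and $6$ for the Laplace distribution).

```python
import numpy as np

def kurtosis(sample) -> float:
    """Sample estimate of the fourth standardized moment (non-excess kurtosis)."""
    x = np.asarray(sample, dtype=float)
    c = x - x.mean()
    return float(np.mean(c ** 4) / np.mean(c ** 2) ** 2)

rng = np.random.default_rng(0)
n = 200_000
samples = {
    "bernoulli(0.5)": rng.integers(0, 2, n),  # extreme platykurtic
    "uniform": rng.random(n),                 # platykurtic
    "normal": rng.standard_normal(n),         # mesokurtic
    "laplace": rng.laplace(size=n),           # leptokurtic
}
estimates = {name: kurtosis(s) for name, s in samples.items()}
```

The estimates are ordered from the Bernoulli to the Laplace sample, mirroring the progression from distributions with no outliers to heavier-tailed ones.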
3.3. Theoretical cdf of the CFOF score
Next, we derive the theoretical cdf and pdf of the CFOF outlier score together with the expected score associated with a generic realization of an i.i.d. random vector.
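Theorem 3.3.
Let $DS^{(d)}$ be a dataset consisting of $n$ realizations of an i.i.d. random vector $X^{(d)}$ and let $x^{(d)}$ be an element of $DS^{(d)}$. For arbitrarily large dimensionalities $d$, the expected value and the cdf of the CFOF score are
$$\mathrm{CFOF}(x^{(d)}) \approx \Phi\!\left(\frac{z_x\sqrt{\kappa-1} + 2\,\Phi^{-1}(\varrho)}{\sqrt{\kappa+3}}\right), \tag{4}$$
$$\Pr\!\left[\mathrm{CFOF}(X^{(d)}) \le s\right] \approx \Phi\!\left(\frac{\Phi^{-1}(s)\sqrt{\kappa+3} - 2\,\Phi^{-1}(\varrho)}{\sqrt{\kappa-1}}\right), \tag{5}$$
where $\kappa$ denotes the kurtosis of the random variable $X$, $z_x$ is the squared norm standard score of $x^{(d)}$, $\varrho \in (0,1)$ is the CFOF parameter, and $\Phi$ is the cdf of the standard normal distribution.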
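To make the two expressions of the theorem concrete, the sketch below evaluates them with the Python standard library. It assumes that Equation (4) has the closed form $\Phi\big((z_x\sqrt{\kappa-1} + 2\Phi^{-1}(\varrho))/\sqrt{\kappa+3}\big)$ and that Equation (5) is its inverse relation in $z_x$; the function names and the parameter values $\kappa = 3$, $\varrho = 0.01$ are hypothetical choices made for illustration.

```python
from math import sqrt
from statistics import NormalDist

_STD = NormalDist()  # standard normal: cdf is Phi, inv_cdf is Phi^{-1}

def expected_cfof(z, kappa, rho):
    """Assumed form of Eq. (4): expected CFOF score at squared norm standard score z."""
    return _STD.cdf((z * sqrt(kappa - 1) + 2 * _STD.inv_cdf(rho)) / sqrt(kappa + 3))

def cfof_cdf(s, kappa, rho):
    """Assumed form of Eq. (5): probability of observing a CFOF score <= s."""
    if s <= 0.0:
        return 0.0
    if s >= 1.0:
        return 1.0
    num = _STD.inv_cdf(s) * sqrt(kappa + 3) - 2 * _STD.inv_cdf(rho)
    return _STD.cdf(num / sqrt(kappa - 1))

kappa, rho = 3.0, 0.01  # hypothetical values (normal data, small rho)
for z in (-2.0, 0.0, 2.0, 4.0):
    s = expected_cfof(z, kappa, rho)
    print(f"z = {z:+.1f}  CFOF = {s:.4f}  Pr[CFOF <= s] = {cfof_cdf(s, kappa, rho):.4f}")
```

Under these assumptions, plugging the score of a point with standard score $z$ into the cdf returns $\Phi(z)$, which is the consistency property the proof relies on.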
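It is shown in (Angiulli, 2018, Theorem 30) that, for arbitrarily large values of $d$, the expected number of $k$-occurrences of $x^{(d)}$ in a dataset consisting of $n$ realizations of the random vector $X^{(d)}$ is given by
$$N_k(x^{(d)}) \approx n\,\Phi\!\left(\frac{\Phi^{-1}(k/n)\sqrt{\mu_4 + 3\mu_2^2} - z_x\sqrt{\mu_4 - \mu_2^2}}{2\mu_2}\right).$$
Let $k_\varrho$ denote the smallest integer such that $N_{k_\varrho}(x^{(d)}) \ge n\varrho$. By exploiting the equation above it can be concluded that
$$\mathrm{CFOF}(x^{(d)}) = \frac{k_\varrho}{n} \approx \Phi\!\left(\frac{z_x\sqrt{\mu_4 - \mu_2^2} + 2\mu_2\,\Phi^{-1}(\varrho)}{\sqrt{\mu_4 + 3\mu_2^2}}\right).$$
Since $\kappa = \mu_4/\mu_2^2$, the expression of Equation (4) follows by expressing the moments $\mu_4$ and $\mu_2$ in terms of the kurtosis $\kappa$.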
For the CFOF score is constant and does not depend on and . However, for , since is a cdf, the CFOF score is monotone increasing with the standard score , thus if and only if .
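As for the cdf of the CFOF score, we need to determine for which values of $x^{(d)}$ the condition $\mathrm{CFOF}(x^{(d)}) \le s$ holds. By leveraging Equation (4),
$$\mathrm{CFOF}(x^{(d)}) \le s \iff z_x \le \frac{\Phi^{-1}(s)\sqrt{\kappa+3} - 2\,\Phi^{-1}(\varrho)}{\sqrt{\kappa-1}}.$$
Consider the squared norm $\|X^{(d)}\|^2 = \sum_{i=1}^{d} X_i^2$. Since the squared norm is the sum of $d$ i.i.d. random variables, as $d \to \infty$, by the Central Limit Theorem, the distribution of the standard score of $\|X^{(d)}\|^2$ tends to a standard normal distribution. This implies that $\|X^{(d)}\|^2$ approaches a normal distribution with mean $\mu_{\|X\|^2}$ and standard deviation $\sigma_{\|X\|^2}$. Hence, for each $z$,
$$\Pr\!\left[\frac{\|X^{(d)}\|^2 - \mu_{\|X\|^2}}{\sigma_{\|X\|^2}} \le z\right] \to \Phi(z).$$
Thus, the probability $\Pr[\mathrm{CFOF}(X^{(d)}) \le s]$ converges to $\Phi\!\left(\frac{\Phi^{-1}(s)\sqrt{\kappa+3} - 2\,\Phi^{-1}(\varrho)}{\sqrt{\kappa-1}}\right)$, and the expression of Equation (5) follows. ∎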
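The Central Limit Theorem step used in the proof can be illustrated numerically. The following sketch assumes standard normal coordinates (so that the second and fourth central moments are $1$ and $3$) and NumPy; the sample size, dimensionality, and seed are illustrative, not values used by the author.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)          # illustrative seed
n, d = 2000, 1000                       # illustrative sample size and dimensionality
X = rng.normal(size=(n, d))             # i.i.d. coordinates with null mean
sq_norm = np.sum(X**2, axis=1)          # squared norm of each realization

mu2, mu4 = 1.0, 3.0                     # central moments of N(0, 1)
mean_sq = d * mu2                       # mean of the squared norm
std_sq = np.sqrt(d * (mu4 - mu2**2))    # standard deviation of the squared norm
z = (sq_norm - mean_sq) / std_sq        # squared norm standard scores

for t in (-1.0, 0.0, 1.0, 2.0):
    print(f"Pr[z <= {t:+.0f}]  empirical = {np.mean(z <= t):.3f}  Phi = {NormalDist().cdf(t):.3f}")
```

For finite $d$ the squared norm is still slightly skewed, so the agreement with $\Phi$ is approximate and improves as $d$ grows.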
In the following, the cdf of the CFOF score will be denoted also as , or simply when and the random vector generating the dataset are clear from the context. As for the pdf of the CFOF score,
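A density can be obtained by differentiating the cdf with respect to $s$. Assuming that the cdf of Equation (5) is $\Phi(t(s))$ with $t(s) = (\Phi^{-1}(s)\sqrt{\kappa+3} - 2\Phi^{-1}(\varrho))/\sqrt{\kappa-1}$, the chain rule gives $\sqrt{(\kappa+3)/(\kappa-1)}\;\phi(t(s))/\phi(\Phi^{-1}(s))$, where $\phi$ is the standard normal pdf. The sketch below compares this expression with a central finite difference of the assumed cdf; the function names and parameter values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD = NormalDist()

def cfof_cdf(s, kappa, rho):
    """Assumed closed form of the CFOF cdf, valid for 0 < s < 1."""
    num = _STD.inv_cdf(s) * sqrt(kappa + 3) - 2 * _STD.inv_cdf(rho)
    return _STD.cdf(num / sqrt(kappa - 1))

def cfof_pdf(s, kappa, rho):
    """Derivative of the assumed cdf with respect to s (chain rule)."""
    a = _STD.inv_cdf(s)
    t = (a * sqrt(kappa + 3) - 2 * _STD.inv_cdf(rho)) / sqrt(kappa - 1)
    return sqrt((kappa + 3) / (kappa - 1)) * _STD.pdf(t) / _STD.pdf(a)

kappa, rho, h = 3.0, 0.01, 1e-6  # hypothetical parameters and step size
for s in (0.05, 0.2, 0.5):
    finite_diff = (cfof_cdf(s + h, kappa, rho) - cfof_cdf(s - h, kappa, rho)) / (2 * h)
    print(f"s = {s:.2f}  pdf = {cfof_pdf(s, kappa, rho):.6f}  finite difference = {finite_diff:.6f}")
```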
Figure 5 reports the theoretical cdf according to Equation (5) and the theoretical pdf according to Equation (8) of the CFOF score associated with an i.i.d. random vector uniformly distributed and normally distributed.
The following result provides a semantic characterization of CFOF outliers in high-dimensional spaces.
Theorem 3.4.
Let $DS^{(d)}$ be a dataset consisting of $n$ realizations of an i.i.d. random vector $X^{(d)}$. Then, for arbitrarily large dimensionalities $d$, the CFOF outliers of $DS^{(d)}$ are the points associated with the largest squared norm standard scores.
The result descends from the fact that the CFOF score is monotone increasing with the squared norm standard score, as shown in Theorem 3.3. ∎
3.4. Concentration of outlier scores
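Definition 3.5 (Concentration of outlier scores).
Let $Def$ be an outlier definition with outlier score function $sc_{Def}$. We say that the outlier score $sc_{Def}$ concentrates if, for any i.i.d. random vector $X^{(d)}$ having kurtosis $\kappa > 1$, there exists a family of realizations $x_0^{(d)}$ of $X^{(d)}$ ($d \in \mathbb{N}^{+}$) such that, for any $\epsilon \in (0,1)$, the following property holds (see footnote 2):
$$\lim_{d \to \infty} \Pr\!\left[\,\left|\frac{sc_{Def}(X^{(d)}) - sc_{Def}(x_0^{(d)})}{sc_{Def}(x_0^{(d)})}\right| \le \epsilon\right] = 1, \tag{9}$$
that is to say, the probability to observe a score value having relative distance not greater than $\epsilon$ from the reference score tends to $1$ as the dimensionality $d$ goes to infinity.
Footnote 2: Note that if the above property holds “for any $\epsilon \in (0,1)$” it is also the case that it holds “for any $\epsilon > 0$”. We preferred to use the former condition for symmetry with respect to the scores smaller than $sc_{Def}(x_0^{(d)})$. Indeed, for any $\epsilon \ge 1$ the absolute value function can be removed from Equation (9).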
Theorem 3.6.
Let be an outlier definition with outlier score function which is monotone increasing with respect to the squared norm standard score. Then, the outlier score concentrates if and only if, for any i.i.d. random vector having kurtosis and for any , the family of realizations of having the property that , is such that
where and satisfy the following conditions
As already pointed out in the proof of Theorem 3.3, as , the distribution of the standard score of the squared norm tends to a standard normal distribution and, hence, for each , . Since the score is by assumption monotone increasing with the squared norm standard score, for arbitrarily large values of and for any realization of ,
If concentrates, then outlier score values must converge towards the outlier score of the points whose squared norm standard score corresponds to the expected value of . Given , in order to hold
it must be the case that
where and are defined as in the statement of the theorem. The result then follows from the last condition. ∎
Now we show that the separation between the CFOF scores associated with outliers and the rest of the CFOF scores is guaranteed in arbitrarily large dimensionalities.
Theorem 3.7.
For any fixed $\varrho \in (0,1)$, the CFOF outlier score does not concentrate.
Note that for any fixed , is finite. Moreover, since , from Equation (4) it holds that . Hence, if and only if if and only if if and only if if and only if .
Thus, for arbitrary large dimensionalities , for each realization of there exists such that and, hence, the CFOF outlier score does not concentrate. ∎
Note that, with parameter , the CFOF score does not concentrate for both bounded and unbounded sample sizes.
Figure 6 compares the empirical CFOF scores associated with the squared norm standard score of a normal dataset for increasing values of $d$ and the theoretical CFOF associated with arbitrarily large dimensionalities ($d = \infty$). It can be seen that for large $d$ values the scores tend to comply with the value predicted by Equation (4). Moreover, as predicted by Theorem 3.7, the concentration of the CFOF scores is avoided in any dimensionality.
3.5. The effect of the data kurtosis
As recalled above, the kurtosis is such that $\kappa \ge 1$. For extreme platykurtic distributions (i.e., having $\kappa \to 1^{+}$) the CFOF score tends to $\varrho$, hence to a constant, while for extreme leptokurtic distributions (i.e., having $\kappa \to \infty$) the CFOF score tends to $\Phi(z_x)$, the cumulative distribution function of the standard normal distribution evaluated at the squared norm standard score.
Theorem 3.8.
Let $x^{(d)}$ be a realization of an i.i.d. random vector $X^{(d)}$ having, w.l.o.g., null mean. Then, for arbitrarily large dimensionalities $d$,
$$\lim_{\kappa \to 1^{+}} \mathrm{CFOF}(x^{(d)}) \approx \varrho, \qquad \lim_{\kappa \to \infty} \mathrm{CFOF}(x^{(d)}) \approx \Phi(z_x).$$
The two expressions can be obtained by exploiting the closed form of the CFOF score reported in Equation (4). ∎
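The two limiting regimes can be checked numerically. The sketch below assumes that Equation (4) has the closed form $\Phi\big((z_x\sqrt{\kappa-1} + 2\Phi^{-1}(\varrho))/\sqrt{\kappa+3}\big)$ and evaluates it for a kurtosis very close to $1$ and for a very large kurtosis; the function name and the values of $\varrho$ and $\kappa$ are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD = NormalDist()

def expected_cfof(z, kappa, rho):
    """Assumed form of Eq. (4): expected CFOF score at squared norm standard score z."""
    return _STD.cdf((z * sqrt(kappa - 1) + 2 * _STD.inv_cdf(rho)) / sqrt(kappa + 3))

rho = 0.01  # hypothetical parameter value
for z in (-1.0, 0.0, 2.0):
    near_one = expected_cfof(z, 1.0 + 1e-12, rho)  # extreme platykurtic regime
    huge = expected_cfof(z, 1e12, rho)             # extreme leptokurtic regime
    print(f"z = {z:+.1f}  kappa -> 1+: {near_one:.6f} (rho = {rho})  "
          f"kappa -> inf: {huge:.6f} (Phi(z) = {_STD.cdf(z):.6f})")
```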
Note that extreme platykurtic distributions have no outliers and, hence, are excluded by the definition of concentration of outlier score. For $\kappa = 1$, CFOF scores are constant, and this is consistent with the absolute absence of outliers.
Figure 7 reports the CFOF scores associated with different kurtosis values $\kappa$. Curves are obtained by leveraging Equation (5) for a fixed value of $\varrho$. The curves on the left represent the cumulative distribution function of the CFOF scores. For infinite kurtosis the CFOF scores are uniformly distributed between $0$ and $1$, since, from Equation (5), for $\kappa \to \infty$, $\Pr[\mathrm{CFOF}(X^{(d)}) \le s] \to s$. Moreover, the curves highlight that the larger the kurtosis of the data, the larger the probability to observe higher scores. The curves on the right represent the score values ranked in decreasing order. The abscissa reports the fraction of data points in logarithmic scale. These curves make it possible to visualize the fraction of data points whose score will be above a certain threshold.
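The qualitative behavior described for Figure 7 can be reproduced from the cdf alone. The sketch below assumes the closed form $\Phi\big((\Phi^{-1}(s)\sqrt{\kappa+3} - 2\Phi^{-1}(\varrho))/\sqrt{\kappa-1}\big)$ for Equation (5); the value $\varrho = 0.01$, the kurtosis grid, and the threshold $s = 0.5$ are hypothetical and are not the settings used for the figure.

```python
from math import sqrt
from statistics import NormalDist

_STD = NormalDist()

def cfof_cdf(s, kappa, rho):
    """Assumed form of Eq. (5), valid for 0 < s < 1."""
    num = _STD.inv_cdf(s) * sqrt(kappa + 3) - 2 * _STD.inv_cdf(rho)
    return _STD.cdf(num / sqrt(kappa - 1))

rho, threshold = 0.01, 0.5             # hypothetical values
kappas = (1.8, 3.0, 9.0, 100.0, 1e12)  # from platykurtic to (numerically) infinite

# Fraction of points expected to score above the threshold, per kurtosis value.
tails = [1.0 - cfof_cdf(threshold, k, rho) for k in kappas]
for k, t in zip(kappas, tails):
    print(f"kappa = {k:>8g}  Pr[CFOF > {threshold}] = {t:.6f}")

# For very large kurtosis the cdf approaches the identity (uniform scores).
for s in (0.1, 0.5, 0.9):
    print(f"s = {s:.1f}  cdf at kappa = 1e12: {cfof_cdf(s, 1e12, rho):.6f}")
```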
3.6. Behavior in presence of different distributions
Despite the apparent similarity of the CFOF score with distance-based scores, these two families of scores are deeply different: while distance-based outliers can be categorized among the global outlier scores, CFOF shows adaptivity to different density levels, a characteristic that makes it more similar to local outlier scores.
This characteristic depends in part on the fact that actual distance values are not employed in the computation of the score. Indeed, the CFOF score is invariant to all of the transformations that do not change the nearest neighbor ranking, such as translating the data or scaling the data.
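The invariance to translation and scaling can be observed on a small sample. The brute-force sketch below is only an illustration: it assumes that the CFOF score of a point is the smallest $k/n$ such that at least $\lceil n\varrho \rceil$ points have it among their $k$ nearest neighbors, with each point counted in its own neighborhood. The function name `cfof_scores`, the data, and the parameter values are hypothetical.

```python
import numpy as np

def cfof_scores(X, rho):
    """Brute-force CFOF sketch under the assumptions stated above (O(n^2) memory)."""
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    D = np.sum(diff**2, axis=2)                   # squared Euclidean distances
    order = np.argsort(D, axis=1, kind="stable")  # order[y, r]: (r+1)-th neighbor of y
    ranks = np.argsort(order, axis=1) + 1         # ranks[y, x]: rank of x in y's list
    m = int(np.ceil(n * rho))                     # required number of reverse neighbors
    k_rho = np.sort(ranks, axis=0)[m - 1, :]      # m-th smallest rank of x over all y
    return k_rho / n

rng = np.random.default_rng(0)
# Coordinates are rounded to a dyadic grid so that the shift and scale below
# are exact in floating point and the neighbor rankings match exactly.
X = np.round(rng.normal(size=(128, 5)) * 1024) / 1024
Y = 8.0 + 4.0 * X                                 # translated and scaled copy

rho = 1 / 16
scores_x = cfof_scores(X, rho)
scores_y = cfof_scores(Y, rho)
print("identical scores:", np.array_equal(scores_x, scores_y))
print("top-5 outliers:", np.argsort(-scores_x)[:5])
```

Only neighbor ranks enter the computation, which is why the two score vectors coincide while any score based on raw distance values would be rescaled.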
To illustrate, consider Figure 8 showing a dataset consisting of two normally distributed clusters, each consisting of points. The cluster centered in is obtained by translating and scaling by a factor the cluster centered in the origin. The top CFOF outliers for are highlighted. It can be seen that the outliers are the “same” objects of the two clusters. Notice that a similar behavior can be observed, that is outliers will emerge both in the sparser regions of the space and along the borders of clusters, also when the clusters are not equally populated, provided that they contain at least objects.
Now we provide theoretical evidence that the above discussed properties of the CFOF score are valid in arbitrarily large dimensionalities.
Definition 3.9 (Translation-invariant and homogeneous outlier score).
Let an i.i.d. random vector, let and , and let . An outlier score is said to be translation-invariant and homogeneous if, for all the realizations of , it holds that
Theorem 3.10.
For arbitrarily large dimensionalities $d$, the CFOF score is translation-invariant and homogeneous.
We note that the squared norm standard score of is identical to the squared norm standard score of . Indeed, the mean of is the vector , where (). As for . Analogously, , and and . Hence,
Moreover, we recall that the th central moment has the following two properties: (called translation-invariance), and (called homogeneity). Hence, . Since variables have identical central moments, by applying the above property to Equation (5), the statement eventually follows:
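The two properties of central moments used in the proof imply that the kurtosis is unchanged by an affine map $a + bX$. The sketch below checks this on a sample, assuming NumPy; the helper names, the constants $a$ and $b$, and the Laplace sample are hypothetical choices for illustration.

```python
import numpy as np

def central_moment(x, k):
    """k-th sample central moment."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** k)

def kurtosis(x):
    """Fourth central moment divided by the squared second central moment."""
    return central_moment(x, 4) / central_moment(x, 2) ** 2

rng = np.random.default_rng(0)
x = rng.laplace(size=100_000)
a, b = 7.5, 3.0        # hypothetical translation and scale
y = a + b * x

# Translation-invariance and homogeneity: mu_k(a + bX) = b^k mu_k(X).
print(central_moment(y, 2), b**2 * central_moment(x, 2))
print(central_moment(y, 4), b**4 * central_moment(x, 4))
# The factors b^4 cancel in the ratio, so the kurtosis is the same.
print(kurtosis(x), kurtosis(y))
```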
Next, we introduce the concept of i.i.d. mixture random vector as a tool for modeling an intrinsically high-dimensional dataset containing data populations having different characteristics.
An i.i.d. mixture random vector is a random vector defined in terms of i.i.d. random vectors with associated selection probabilities , respectively. Specifically, for each , with probability the random vector assumes value .
A dataset generated by an i.i.d. mixture random vector , consists of points partitioned into clusters composed of points, respectively, where each is formed by realizations of the random vector selected with probability ().
Given a set of clusters , we say that they are non-overlapping if, for each cluster and point , .
The following result clarifies how CFOF behaves in the presence of clusters having different densities.
Theorem 3.11.
Let be a dataset of non-overlapping clusters generated by the i.i.d. mixture random vector , let , let be the top- CFOF outliers of , and let be the partition of induced by the clusters of . Then, for arbitrary large dimensionalities , each consists of the top- –CFOF outliers of the dataset generated by the i.i.d. random vector (), where
and is such that .
Since the clusters are non-overlapping, the first direct and reverse neighbors of each point are points belonging to the same cluster of . Hence, being , the CFOF score of point depends only on the points within its cluster. Thus the score with respect to the whole dataset can be transformed into the score with respect the cluster . Analogously, the parameter with respect to the whole dataset can be transformed into the parameter with respect to the cluster , by requiring that , that is . Thus, the cdf of the CFOF score can be formulated as
Consider now the score value such that . Then the expected number of outliers from cluster is , from which the result follows. ∎
Thus, from the above result it can be concluded that the number of outliers coming from each cluster is related both to its generating distribution and to its relative size .
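The dependence on both the generating distribution and the relative size can be explored numerically. The sketch below rests on several assumptions that should be read as hypotheses rather than as the author's procedure: the cdf has the closed form assumed for Equation (5); a cluster with selection probability $\pi_i$ sees the within-cluster score $s/\pi_i$ and the within-cluster parameter $\varrho/\pi_i$; and the global threshold $s^{*}$ is found by bisection so that the expected overall fraction of selected points equals $\alpha$. All names and parameter values are illustrative.

```python
from math import sqrt
from statistics import NormalDist

_STD = NormalDist()

def cfof_cdf(s, kappa, rho):
    """Assumed closed form of the CFOF cdf, clamped outside (0, 1)."""
    if s <= 0.0:
        return 0.0
    if s >= 1.0:
        return 1.0
    num = _STD.inv_cdf(s) * sqrt(kappa + 3) - 2 * _STD.inv_cdf(rho)
    return _STD.cdf(num / sqrt(kappa - 1))

def outlier_fractions(alpha, rho, pis, kappas, iters=200):
    """Return (s_star, per-cluster fractions alpha_i); requires rho < min(pis)."""
    def tail(s, pi, kappa):
        # Fraction of cluster points whose score exceeds the global threshold s.
        return 1.0 - cfof_cdf(s / pi, kappa, rho / pi)

    lo, hi = 0.0, 1.0
    for _ in range(iters):  # bisection on the decreasing overall tail
        mid = (lo + hi) / 2
        total = sum(pi * tail(mid, pi, k) for pi, k in zip(pis, kappas))
        if total > alpha:
            lo = mid
        else:
            hi = mid
    s_star = (lo + hi) / 2
    return s_star, [tail(s_star, pi, k) for pi, k in zip(pis, kappas)]

alpha, rho = 0.01, 0.01             # hypothetical fraction of outliers and parameter
pis, kappas = (0.5, 0.5), (9.0, 3.0)
s_star, alphas = outlier_fractions(alpha, rho, pis, kappas)
shares = [pi * a / alpha for pi, a in zip(pis, alphas)]
print(f"s* = {s_star:.4f}  share of outliers per cluster = {shares}")
```

With equally sized clusters, the cluster with the larger kurtosis contributes the larger share under these assumptions; changing `pis` shows the effect of the relative size.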
As for the generating distribution, if we consider points having positive squared norm standard score (which form the most extreme half of the population), then at the same squared norm standard score value, the CFOF score is higher for points whose generating distribution has larger kurtosis.
Theorem 3.12.
Let and be two i.i.d. random vectors, and let and be two realizations of and , respectively, such that . Then, for arbitrary large dimensionalities , if and only if .
The result descends from the fact that for non-negative values, the CFOF score is monotone increasing with respect to the kurtosis parameter. ∎
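The monotonicity in the kurtosis can be inspected directly. The sketch below assumes the closed form $\Phi\big((z\sqrt{\kappa-1} + 2\Phi^{-1}(\varrho))/\sqrt{\kappa+3}\big)$ for the expected CFOF score and a small parameter $\varrho = 0.01$; both the function name and the values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD = NormalDist()

def expected_cfof(z, kappa, rho):
    """Assumed form of Eq. (4): expected CFOF score at squared norm standard score z."""
    return _STD.cdf((z * sqrt(kappa - 1) + 2 * _STD.inv_cdf(rho)) / sqrt(kappa + 3))

rho = 0.01                         # hypothetical parameter value
kappas = (1.8, 3.0, 9.0, 100.0)    # increasing kurtosis
for z in (0.0, 1.0, 3.0):          # non-negative squared norm standard scores
    scores = [expected_cfof(z, k, rho) for k in kappas]
    print(f"z = {z:.1f}  scores by increasing kappa: {[round(s, 4) for s in scores]}")
```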
Thus, the larger the cluster kurtosis, the larger the fraction of outliers coming from that cluster. To clarify the relationship between kurtosis and number of outliers we considered a dataset consisting of two clusters: the first (second, resp.) cluster consists of points generated according to an i.i.d. random vector having kurtosis (, resp.). We then varied the occurrence probability of the first cluster in the interval , while the occurrence probability of the second cluster is , and set to and the fraction of outliers to be selected to . Figure 9 reports the percentage of the top- outliers that come from the first cluster (that is ) as a function of for different values. It can be seen that, for the fraction is more closely related to (with for ), while in the general is directly proportional to the ratio .
Intuitively, for datasets consisting of multiple shifted and scaled copies of a given cluster, a score able to retrieve local outliers should return the same outliers from each cluster, both in