
How I learned to stop worrying and love the curse of dimensionality: an appraisal of cluster validation in high-dimensional spaces

The failure of the Euclidean norm to reliably distinguish between nearby and distant points in high-dimensional spaces is well known. This phenomenon of distance concentration manifests in a variety of data distributions, with iid or correlated features, including centrally-distributed and clustered data. Unsupervised learning based on Euclidean nearest neighbors, and more general proximity-oriented data mining tasks like clustering, might therefore be adversely affected by distance concentration in high-dimensional applications. While considerable work has been done developing clustering algorithms with reliable high-dimensional performance, the problem of cluster validation, that of determining the natural number of clusters in a dataset, has not been carefully examined in high-dimensional problems. In this work we investigate how the sensitivities of common Euclidean norm-based cluster validity indices scale with dimension for a variety of synthetic data schemes, including well-separated and noisy clusters, and find that the overwhelming majority of indices have improved or stable sensitivity in high dimensions. The curse of dimensionality is therefore dispelled for this class of fairly generic data schemes.


1 Introduction

The notion of proximity is essential for a variety of unsupervised learning techniques, including nearest-neighbors-based analyses and clustering problems. It is known, however, that Euclidean distances tend to “concentrate” in high dimensions Beyer; Hinneburg; Pestov, with the result that data points tend towards equidistance. This phenomenon threatens the ability of algorithms based on Euclidean distance to perform well when feature spaces become large. Along with the exponential increase in time complexity of high-dimensional problems, distance concentration has earned its place as a pillar of the “curse of dimensionality”.

Quantitatively, distance concentration is exhibited as the limit

$\lim_{d\to\infty} \dfrac{D_{\max} - D_{\min}}{D_{\min}} = 0,$ (1)

where the distances to the closest ($D_{\min}$) and farthest ($D_{\max}$) points from a particular reference point are driven to equality as the dimensionality, $d$, of the feature space increases. While the extent to which this phenomenon is realized depends on the nature of the data distribution, it is known to occur for independent and identically distributed (iid) data distributions with finite moments, as well as data with correlated features Beyer; this includes both uniformly-distributed and clustered data Hinneburg. While there exist data distributions that do not exhibit distance concentration, like linear latent variable models Durrant, the curse of dimensionality has been demonstrated to affect a varied collection of both real-world and synthetic datasets Flexer2015; Kumari; Kaban (see Hall; Ahn for the conditions required for a dataset to exhibit distance concentration and related geometrical phenomena).
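As a quick numerical illustration of Eq. (1) (a sketch of our own, with an arbitrary choice of iid uniform data and sample size), the relative contrast between the farthest and closest points can be estimated as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(d, n_points=1000):
    """Empirical (D_max - D_min) / D_min for iid uniform data in d dimensions,
    measured from an arbitrary reference point."""
    X = rng.uniform(0.0, 1.0, size=(n_points, d))
    ref, rest = X[0], X[1:]
    dists = np.linalg.norm(rest - ref, axis=1)  # Euclidean distances to all other points
    return (dists.max() - dists.min()) / dists.min()

for d in [2, 10, 100, 1000]:
    print(d, relative_contrast(d))  # the contrast shrinks toward zero as d grows
```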

Figure 1: Behavior of the distance ratio $D_{\max}/D_{\min}$ (left) and hubness (right) as a function of dimension for four types of data distributions: uniform and Gaussian, and datasets with 20 and 50 Gaussian clusters.

A related distance-oriented phenomenon in high-dimensional spaces is the formation of hubs: points that are close to a large fraction of the dataset Radovanovic09; Radovanovic10. The presence of hubs can result in highly unbalanced clustering, with a small number of clusters containing a disproportionate amount of data Tomasev14; Schnitzer15, and can render nearest neighbors-based analysis meaningless; like distance concentration, hubness is apparently a fairly generic occurrence in high-dimensional datasets Flexer2015. To compute hubness, one counts the number of times each point appears in the $k$-neighborhood of all other points. The skewness of this distribution, parameterized by $k$, is defined as the hubness.
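This computation can be sketched as follows (an illustrative implementation, assuming NumPy and SciPy; the neighborhood size k = 10 and the Gaussian test data are arbitrary choices):

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import skew

def hubness(X, k=10):
    """Skewness of the k-occurrence distribution: how often each point appears
    among the k nearest neighbors of the other points."""
    n = X.shape[0]
    D = cdist(X, X)                             # pairwise Euclidean distances
    np.fill_diagonal(D, np.inf)                 # a point is not its own neighbor
    knn = np.argsort(D, axis=1)[:, :k]          # indices of each point's k nearest neighbors
    counts = np.bincount(knn.ravel(), minlength=n)  # k-occurrence of each point
    return skew(counts)

X = np.random.default_rng(1).normal(size=(500, 100))  # 500 Gaussian points in d = 100
print(hubness(X, k=10))
```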

To illustrate these phenomena, we plot the distance ratio $D_{\max}/D_{\min}$ and the hubness (for a fixed neighborhood size $k$) as a function of dimension for a variety of distributions in Figure 1. Gaussian-distributed data exhibits both phenomena more strongly than uniformly-distributed data, with clustered data least susceptible.

Many avenues have been explored to combat and ameliorate these effects, including the use of similarity measures based on the notion of shared nearest neighbors Yin; Houle10 to perform clustering in high-dimensional spaces. A substantial amount of work has been put towards the investigation of alternative distance metrics, like the L1 norm and fractional Minkowski distances, with allegedly improved high-dimensional discrimination Aggarwal2001; Aggarwal22001; Doherty2004; Verleysen05; Hsu; Jayaram, although the recent work of Mirkes finds no such benefit. Methods to remove hubs via distance scaling have been explored Schnitzer12; Schnitzer14, as has the use of hub-reducing distance measures Suzuki; shared nearest neighbors have also been shown to reduce hubness Flexer. Dimensionality reduction, a popular set of techniques for doing away with irrelevant and confounding features, has been used to simply avoid the problem altogether Espadoto; Jin; Berchtold.

The present work explores the effect of distance concentration on clustering in high-dimensional spaces. There has been substantial research into the development of “dimension-tolerant” algorithms Steinbach; Bouveyron; Zimek1; Assent; Pandove; Echihabi, including subspace and projective methods Aggarwal2000; Fern; Parsons; Kriegel09; Moise; Kelkar and modifications to classic clustering algorithms Hornik; Wu14; Chakraborty. These methods are intended to outperform conventional algorithms like $k$-means (and other Euclidean distance-based methods) when the number of features becomes large. However, very little attention seems to have been paid to the problem of cluster validation, the determination of the number of natural groupings present in a dataset, in high-dimensional spaces. The optima of various relative validity criteria, like the silhouette score, are commonly used to select the most appropriate number of partitions in a dataset; however, these indices are overwhelmingly based on Euclidean distance. If we are suspicious of the performance of Euclidean-based clustering algorithms, we should be equally concerned with the use of similarly implicated validation measures. This paper is a study of the performance of a wide variety of such validity criteria as the dimensionality of the feature space grows.

The most recent, and apparently sole, investigation of this question is the 2017 work of Tomasev and Radovanovic Tomasev. That work examined how 18 cluster quality indices scale with increasing dimension. The authors reason as follows: if a particular index is maximized at the optimal partition (with $K^*$ clusters), for example, then the growth of this index with dimension indicates that it “prefers” higher-dimensional distributions, and conversely for indices that are minimized at the optimal partition.

While this is true on its face, it does not tell us whether a particular index is still useful in high dimensions. To be of use, the practitioner must be able to identify the optimum of the index over some range of partitionings of the data, and this depends not only on the magnitude of the index at $K^*$, but also on its values in the neighborhood of the optimum (at $K^*-1$ and $K^*+1$). As an example, suppose some index, $I$, is maximized at $K^*$ and that this value grows with dimension: unless the values at $K^*-1$ and $K^*+1$ scale identically (or less strongly), the curvature of the function around $K^*$ will lessen and the optimum will be less prominent. This is quite possible in practice, since the scaling behaviors of $I(K^*)$ and $I(K^*\pm 1)$ are in general different. We refer to this “sharpness” of the optimum as the sensitivity of the index.

In this work we assess sensitivity as a function of dimension for 24 relative validity criteria from the standard literature, for a variety of different simulated data schemes, including well-separated clusters as well as distributions with noise. We find that all but one of the indices have either improving or stable sensitivity as dimensionality increases for well-separated, Gaussian clusters. This observation agrees broadly with the work of Tomasev, but we find that several of the indices that were stable in that analysis (C-index, Caliński-Harabasz, and RS) actually degrade (C-index) or improve (Caliński-Harabasz and RS). A good number of indices also perform well on noisy data in high dimensions, and we observe that fractional Minkowski metrics might offer improvements in some cases. In summary, this analysis concludes that relative validity criteria are largely unaffected by distance concentration in high-dimensional spaces.

Before continuing, a note on terminology. In the following, we refer to a cluster as one of the partitions found by a clustering algorithm. The number of clusters is a free parameter in most algorithms (the $k$ in $k$-means). We refer to a grouping as the set of points arising from a single Gaussian distribution. The number of groupings is a property of the dataset; the optimal number of clusters, $K^*$, is therefore equal to the number of groupings.

2 Index sensitivity

Figure 2: (a) Ball-Hall index as a function of the number of clusters, $K$, for a synthetic dataset of well-separated Gaussian clusters. The correct number of clusters occurs at the “elbow” of this plot. (b) Ball-Hall index evaluated at $K = K^*$ as a function of dimension, $d$, for well-separated Gaussian clusters, showing linear scaling behavior.

The sensitivity of an index, $I$, is a measure of the prominence of the optimum of the function $I(K)$. We introduce this notion of sensitivity within the context of a few examples.

A fundamental quantity in clustering analysis is the within-cluster dispersion, which is the sum of the squared distances of all points, $x_i$, in a cluster $C_k$ from the cluster barycenter, $\bar{x}_k$. It is often written WSS, for within-cluster sum of squares,

$\mathrm{WSS}_k = \sum_{x_i \in C_k} \lVert x_i - \bar{x}_k \rVert^2.$ (2)

$\mathrm{WSS}_k$ measures how tightly a cluster’s points are distributed about its center, and so furnishes a simple measure of relative cluster validity: if a given partition uses too few clusters, then some of these clusters will span multiple distinct groupings and will thus have larger mean WSS values than those clusters spanning single groupings. Alternatively, if a partition uses too many clusters, then all cluster points will be relatively close to their respective barycenters, and the mean WSS values will be approximately equal across the clusters. The mean dispersion, averaged over all clusters, plotted as a function of the number of clusters, $K$, thus resembles an “elbow” which can be used to determine the optimal number of clusters admitted by a dataset, Figure 2 (a).

The Ball-Hall index is precisely this quantity Ball,

$\mathrm{BH}(K) = \dfrac{1}{K} \sum_{k=1}^{K} \dfrac{\mathrm{WSS}_k}{n_k},$ (3)

where $n_k$ is the number of points in cluster $C_k$. We would like to understand how indices like Ball-Hall perform as the dimensionality, $d$, of the dataset is increased. If we assume that our data is arranged into well-separated, Gaussian groups, we can actually make some headway analytically. Suppose we partition the data into $K$ clusters. Analyzing each cluster separately, we can always shift its barycenter to zero; then, each term in Eq. (2) is the sum of $d$ squared, mean-zero Gaussians and so is $\chi^2$-distributed with $d$ degrees of freedom. The mean dispersion, $\mathrm{WSS}_k/n_k$, is then the mean of $n_k$ iid $\chi^2_d$ random variables, which is equal to $d$ (for unit-variance clusters). The Ball-Hall index, which is the mean of these means, therefore scales linearly with dimension, Figure 2 (b). Since smaller WSS values indicate higher-quality clusters, the growth of WSS with dimension might suggest that the WSS (and indices based on it, like Ball-Hall) degrade as the dimensionality of the dataset increases. These are the kinds of scaling arguments made to assess the high-dimensional performance of various relative indices in Tomasev. But this is not the whole story.
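The elbow behavior described above is easy to reproduce numerically. The sketch below (our own illustration, using scikit-learn's KMeans and make_blobs with arbitrary parameters) computes the Ball-Hall index of Eq. (3) over a range of K:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def ball_hall(X, labels):
    """Mean over clusters of the per-point within-cluster dispersion, Eq. (3)."""
    vals = []
    for c in np.unique(labels):
        pts = X[labels == c]
        vals.append(np.sum((pts - pts.mean(axis=0)) ** 2) / len(pts))
    return float(np.mean(vals))

X, _ = make_blobs(n_samples=1000, centers=5, n_features=50, random_state=0)
for K in range(2, 9):
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)
    print(K, round(ball_hall(X, labels), 2))  # the elbow appears near the true K* = 5
```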

In practice, relative validity indices yield the preferred number of clusters, $K^*$, either as an optimum or as a knee- or elbow-point (these latter types are referred to as difference-like in Vendramin because they can be converted to an optimization problem by taking suitable differences of consecutive values) when the index is plotted as a function of $K$. Good indices develop strong optima at $K = K^*$ so that the correct number of clusters can be confidently and reliably determined. We refer to this property as the sensitivity of the index to the correct number of clusters, defined simply as the height (or depth) of the optimum relative to its neighboring points:

$\sigma_I = \left| I(K^*) - \underset{K \in \{K^*-1,\, K^*+1\}}{\mathrm{opt}} I(K) \right|,$ (4)

for a generic index $I$, where opt refers to the maximum or minimum, as the case may be. It is this sensitivity that we would like to study as a function of increasing dimensionality.
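As a minimal illustration, assuming index values stored in a dictionary keyed by K and with Eq. (4) read as the gap between the optimum and its nearer-valued neighbor, the sensitivity can be computed as follows (function and argument names are our own):

```python
def sensitivity(index_values, k_star, maximize=True):
    """Prominence of the optimum at k_star relative to its neighbors, cf. Eq. (4)."""
    neighbors = [index_values[k_star - 1], index_values[k_star + 1]]
    opt_neighbor = max(neighbors) if maximize else min(neighbors)
    return abs(index_values[k_star] - opt_neighbor)

# usage: a silhouette-like index that is maximized at K* = 4
vals = {3: 0.41, 4: 0.72, 5: 0.44}
print(sensitivity(vals, k_star=4, maximize=True))  # 0.28
```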

Returning to our example of the Ball-Hall index: as an elbow-like index, $\mathrm{BH}(K^*)$ is most different from $\mathrm{BH}(K^*-1)$, since $\mathrm{BH}(K) \approx \mathrm{BH}(K^*)$ for $K > K^*$ (these points lie along the horizontal “forearm”), and so $\sigma_{\mathrm{BH}} \approx \mathrm{BH}(K^*-1) - \mathrm{BH}(K^*)$. We just demonstrated that $\mathrm{BH}(K^*)$ scales linearly with dimension, and so we must now determine how $\mathrm{BH}(K^*-1)$ scales. For ease of exposition, we consider the case of $K^* = 3$, and study $\mathrm{BH}(2)$. An example of how $k$-means with two clusters tends to partition the data is shown in Figure 3. Notice that two natural groupings, $G_1$ and $G_2$, are spanned by a single cluster, $c_1$: if we shift the cluster barycenter to zero, the sum Eq. (2) for this cluster contains squared Gaussians from two different distributions with means $\mu_1$ and $\mu_2$, each nonzero. The sum of squares of points from $G_1$ follows a noncentral $\chi^2$ distribution with mean $d + \lVert\mu_1\rVert^2$, and similarly for points from $G_2$. If we suppose that $\lVert\mu_1\rVert \approx \lVert\mu_2\rVert \equiv \mu$ and if we assume groups of equal size, the mean dispersion of cluster $c_1$ is approximately $d + \mu^2$.

Figure 3: Example partitioning of three natural groupings into two clusters using $k$-means. The barycenter of cluster $c_1$ lies at the midpoint of the two groupings, $G_1$ and $G_2$.

Meanwhile, the second cluster, $c_2$, spans the single grouping $G_3$: relative to its barycenter it is a zero-mean Gaussian, and so its squared distances are $\chi^2_d$-distributed with mean $d$, as per our earlier discussion. The BH index is the average over clusters $c_1$ and $c_2$, $\mathrm{BH}(2) \approx d + \mu^2/2$. With $\lVert\mu_1\rVert^2$ and $\lVert\mu_2\rVert^2$ scaling linearly with dimension (the grouping centroids are offset along every feature), the sensitivity of the Ball-Hall index is $\sigma_{\mathrm{BH}} \approx \mathrm{BH}(2) - \mathrm{BH}(3) \approx \mu^2/2$. As dimensionality increases, the sensitivity of the Ball-Hall index likewise increases at a linear rate (cf. Figure 4), contrary to any supposition based on the scaling behavior of $\mathrm{BH}(K^*)$ alone.

Figure 4: (a) Ball-Hall index as a function of $K$ for data of three different dimensionalities, including $d = 20$ and $d = 70$, illustrating the increase in elbow steepness with dimension. (b) The sensitivity of the Ball-Hall index, which is proportional to this elbow steepness, as a function of dimension, showing a linear scaling behavior.

While Euclidean distances concentrate even for well-defined clusters in high dimensions, the sensitivity of the Ball-Hall index is unaffected: the sensitivity, as the difference of the mean dispersion evaluated on two different partitions, each unbounded and scaling linearly with $d$, also grows linearly in $d$ without bound. This result will hold true for any index defined in proportion to WSS in this way.

The quantity WSS can be thought of as describing the intra-cluster structure of the dataset. In this sense, it probes a single scale: those distances of order the cluster diameter. Another fundamental quantity, the between-cluster dispersion, or BSS (for between-cluster sum of squares), measures the degree of separation of different clusters,

$\mathrm{BSS} = \sum_{k=1}^{K} n_k \lVert \bar{x}_k - \bar{x} \rVert^2,$ (5)

where $n_k$ is the number of points in cluster $C_k$ and $\bar{x}$ is the barycenter of the entire dataset. Importantly, BSS probes a different scale than WSS, and so provides new and separate information about the data partition. Many of the more robust relative validity criteria incorporate two or more quantities like these that operate on different scales. Whereas indices based on WSS (like BH) essentially inherit its scaling behavior, once a second quantity is introduced there is great variety in how the resulting index scales, depending on how the base quantities are combined. Consider the index of Hartigan Hartigan,

$H(K) = \log\!\left(\dfrac{\mathrm{BSS}}{\mathrm{WSS}}\right),$ (6)

based on the ratio of inter- and intra-cluster dispersions. Here $\mathrm{WSS} = \sum_{k=1}^{K} \mathrm{WSS}_k$ is the pooled within-cluster sum of squares. The optimal number of clusters, $K^*$, corresponds to the knee-point of $H(K)$: the index rises steeply for $K \leq K^*$ and is roughly flat for $K > K^*$.

For datasets that exhibit natural groupings, it is sometimes reasonable to suppose that the datapoints within these groups follow some prescribed distribution (like Gaussians, as assumed above). It is harder, however, to anticipate how these group centroids might be distributed, and so a scaling analysis of BSS based on underlying data distributions, as performed for WSS, is not possible in general. Generically, though, we know that Euclidean squared distances scale linearly with dimension, and so we can write $\mathrm{BSS} \approx \beta d$, for $\beta$ determined by the actual distribution of cluster and data centroids. From above, we know that $\mathrm{WSS} \approx \alpha d$, and so we expect that $H \to \log(\beta/\alpha)$ as $d \to \infty$. In practice, the Hartigan index becomes constant quickly as $d$ grows. Because both BSS and WSS also scale linearly with $d$ away from $K^*$, the sensitivity of the Hartigan index approaches a constant as the dimensionality increases. (As we will see in the next section, by transforming the Hartigan index, and other knee- and elbow-type indices, into optimization problems, the sensitivity can actually improve with dimension. We will discuss why in the next section.)
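For concreteness, the pooled dispersions and the log-ratio of Eq. (6) can be computed with a short helper (an illustrative sketch under our reconstruction of the formula; names are our own):

```python
import numpy as np

def wss_bss(X, labels):
    """Pooled within-cluster (WSS) and between-cluster (BSS) sums of squares."""
    grand_mean = X.mean(axis=0)
    wss, bss = 0.0, 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        center = pts.mean(axis=0)
        wss += np.sum((pts - center) ** 2)
        bss += len(pts) * np.sum((center - grand_mean) ** 2)
    return wss, bss

def hartigan_index(X, labels):
    """Log of the ratio of between- to within-cluster dispersion, cf. Eq. (6)."""
    wss, bss = wss_bss(X, labels)
    return np.log(bss / wss)
```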

In this section, we have analytically examined how the sensitivities of two relative validity indices scale with increasing dimension for the case of well-separated Gaussian clusters. This analysis was simply to motivate, via example, our notion of sensitivity and to contrast it with the scaling behavior of indices evaluated at $K^*$ alone. We also observed that the sensitivity scales according to which base quantities are included in the index (e.g. within-cluster vs between-cluster) and how they are functionally combined. There are many dozens of other relative validity criteria in the literature of wildly different origin, but all are ultimately defined in terms of quantities relevant on one of three scales: within-cluster (W), between-cluster (B), and full dataset (D). For example, Ball-Hall is based on within-cluster distances (W-type), and Hartigan is defined in terms of both within- and between-cluster quantities (WB-type). An example of an index that incorporates a full-dataset measure is the R-squared index,

$\mathrm{RS} = \dfrac{\mathrm{TSS} - \mathrm{WSS}}{\mathrm{TSS}},$ (7)

where the denominator, $\mathrm{TSS} = \sum_i \lVert x_i - \bar{x} \rVert^2$, is the sum of squared distances between each data point and the centroid of the full dataset. The RS measure is also a function of within-cluster distances and so is of type WD.

In the next section, we examine the scaling behavior of the sensitivities of 24 relative validity criteria under a few different data schemes, including well-separated univariate Gaussians as well as noisier distributions. In addition to studying how the various criteria scale according to data scheme, we are also interested in learning whether the type of index (e.g. W vs WD) influences scaling behavior in any discernible way.

Figure 5: Mean relative sensitivity of (a) WB-type indices, and (b) W-, WD-, and WBD-type indices. Sensitivities less than one indicate a loss of sensitivity relative to that achieved at $d = 5$.

3 Experiments

The sensitivities of 24 relative validity criteria from the recent literature Liu2010; Vendramin; Tomasev; Desgraupes (listed in Table 1 and defined in the Appendix) are evaluated in this section for five different data schemes: well-separated univariate Gaussian clusters, well-separated multivariate Gaussian clusters, well-separated univariate Gaussian clusters immersed in uniform noise, and well-separated univariate Gaussian clusters with some percentage of features (10% or 50%) uniformly distributed (to simulate un-informative, or irrelevant, features). These latter two data distributions are equivalent to Gaussian data with highly-colored noise. We vary the dimension of the space over a range of values, avoiding dimensions fewer than 5 because we find that uniformly distributed clusters tend to overlap there (it is this overlapping of clusters in lower dimensions that we believe is responsible for the apparent improvement at higher dimensions reported for certain indices like the silhouette score). We present here results for a single value of $K^*$, though like Tomasev we observe that sensitivity as a function of dimension is largely insensitive to the value of $K^*$.

Each index is tested against 100 realizations of each data scheme, and the sensitivities at each dimension are averaged over the 100 realizations. We report this average sensitivity relative to that at $d = 5$ (the lowest dimension that we test), $\sigma_I(d)/\sigma_I(5)$. For each dimension and data realization, $k$-means clustering was used to partition the data across a small interval of $K$-values centered on $K^*$. We note that $k$-means correctly partitioned the data when $K = K^*$ for all data realizations in all dimensions tested, consistent with the dimensionality study conducted in Sieranoja.

Index Optimum Type Reference
Baker-Hubert Γ Max WD Baker
Ball-Hall Elbow W Ball
Banfield-Raftery Elbow W Banfield
BIC Elbow W Pelleg
C Index Min WD Hubert
Caliński-Harabasz Max WB Calinski
Davies-Bouldin Min WB Davies
Dunn Max WB Dunn
G+ Min WD Rohlf
Isolation Index Knee WB Frederix
Krzanowski-Lai Max W Krzanowski
Hartigan Knee WB Hartigan
McClain-Rao Elbow WB McClain
PBM Max WBD Pakhira
Point-Biserial Max WB Milligan
RMSSTD Elbow W Halkidi
RS Knee WD Halkidi
Ray-Turi Elbow WB Ray
S_Dbw Elbow WB Halkidi2001
Silhouette Max WB Rousseeuw
Tau (τ) Max WD Milligan
Trace W Elbow W Edwards
Wemmert-Gançarski Max WB Desgraupes
Xie-Beni Elbow WB Xie
Table 1: Relative validity criteria analyzed in this study.

All knee- and elbow-type indices are converted into optimization problems by applying the transformation Vendramin,

$I'(K) = \left[ I(K) - I(K-1) \right] - \left[ I(K+1) - I(K) \right].$ (8)

We mentioned in the last section that the sensitivity of the knee-type Hartigan index improves with dimension after applying the above difference transformation. This is because the points on the “thigh” (with $K > K^*$) flatten out in higher dimensions, and so the difference $I(K+1) - I(K)$ becomes small, driving up the value of $I'(K^*)$.
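Under this reading of Eq. (8), the transformation amounts to a second difference of consecutive index values; a small sketch (our own, with illustrative numbers) is:

```python
def difference_transform(index_values, k):
    """Difference-like transform of a knee/elbow index, cf. Eq. (8): large in
    magnitude where the curve bends sharply, small where it is straight."""
    return (index_values[k] - index_values[k - 1]) - (index_values[k + 1] - index_values[k])

# usage: a knee-shaped curve that flattens after K* = 4
vals = {2: 1.0, 3: 2.0, 4: 2.9, 5: 3.0, 6: 3.05}
best_k = max(range(3, 6), key=lambda k: abs(difference_transform(vals, k)))
print(best_k)  # 4
```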

3.1 Univariate Gaussian clusters

We first consider data composed of univariate Gaussian clusters. The centroid of each cluster is selected uniformly at random in each dimension, and the variance of each cluster is fixed at unity. The number of points in each cluster is selected from a normal distribution with a mean of 200 points per cluster.
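A data-generation sketch along these lines is shown below; the centroid range and the spread of the cluster sizes are placeholders (the paper's exact values are not reproduced here), while the mean of 200 points per cluster follows Section 3.2.

```python
import numpy as np

def make_gaussian_clusters(n_clusters, d, rng, centroid_range=(-10.0, 10.0),
                           mean_size=200, size_sd=20):
    """Well-separated, unit-variance Gaussian clusters with centroids drawn
    uniformly in each dimension; centroid_range and size_sd are placeholders."""
    points, labels = [], []
    for k in range(n_clusters):
        center = rng.uniform(*centroid_range, size=d)
        n_k = max(2, int(rng.normal(mean_size, size_sd)))
        points.append(rng.normal(loc=center, scale=1.0, size=(n_k, d)))
        labels.append(np.full(n_k, k))
    return np.vstack(points), np.concatenate(labels)

rng = np.random.default_rng(0)
X, y = make_gaussian_clusters(n_clusters=10, d=50, rng=rng)
```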

The sensitivities of the indices are plotted as a function of dimension in Figure 5. The 12 WB-type indices are plotted separately (a), and the 6 W-type, 5 WD-type (dashed), and 1 WBD-type (dotted) indices are plotted together in (b).

Figure 6: (a) Case $K < K^*$: When a single cluster spans two natural groupings, two points within this cluster can have a greater separation than two points in different clusters. (b) Case $K > K^*$: When a single natural grouping is split by two clusters, two points within this grouping but in different clusters can be closer together than two points within the same cluster spanning a single natural grouping.

All of the WB-type indices have improved sensitivity as dimensionality increases for well-separated, Gaussian clusters. Of those indices in Figure 5 (b), only the C index noticeably degrades as dimensionality increases. All of the other WD-type indices (except for RS) have approximately stable sensitivities; that is, they have relative sensitivities near unity. Stable indices are scale free, with the desirable property that we know precisely how they will behave in higher dimensions. They are of considerable interest and so warrant further investigation. This group of WD-type indices, Baker-Hubert $\Gamma$, G+, and $\tau$, are functions of the quantities $s^+$ and/or $s^-$, where $s^+$ is the number of concordant, and $s^-$ the number of discordant, pairs of points. Two data points $x_i$ and $x_j$ within the same cluster form a concordant pair if they are closer together than any two points $x_k$ and $x_l$ not within the same cluster; if $x_i$ and $x_j$ do not form a concordant pair, then there exist points $x_k$ and $x_l$ that form a discordant pair.

To understand the stability of these quantities’ sensitivities, consider, for example, the quantity $s^-$: when fewer than $K^*$ clusters are used to partition the data, at least one cluster will contain at least two natural groupings, so that points within the same cluster might be quite far apart. Referring to Figure 6 (a), a particular pair of points $x_i$ and $x_j$ within the same cluster spanning the two groups $G_1$ and $G_2$ will be farther apart than a great number of pairs $x_k$ and $x_l$ in different clusters, so $s^-$ is large when $K < K^*$. Likewise, referring to Figure 6 (b), when $K > K^*$, $s^-$ is large because in this case at least one natural grouping is split among at least two clusters, so that points $x_k$ and $x_l$ within such a split grouping (but within different clusters) will be relatively close together in comparison to points $x_i$ and $x_j$ on opposite sides of some other cluster. The minimum of $s^-$ across a range of $K$ therefore picks out $K^*$, and we argue now that the sensitivity of $s^-$ is largely independent of dimension.

First, we argue that $s^-(K^*+1) \leq s^-(K^*-1)$. When $K = K^*-1$, of order $n_k^2$ pairs of points (from a single cluster spanning two natural groupings) will have a separation greater than the closest pair of points $x_k$ and $x_l$ in different clusters. When $K = K^*+1$, fewer pairs of points in the different clusters spanning a single natural grouping will be closer together than pairs of points $x_i$ and $x_j$ in some other cluster, as in Figure 6 (b). There is then an upper bound, $s^-(K^*+1) \leq s^-(K^*-1)$, which is actually very liberally satisfied in practice. All of this is to conclude that $\sigma_{s^-} \approx s^-(K^*+1) - s^-(K^*)$. If we suppose that $s^-(K^*) \approx 0$ (this is the case seen empirically for well-separated clusters), it remains to then show that $s^-(K^*+1)$ is independent of dimension.

The focus is on how the distances between pairs of points in different clusters of a split grouping compare with the distances between points within the same grouping (as in Figure 6 (b)). Assuming for simplicity that all clusters are of equal size, we can focus on a single Gaussian cluster and study the following problem: for each pair of points $x_k$ and $x_l$ existing in the separate regions of a bisected version of this cluster, how many pairs of points $x_i$ and $x_j$ existing anywhere within the complete cluster have a greater separation, $\lVert x_i - x_j \rVert > \lVert x_k - x_l \rVert$? The distribution of Euclidean distances between Gaussian-random points has been studied in Thirey, where it is found that as $d$ grows, this distribution resembles a shifted Gaussian with mean, $\bar{r}$, growing with $\sqrt{d}$ (in fact, this distribution becomes very nearly normal for relatively low dimensions). This shift away from the origin is due to the fact that distances tend to grow as the dimensionality of the space increases: quite simply, there is more space to move around in. Meanwhile, the variance of the shifted Gaussians is approximately independent of $d$, with the result that the number of pairs with a separation greater than a fixed proportion of the mean separation, $\bar{r}$, is independent of dimension. This is relevant to our question: the number of pairs $x_k$, $x_l$ with a given separation relative to $\bar{r}$ is constant as dimensionality increases, as is the number of pairs with $\lVert x_i - x_j \rVert > \lVert x_k - x_l \rVert$. The number of all such pairs is $s^-(K^*+1)$. The quantity $s^-(K^*+1)$ is therefore independent of dimension.
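This behavior is easy to check empirically; the sketch below (our own, with arbitrary sample sizes) estimates the mean and spread of pairwise distances in a unit-variance Gaussian cluster as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in [10, 100, 1000]:
    X = rng.normal(size=(2000, d))                 # a unit-variance Gaussian cluster
    i, j = rng.integers(0, 2000, size=(2, 6000))   # random pairs of points
    mask = i != j
    dists = np.linalg.norm(X[i[mask]] - X[j[mask]], axis=1)
    print(d, round(dists.mean(), 2), round(dists.std(), 2))
# The mean grows roughly like sqrt(2d) while the spread stays nearly constant,
# consistent with the argument above.
```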

Figure 7: Accuracy of indices as a function of dimension for three different distance metrics: Euclidean, fractional Minkowski, and cosine similarity, when tested against Gaussian clusters immersed in uniform noise.

As with the Ball-Hall and Hartigan indices analyzed in the last section, we see that statistical arguments can be used to analyze the sensitivity of indices based on the number of concordant and/or discordant pairs, like the Baker-Hubert $\Gamma$, G+, and $\tau$. This kind of analysis is useful also for designing validity indices with predictably good high-dimensional behavior.

We also performed the above testing for well-separated, multivariate Gaussian clusters, with data prepared as above but with the standard deviation selected randomly from a uniform interval for each dimension. The results are qualitatively very similar to the above, again with all indices except the C index either improving or stable with increasing dimension.

In summary, we can conclude from these results that the great majority of relative validity criteria see improved sensitivity in high-dimensional data spaces with well-separated natural groupings.

3.2 Univariate Gaussian clusters with noise

The above case is an idealized scenario, as few real-world datasets are composed of clean, well-separated groupings. To create noisy data, we generate univariate Gaussian clusters as above and then add uniform noise over a fixed interval in each dimension. This is a kind of white noise, present with equal probability in each dimension. Since $k$-means will assign each noise point to a cluster, when $K = K^*$ the clusters will be strongly Gaussian, with noisy points occurring as outliers in at least one dimension. The noise level is quantified as a percentage of the number of points in the dataset: since our clusters have on average 200 points, a 1% noise level works out to about 2 noisy points per cluster. Though a seemingly tiny amount of noise, we find that some indices are badly affected at the 1% level. In Figure 7 (top left) the accuracy of each index (the proportion of data instantiations in which the index correctly picked out $K^*$) is plotted as a function of dimension. Several indices perform poorly, with low accuracy already at the lowest dimensions that degrades to near 0% in higher dimensions. Meanwhile, a small collection of indices perform well for all $d$: Davies-Bouldin, Krzanowski-Lai, PBM, RMSSTD, RS, and Trace W, each maintaining an accuracy above 80% across the range of tested dimensions, showing that noisy data does not compound the curse of dimensionality for these indices.

More interesting are the indices that perform relatively well in low dimensions but degrade as $d$ increases: Banfield-Raftery and McClain-Rao are two such indices that degrade rather quickly, whereas the BIC, Caliński-Harabasz, and Hartigan indices degrade more abruptly only when $d$ surpasses 100 or so. From Figure 1, we see that uniform data distributions more strongly exhibit the issues of distance concentration and hub formation, suggesting that the uniform noise introduced in this data scheme might be contributing to the reduced accuracy of these indices. To test this, one might replace the Euclidean distance used in these indices with a distance metric that has better high-dimensional performance. Several authors have studied and argued for the use of Minkowski metrics with fractional powers in high-dimensional spaces,

$d_f(x, y) = \sum_{i=1}^{d} \lvert x_i - y_i \rvert^{f}.$ (9)

This metric generalizes the familiar Euclidean-squared, or $L_2^2$, norm to arbitrary powers $f$. There is some evidence Aggarwal2001; Aggarwal22001; Doherty2004; Verleysen05; Hsu; Jayaram that norms with $f < 1$ are more resistant to distance concentration in high dimensions than the $L_2$ norm.

There is also the cosine similarity,

$\cos(x, y) = \dfrac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert},$ (10)

which associates the closeness of two points with the degree of co-directionality of their vectors. It has found use in high-dimensional generalizations of $k$-means (via the spherical $k$-means algorithm Hornik), in which points are projected onto the surface of a hypersphere via the cosine similarity, and then partitioned on this surface using conventional $k$-means.
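For reference, the two alternative dissimilarities can be written in a few lines (an illustrative sketch; the fractional power f = 0.5 is a placeholder, not necessarily the value used in the experiments):

```python
import numpy as np

def minkowski_f(x, y, f=0.5):
    """Fractional Minkowski dissimilarity, cf. Eq. (9); f = 0.5 is a placeholder."""
    return float(np.sum(np.abs(x - y) ** f))

def cosine_distance(x, y):
    """One minus the cosine similarity of Eq. (10)."""
    return 1.0 - float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
```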

To determine whether these metrics might improve cluster validation in high-dimensional spaces in the presence of uniform noise, each index in Table 1 was redefined in terms of the fractional Minkowski metric and the cosine similarity, and the above experiment repeated. Results are shown in Figure 7 (bottom). The Minkowski metric offers some improvement, particularly by recovering the performance of the BIC, Caliński-Harabasz, and Hartigan indices that flag in high dimensions with Euclidean distance. Meanwhile, the cosine similarity generally under-performs Euclidean distance at low dimension, but supports a fairly general improvement with $d$ for a number of indices.

Figure 8: Mean relative sensitivity of indices from Figure 7 with greater than 80% accuracy with (a) Euclidean distance and (b) Minkowski distance.

Focusing only on those indices with accuracies surpassing 80% across the full range of tested dimensions, we next examine their sensitivities in Figure 8. While noise does not affect the accuracy of these indices, it does attenuate their sensitivities, causing some based on Euclidean distance to degrade with increasing dimension. Use of the fractional Minkowski distance tends to ease this difficulty, resulting in most indices performing stably across dimensions.

Figure 9: (a) Accuracy and (b) mean relative sensitivity of Euclidean indices when tested against Gaussian clusters with 50% irrelevant features.

As the noise level is increased, performance degrades; for example, at a 10% noise level, all indices drop below roughly 20% accuracy across all dimensions. The presence of noise is thus a direct challenge to validity indices even in low dimensions, and is not compounded by increasing dimensionality. In summary, we find that for low levels of noise, certain Euclidean-based indices perform accurately in all tested dimensions, and the fractional Minkowski metric offers both improved accuracy and sensitivity for some of these indices.

3.3 Univariate Gaussian clusters with irrelevant features

It sometimes occurs that not all features follow patterns that are useful for modeling. In the context of clustering problems, these features tend to follow uniform random or other non-central distributions, in contrast to relevant features that tend to be more centrally distributed. The result is that the data do not form well-defined clusters in the irrelevant dimensions. In this section, we study the effect of irrelevant features on cluster validation in high-dimensional spaces by considering two data schemes, in which 10% and 50% of features are irrelevant.

To create these datasets, for each cluster we randomly select a percentage of features (10% or 50%) and draw the values of these features for each data point from a uniform range. All other features are deemed relevant and drawn from a univariate Gaussian distribution as above. The results for validation indices with Euclidean distance are provided in Figure 9 for 50% irrelevant features. Remarkably, accuracies are generally low in low-dimensional feature spaces (a) and, without exception, improve with increasing dimension. Relative sensitivities are also generally increasing functions of $d$, except for the Caliński-Harabasz and Xie-Beni indices (b). Results for 10% irrelevant features are qualitatively similar, with slightly better performance and sensitivity. A small set of indices (C index, G+, and $\tau$) could not be analyzed because their accuracy was zero at $d = 5$, which prevented an evaluation of the baseline sensitivity; their relative sensitivities are therefore not defined.

This improvement of accuracy with dimension can be understood as follows. First, notice that irrelevant features tend to deform clusters uniformly along the corresponding dimension. For example, consider a cluster in 3 dimensions that is Gaussian-distributed in two dimensions and uniformly distributed along one (and so has one irrelevant dimension). The resulting distribution is a uniform cylinder with a Gaussian cross-section. Importantly, the effect of irrelevant features is not to create random noise throughout the dataset, but to uniformly stretch and deform clusters. In higher dimensions, the effect of distance concentration comes into play: it acts to de-emphasize the stretched dimensions relative to the relevant features, resulting in a more localized cluster. To see this quantitatively, we define the ratio

$R = \dfrac{\max_{x_i \in C_k} \lVert x_i - \bar{x}_k \rVert}{\frac{1}{n_k} \sum_{x_i \in C_k} \lVert x_i - \bar{x}_k \rVert},$ (11)

which is the maximum distance from the cluster centroid relative to the average distance. As the dimension of the space grows, this ratio tends to unity, as shown in Figure 10 for a single cluster with half of its features irrelevant.
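The ratio of Eq. (11) can be measured directly; in the sketch below (our own, with a placeholder noise range) half of the features of a single Gaussian cluster are replaced by uniform values:

```python
import numpy as np

def max_to_mean_ratio(cluster_points):
    """Eq. (11): maximum distance from the cluster centroid over the mean distance."""
    center = cluster_points.mean(axis=0)
    dists = np.linalg.norm(cluster_points - center, axis=1)
    return dists.max() / dists.mean()

rng = np.random.default_rng(0)
for d in [10, 100, 1000]:
    n_irrelevant = d // 2                                   # half the features are irrelevant
    gauss = rng.normal(size=(500, d - n_irrelevant))        # relevant, unit-variance features
    unif = rng.uniform(-10, 10, size=(500, n_irrelevant))   # irrelevant features (placeholder range)
    print(d, round(max_to_mean_ratio(np.hstack([gauss, unif])), 3))
# The ratio decreases toward unity as d grows, cf. Figure 10.
```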

Figure 10: The ratio of Eq. (11) as a function of dimension. Its decrease toward unity reveals that the distortion of clusters due to irrelevant features is mitigated by distance concentration in high dimensions.

As clusters become more localized in higher dimensions, validity measures are able to more accurately resolve them and accuracy improves. Use of the fractional Minkowski metric, shown to be particularly useful for defining nearest neighbors in data with colored noise Francois; Francois2007, offers only marginal improvements to accuracy and sensitivity.

4 Conclusions

In this work, the accuracies and sensitivities of a set of 24 popular cluster validation measures were studied as a function of feature-space dimensionality for a variety of data models. For well-separated, multivariate Gaussian clusters, we find that all but one index have either improving or stable sensitivity as the dimensionality of the space is increased. When uniform noise is introduced, performance degrades for some indices in higher dimensions, but this can be corrected by replacing the Euclidean distance with a fractional Minkowski metric. Finally, indices also perform well for data models with up to 50% irrelevant features, with sensitivities and accuracies that improve with increasing dimension. These results paint an optimistic picture: cluster validation using relative indices is not challenged by distance concentration in high-dimensional spaces for Gaussian clusters with white or colored noise. For well-separated Gaussian clusters, with or without colored noise (irrelevant features), index sensitivities actually improve with increasing dimension.

While simulated data is no substitute for real-world datasets, it is the only way to systematically isolate and explore the effect of dimensionality on cluster validation. The data schemes considered in this analysis are meant to address both clean and noisy clustering problems; however, the success of relative criteria ultimately depends on the nature and qualities of the dataset. Because virtually all of the tested indices show success in high dimensions, there is considerable room to select indices appropriate to a given problem.

This analysis has considered relative criteria, as opposed to validation methods based on absolute criteria, like g-means or dip-means, which select partitions that are Gaussian and unimodal, respectively. These two methods are known to degrade considerably in higher dimensions Adolfsson, making the success of relative criteria all the more critical. A systematic analysis like that done here for relative criteria might also be conducted for different absolute criteria to further our understanding of their performance.

In conclusion, it is hoped that this study strengthens our confidence in the face of the curse of dimensionality, revealing that quantities based on the Euclidean norm can still be useful in high-dimensional spaces if they are defined to scale properly. Happily, many of the well-known criteria are so defined, and therefore find continued applicability even for, and particularly for, high-dimensional problems.

5 Appendix

In this section we provide definitions and descriptions of the 24 internal validity indices tested in this work.

5.1 Baker-Hubert Γ

The $\Gamma$ index measures the correlation between the pairwise distances between points and whether these pairs belong to the same cluster or not. Defining $s^+$ as the number of times the distance between two points belonging to the same cluster is strictly smaller than the distance between two points not belonging to the same cluster (the number of concordant pairs) and $s^-$ as the number of times the distance between two points not belonging to the same cluster is strictly smaller than the distance between two points belonging to the same cluster (the number of discordant pairs), the index is defined

$\Gamma = \dfrac{s^+ - s^-}{s^+ + s^-}.$ (12)
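A direct (if brute-force) computation of s⁺, s⁻, and Γ is sketched below; it enumerates all pairs of points and is therefore only practical for small datasets (the implementation is our own illustration):

```python
import numpy as np
from itertools import combinations

def gamma_index(X, labels):
    """Baker-Hubert Gamma, Eq. (12), via explicit pair counts."""
    within, between = [], []
    for i, j in combinations(range(len(X)), 2):
        dist = np.linalg.norm(X[i] - X[j])
        (within if labels[i] == labels[j] else between).append(dist)
    between = np.sort(np.array(between))
    within = np.array(within)
    # s_plus: within-cluster distance strictly smaller than a between-cluster distance
    s_plus = (len(between) - np.searchsorted(between, within, side="right")).sum()
    # s_minus: within-cluster distance strictly larger than a between-cluster distance
    s_minus = np.searchsorted(between, within, side="left").sum()
    return (s_plus - s_minus) / (s_plus + s_minus)
```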

5.2 Ball-Hall

The Ball-Hall index is the mean dispersion averaged over clusters,

$\mathrm{BH} = \dfrac{1}{K} \sum_{k=1}^{K} \dfrac{\mathrm{WSS}_k}{n_k}.$ (13)

5.3 Banfield-Raftery

Similar to the Ball-Hall index, the Banfield-Raftery index is the weighted sum,

$\mathrm{BR} = \sum_{k=1}^{K} n_k \log\!\left(\dfrac{\mathrm{WSS}_k}{n_k}\right).$ (14)

5.4 BIC

The Bayesian information criterion (BIC) measures the optimal number of parameters needed by a model to describe a dataset. It was adapted to the clustering problem for the $k$-means algorithm Pelleg and is written,

(15)

where $C$ is a constant, $n$ is the number of data points, and $d$ is the number of dimensions.

5.5 C index

In each cluster $C_k$, there are $n_k(n_k - 1)/2$ pairs of points, and there are thus $N_W = \sum_k n_k(n_k - 1)/2$ pairs of points in the same cluster across the full dataset. Defining $S_W$ as the sum of the distances between all pairs of points belonging to the same cluster, $S_{\min}$ as the sum of the $N_W$ smallest pairwise distances in the full dataset, and $S_{\max}$ as the sum of the $N_W$ largest pairwise distances in the full dataset, the C index is

$C = \dfrac{S_W - S_{\min}}{S_{\max} - S_{\min}}.$ (16)

5.6 Caliński-Harabasz

With the between-cluster dispersion BSS defined in Eq. (5), the Caliński-Harabasz index is defined

$\mathrm{CH} = \dfrac{\mathrm{BSS} / (K - 1)}{\mathrm{WSS} / (n - K)},$ (17)

where WSS is the pooled within-cluster sum of squares, $\mathrm{WSS} = \sum_{k=1}^{K} \mathrm{WSS}_k$, and $n$ is the total number of points.

5.7 Davies-Bouldin

For each cluster $C_k$, define the mean distance of the cluster points to the cluster barycenter,

$\delta_k = \dfrac{1}{n_k} \sum_{x_i \in C_k} \lVert x_i - \bar{x}_k \rVert.$ (18)

The distance between the barycenters of clusters $C_k$ and $C_{k'}$ is $\Delta_{kk'} = \lVert \bar{x}_k - \bar{x}_{k'} \rVert$. The Davies-Bouldin index concerns the maximum of the ratio $(\delta_k + \delta_{k'})/\Delta_{kk'}$ for each $k$ (with $k' \neq k$), averaged over all clusters,

$\mathrm{DB} = \dfrac{1}{K} \sum_{k=1}^{K} \max_{k' \neq k} \dfrac{\delta_k + \delta_{k'}}{\Delta_{kk'}}.$ (19)
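A few of the indices in this appendix have off-the-shelf implementations that can serve as a cross-check; for example, scikit-learn provides the silhouette, Caliński-Harabasz, and Davies-Bouldin scores:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

X, _ = make_blobs(n_samples=500, centers=4, n_features=30, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))         # maximized at the preferred K
print(calinski_harabasz_score(X, labels))  # maximized
print(davies_bouldin_score(X, labels))     # minimized
```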

5.8 Dunn

These are really a family of measures known as the generalized Dunn indices, based on the ratio of some measure of between-cluster ($d_B$) and within-cluster ($d_W$) distances,

$\mathrm{Dunn} = \dfrac{\min_{k \neq k'} d_B(C_k, C_{k'})}{\max_{k} d_W(C_k)}.$ (20)

In this study we define $d_W(C_k)$ to be the maximum distance between two points in cluster $C_k$ and

$d_B(C_k, C_{k'}) = \min_{x_i \in C_k,\, x_j \in C_{k'}} \lVert x_i - x_j \rVert,$ (21)

the distance between clusters $C_k$ and $C_{k'}$, defined to be the minimum distance between a point in $C_k$ and a point in $C_{k'}$. This is known as single linkage.

5.9 G+

Using the definition of $s^-$ given above for the Baker-Hubert $\Gamma$ index,

$G^+ = \dfrac{2\, s^-}{N_t (N_t - 1)},$ (22)

where $N_t$ is the total number of distinct pairs of points in the full dataset, $N_t = n(n-1)/2$.

5.10 Isolation index

Denoting the set of the $k$ nearest neighbors of a point $x_i$ in cluster $C_j$ by $\mathcal{N}_k(x_i)$, the quantity $v_i$ is the number of points in $\mathcal{N}_k(x_i)$ that are not in $C_j$. The isolation index is then defined,

$\mathrm{IS} = \dfrac{1}{|X|} \sum_{x_i \in X} \left(1 - \dfrac{v_i}{k}\right),$ (23)

where $X$ denotes the set of all data points.

5.11 Krzanowski-Lai

An evaluation of the Krzanowski-Lai index for $K$ clusters requires three partitionings, at $K-1$, $K$, and $K+1$. Defining $\mathrm{DIFF}(K) = (K-1)^{2/d}\, \mathrm{WSS}(K-1) - K^{2/d}\, \mathrm{WSS}(K)$, we have

$\mathrm{KL}(K) = \left| \dfrac{\mathrm{DIFF}(K)}{\mathrm{DIFF}(K+1)} \right|.$ (24)

In principle, the value of this index depends on the clustering algorithm selected to perform the partitioning; in this work we chose $k$-means.

5.12 Hartigan

This index is the logarithm of the ratio of the between-cluster dispersion and the pooled within-cluster dispersion, both defined earlier,

$H = \log\!\left(\dfrac{\mathrm{BSS}}{\mathrm{WSS}}\right).$ (25)

5.13 McClain-Rao

As with $S_W$, the sum of the distances between all pairs of points belonging to the same cluster (defined for the C index above), we can denote the sum of the distances between all pairs of points not belonging to the same cluster as $S_B$. There are $N_B = N_t - N_W$ such pairs. The McClain-Rao index is the ratio of the mean within-cluster and between-cluster distances,

$\mathrm{MR} = \dfrac{S_W / N_W}{S_B / N_B}.$ (26)

5.14 PBM

Let $D_B$ be the maximum distance between any two cluster barycenters, $E_W$ be the sum of the distances of the points of each cluster to their barycenter, and $E_T$ be the sum of the distances of all points to the data barycenter. Then,

$\mathrm{PBM} = \left( \dfrac{1}{K} \cdot \dfrac{E_T}{E_W} \cdot D_B \right)^2.$ (27)

5.15 Point-biserial

This index measures the correlation between the distances between pairs of points and whether the points belong to the same cluster. In terms of quantities defined above,

(28)

5.16 RMSSTD

The root mean square standard deviation (RMSSTD) is defined

$\mathrm{RMSSTD} = \sqrt{ \dfrac{\sum_{k=1}^{K} \mathrm{WSS}_k}{d \sum_{k=1}^{K} (n_k - 1)} }.$ (29)

5.17 RS

The R-squared (RS) index is defined

$\mathrm{RS} = \dfrac{\mathrm{TSS} - \mathrm{WSS}}{\mathrm{TSS}},$ (30)

where $\mathrm{TSS} = \sum_i \lVert x_i - \bar{x} \rVert^2$ is the sum of the squared distances of each point in the dataset, $x_i$, from the dataset barycenter, $\bar{x}$.

5.18 Ray-Turi

With the quantity $\Delta_{kk'}$ defined as above for the Davies-Bouldin index, the Ray-Turi measure is defined

$\mathrm{RT} = \dfrac{\frac{1}{n} \sum_{k=1}^{K} \mathrm{WSS}_k}{\min_{k \neq k'} \Delta_{kk'}^2}.$ (31)

5.19 S_Dbw

The variances of the features across the dataset are collected together into a vector of size $d$, $V(X) = (\mathrm{Var}(X^1), \ldots, \mathrm{Var}(X^d))$, where $X^j$ is the vector of values of the $j$-th feature across all data points. The average scattering is then defined,

$\mathcal{S} = \dfrac{1}{K} \sum_{k=1}^{K} \dfrac{\lVert V(C_k) \rVert}{\lVert V(X) \rVert},$ (32)

where $V(C_k)$ is the variance vector restricted to points belonging to cluster $C_k$. Next, define the limit value,

$\sigma_{\lim} = \dfrac{1}{K} \sqrt{ \sum_{k=1}^{K} \lVert V(C_k) \rVert }.$ (33)

The S_Dbw index is based on the notion of density: the quantity $\gamma_{kk'}(x)$ for a given point $x$ counts the number of points in clusters $C_k$ and $C_{k'}$ that lie within a distance $\sigma_{\lim}$ of the point. This quantity is evaluated at the midpoint, $m_{kk'}$, of the segment joining the barycenters, and at the barycenters, $\bar{x}_k$ and $\bar{x}_{k'}$, of the two clusters, and used to define the quotient,

$R_{kk'} = \dfrac{\gamma_{kk'}(m_{kk'})}{\max\{\gamma_{kk'}(\bar{x}_k),\, \gamma_{kk'}(\bar{x}_{k'})\}}.$ (34)

The mean of $R_{kk'}$ over all pairs of clusters is then added to the average scattering $\mathcal{S}$ to form the index,

$\mathrm{S\_Dbw} = \mathcal{S} + \dfrac{1}{K(K-1)} \sum_{k \neq k'} R_{kk'}.$ (35)

5.20 Silhouette

The mean distance of a point $x_i$ in cluster $C_k$ to the other points within the same cluster is

$a(x_i) = \dfrac{1}{n_k - 1} \sum_{x_j \in C_k,\, j \neq i} \lVert x_i - x_j \rVert.$ (36)

The mean distance of the point $x_i$ to the points in a different cluster, $C_{k'}$, is $d(x_i, C_{k'})$. The minimum of these inter-cluster means is denoted $b(x_i) = \min_{k' \neq k} d(x_i, C_{k'})$. Each point can then be given a silhouette,

$s(x_i) = \dfrac{b(x_i) - a(x_i)}{\max\{a(x_i),\, b(x_i)\}},$ (37)

and an overall score defined as the mean silhouette score of a cluster averaged over all clusters,

$\mathrm{Sil} = \dfrac{1}{K} \sum_{k=1}^{K} \dfrac{1}{n_k} \sum_{x_i \in C_k} s(x_i).$ (38)

5.21 Tau

The $\tau$ index is a version of Kendall’s rank correlation coefficient applied to clustering problems; it is defined in terms of previously introduced quantities as

$\tau = \dfrac{s^+ - s^-}{\sqrt{N_B\, N_W\, \binom{N_t}{2}}}.$ (39)

5.22 Trace W

This index is simply the pooled within-cluster sum of squares,

$\mathrm{Trace\,W} = \sum_{k=1}^{K} \mathrm{WSS}_k.$ (40)

5.23 Wemmert-Gançarski

For a point $x_i$ in cluster $C_k$, define

$R(x_i) = \dfrac{\lVert x_i - \bar{x}_k \rVert}{\min_{k' \neq k} \lVert x_i - \bar{x}_{k'} \rVert}.$ (41)

Then the Wemmert-Gançarski index is defined

$\mathrm{WG} = \dfrac{1}{n} \sum_{k=1}^{K} \max\!\left\{ 0,\; n_k - \sum_{x_i \in C_k} R(x_i) \right\},$ (42)

where $n = \sum_k n_k$ is the total number of data points.

5.24 Xie-Beni

The Xie-Beni index is defined

$\mathrm{XB} = \dfrac{\frac{1}{n} \sum_{k=1}^{K} \mathrm{WSS}_k}{\min_{k \neq k'} d_B(C_k, C_{k'})^2},$ (43)

where $d_B$ is defined in Eq. (21).

References