Data science has been around in form and function if not necessarily in name for the better half of the last century. Among the main problems that we encounter is “clustering” which one can describe in layman’s terms as grouping elements from a (data)set into subsets which share some feature. Early uses for data clustering came from the world of particle physics wherein particle accelerators sent large numbers of elementary particles in opposite directions in hopes of some of them colliding with each other to produce nuclear debris. Several questions arose naturally from this process. First; how does one read off the data? Second; if two particles collide, how do we detect this? Third; is a collision head-on or are the particles glancing off each other? Fourth; which data are simply noise? An additional source of data clustering, which arrived later came in the form of astronomy. One can ask, given a telescope image, how do we discern which spots are stars, galaxies, nebulae, planets, etc?
Since digital computers were in the nascence in the late 1970s the algorithms needed to operate quickly and stably, with as little sensitivity to noise as possible. The big constraints then were speed and processing power. It was difficult to run through a small dataset thousands of times. This led to algorithms like CLASSY and k-means (reference to 1979 CLASSY article)
In the early decades of the twenty first century running through datasets of as much as a billion points requires very little runtime, and thus thousands of tests is a feasible computational task. This, however, is buoyed by the fact that datasets are becoming larger and more noisy, to the point where stability and noise sensitivity become even more important. At present there are dozens, if not hundreds, of well-known algorithms for clustering built for tackling specific problems. The question naturally arises as to which algorithm is the best. Perhaps a hybridized approach works for some industries better than others. Thus we have turned the question of clustering into a question of metrology. How does one measure the effectiveness of a particular algorithm? We attempt to answer this question by giving a metric inspired by the points systems for multi-sport events such as pentathlon, heptathlon, and the decathlon. To measure the athletic performance in such an event, we take into account that there are essentially two types of sports contained within (i) those in which the highest score is the winner, (i.e. pole vault, long jump, javelin), (ii) those in which the lowest score is the winner (i.e. hurdles, 100m, 800m, etc). In sports where the highest score is the winner one wishes to add scores: thus one considers a point value given by
In this case, is a reference point, perhaps the lowest qualifying score. For example in the pole vault, one only considers vaults that get above the 4m mark. In order to scale the unit into “points” one uses a scaling factor . The scaling factor also allows one to scale in such a way that vastly scattered scores (such as long jump measured in cm) does not drastically affect the score versus the 100m sprint measured in seconds. Finally, the weight
allows one to scale the score so that roughly every event can have roughly the same mean or same variance as whichever criterion an athletic committee would deem more desirable. One can add more parameters to give the scores some other sort of unity such as maximum range of scores, difference from theoretical maximum, etc. In our scoring we consider only three parameters, for reference, units, and weighting.
Our algorithmic scores are based on their utility for our purposes. The parameters we derive for our algorithm may be modified for any researcher to meet her/his needs. For our purposes we wish to have an algorithm with a high amount of stability. In order to quantify stability, we offer the heuristic definition that given a sufficiently large dataset withfeatures we shall test some subset features at a time and then exchange one feature at a time. For example, given a test dataset with 100 features, we shall test approximately 20 at a time. After we test 20 features we mark the clusters to which each datum belongs. We then repeat the algorithm while testing 20 features, 19 of which are similar. We calculate how many points move from their clusters (considering the 20th cluster to be the “same”). If nearly half of all points move, then we consider this algorithm to score badly in stability. In fact, we will mark this as zero-stable. We test this algorithm multiple times and each time count the proportion of points which move from test run to test run. The average of moving points will be our instability, and thus we define stability of an algorithm as follows:
Where the number of moving points is calculated empirically. Obviously, unless our dataset is sufficiently small, we cannot get the true measure of stability, but we give some confidence interval around the measured mean. From first year statistics we simply apply the student’s-score to find a confidence interval around the measured mean (stability).
where is the number of measurements and
are the measured standard deviation and mean respectively. Thus we set a threshold of allowable error and calculate the number of experiments necessary to achieve the desired precision. For our purposes, all datasets will be significantly larger than 30 points which a first course in elementary statistics tells us is a sufficient quantity to estimate with-score. For the sake of simplicity we shall take and score a 95% (approximately) confidence interval around our measured stability.
Once we have a quantity called stability, we now scale it to give a maximum of 100 points. The choice of 100 is completely arbitrary and may be moved to any number to fit a particular researcher’s needs. Our reference point, as mentioned above will be . That is to say, if on average half of the points move at every exchanged feature then the algorithm will not be considered stable. For example, k-means/medians will be considered highly unstable because the choice of initial means/medians dramatical affects the final outcome. This is particularly noticeable in k-means when considered dimensional subspaces of dimensional data.
Noise sensitivity is among the more difficult qualities to quantify with any consistency. Part of the problem is that we have different types of noise and of utmost importance is the arbitrary nature of the “shape” of a general dataset. Thus measuring noise sensitivity requires a little bit of bootstrapping and we will reverse engineer our datasets to test an algorithm’s efficacy. The initial thought in characterizing an algorithm’s sensitivity to noise saw the assignment of the score
Defining “causes a problem” is already subjective enough in the sense that this can mean widely varying things for different researchers. For example, DBSCAN generally handles noise well unless is too large. In this case, a noisy point can tie together two distinct clusters. This particular type of error can show up in drawing political district maps. The second problem with the initial assignment of score is that, if we know a priori that is, in fact, noise, then we can filter it in preprocessing. In essence the definition is still too heuristic to be computationally useful; that is, it’s not constructive.
The solution we propose to measuring is noise is to begin with a properly clustered dataset of arbitrary size. For example we build a spatial dataset which has “well-defined” clusters of different sizes, shapes, and densities. Once we have a well-clustered dataset we add noise to it and then recluster. The main measure then shall be entropy. More precisely, we will define two probability distributions and measure their Kullback-Leibler divergence.
First, let’s give some definitions and conventions about how we shall calculate sensitivity to noise.
be a dataset of points. Furthermore, let
be a set of points which we consider to be noise. Let be well-clustered with clusters .
We shall assign to the probability mass function which assigns to each cluster its proportional weight.
The entropy associated to is now defined as
Notice that so the negative sign gives us a positive value. This gives us a good measure of the already existing noise or uncertainty present within . Now we introduce the Kullback-Leibler divergence
Given two probability distributions and the Kullback-Leibler divergence is
This gives us a canonical way to measure the “difference” between two random variables with distributionsand . In our case, the distribution will always be the distribution defined by our well-clustered set .
Now we define the sets
so that has noisy points. There are such sets . for given . and if we consider all then there are total noisy sets. Now we create the new sets:
and we make the probability associations
From here we make the formal definition
After having made this formal definition, we realize that it is impossible to calculate this divergence exactly, since we cannot test every possible noisy dataset in existence. Thus we pick some amount of noisy points and compute several different divergences and again estimate the true divergence against our measured divergence using the -statistic.
Thus the larger number of samples we take, the more accurate our measured divergence is against the true value. As a quick counting exercise, let’s pick
then we pick . This gives us a good bound on the number of possible noisy datasets we can choose of size .
where we have used Stirling’s approximation
to compute the second line. Thus we have approximately
possible noisy datasets containing roughly half the noisy points to test. Even in low dimensions (i.e. ) and a small number of clusters, this allows us sufficiently many tests to get a measured divergence to within any chosen accuracy at the cost of running more experiments.
Given that we have a precise measure for divergence we read the consequences of the definition. Having no divergence means that an algorithm is not susceptible to noise. The larger the divergence, the higher the susceptibility to noise. Thus we seek a score in the form
By our choice of we can get a bound on maximal divergence. If our algorithm is highly susceptible to noise then each noisy point will create a new cluster. Then we see
Recalling that we have chosen we find a maximal divergence of
In this case we will pick as our reference point. This will give a minimum score of and a maximal score of . We wish to have the raw score as 100 before weighting so we pick .
We view noise (in)sensitivity as among the most important features in our metric, thus we give the initial weight . Notice here, that and thus we wish to increase its overall importance with a weighting factor, thus we pick It is important to note, however, that divergence is highly sensitive to the initial dataset and so each researcher must adjust according to her or his dataset.
Complexity is a well established concept in the theory of computing. In this case, we shall prefer to use the standard definition of big-Oh. Our choice of big-Oh rather than big-Theta is based on the fact that we want to score our algorithms based on the slowest that they might be. Indeed, a different researcher may prefer to use big-Theta, or perhaps for a widely distributed system, clock time. In all cases, the scoring procedure follows in a nearly identical fashion; the faster the better. In terms of big-Oh we will look only at the exponent.
Consider two functions . We say
if there exists a pair of positive numbers so that
Notice our slight deviation from the standard to . This more properly reflects the fact that is not an equivalence relation since it is not symmetric. This also allows us to avoid “tight” bounds. In essence we are really scoring the highest exponent of versus the highest exponent of . A slight reformulation gives us
Many of the “common” clustering algorithms (e.g. K-means, BIRCH, STING, CURE, wavecluster, Particle Swarm Optimization, Quantum Dynamic Clustering, etc)
have well established complexity classes. Amongst the algorithms we are testing, the complexity classes are of the form
where may be a combination of factors such as dimension of data, branching threshold, minimum number of points, , etc. With the size of the datasets we are scoring, none of these additional factors will be larger than where is some number less than one. We in fact, test a branching factor where we have . The logarithmic factor contributes very little to clock speed, so we score this as where is very small. Of course, when we consider the limit
we can consider to be essentially constant. However, in order to score our algorithms for speed, we need to set our reference point above our largest exponent. The largest algorithm we have tested has the complexity
Thus we have not seen an algorithm larger than where is the size of the dataset. We thus set our reference point at 4. Our score for complexity now takes the form
Many of the algorithms we have tested clock-in around thus we shall set . Furthermore we asses complexity to be highly important, but not at the cost of all other features, thus we set . Giving us a score for complexity as
4.1 Scalability and Dimensionality
In some instances, one may wish to differentiate between row search and column search. For a general algorithm, complexity takes these into account as a single quantity which is superceded by runtime complexity. Thus for the current metric, we allow complexity to take this score into account.
By homogeneity we shall mean the homogeneity of a cluster after having determined by a clustering algorithm. To accomplish this task we will work first on a single cluster. Within this cluster we will determine how to score its individual homogeneity and then we will consider how to score the clusters in ensemble to gain our overall score.
We begin our task by considering the points in our cluster to be the vertices of a graph. In this graph we will build the minimal spanning tree in which the weights are calculated by mutual reachability distance as in HDBSCAN.
Once we have built our minimal spanning tree we compute the mean and variance
of the weights of our minimal spanning tree. Intuitively we guess that having a very small mean implies that our cluster is highly homogeneous. The problem we run into is that we may have a dataset with highly varying densities in different regions. Thus we shall also like to take variance into account. Consider for a moment the following though experiments:
Consider three different clusters each with 100 points. The first is the set of evenly spaced points on a circle of radius , the second is the set of evenly spaced points on a cirle of radius , and the third a set of points on a cirle of radius , but the spacings are random (Gaussian, with fixed variance). Shown here
We see that the smaller circle with evenly spaced points is the most homogeneous. It is now a question of how to rank the other 2. Which is more homogeneous? Are they the same?
In order to answer this question we appeal to the Fisher information metric as in [4, 5]. Given a set of random variables obeying some probability distribution we build a statistical manifold from its parameter space. Given that many datasets we will encounter have a large number of pointswith clusters and we construct the statistical manifold in the variables:
The canonical metric for a statistical manifold is given by
In the case of Gaussian distributions the metric becomes
This gives us a length element of
We recognize that this is the metric for the upper half space model of hyperbolic space, with some scaling factor on the variance. In particular, for a single Gaussian distribution, we can see the scaling in distance becomes 
From our construction within clusters, we have the minimal spanning tree with weights determined by mutual reachability distance.
In this case we know that all distances and variances will be positive, and thus we wish to compute homogeneity as the fisher distance of from the origin. Unfortunately, every point is infinitely far from the origin by the construction of the metric. In order to get a meaningful measure, we shall scale up every variance one exactly one, and calculate distance of a cluster from . That is, we define the score by
Luckily much work has been done in this space and given we have an explicit expression for as
This formula comes from the fact that the hyperbolic distance is given in general by
When is small, we have a highly homogeneous cluster, and when it is large, we have an inhomogeneous cluster. We define the dataset’s homogeneity score by
This gives us the bounds
Recalling that gives us many () highly inhomogeneous clusters we set this as our reference point . Now we wish to score things from zero to one-hundred, so we pick . Finally we give our weight factor as . We pick since it gives our homogeneity score a little more importance than simple linearity, but will give extraordinarily large values only in extreme cases when is very tiny and is large. Even so we’re bounded above by
6 Intercluster Distance
Intercluster distance is mildly important for most needs, but we can still glean some useful information from it. If we have many large distances between clusters this means our algorithm separates clusters well. For example, in a highly dense dataset -means is likely to have very small separation between clusters, but Quantum Dynamic Clustering is likely to have a much large separation of clusters.
In order to define our intercluster distance, we consider the distance between two given clusters .
where is an arbitrarily chosen metric suitable for one’s purpose. In our case we shall choose the simple Euclidean metric.
Our overall distance for a chosen algorithm will be the sum of all intercluster distances.
Note that we only need to consider since and as is a metric . For the sake of computational ease one can simply define and equivalent
The second distance score will be easier to code, although computationally twice as expensive.
We get the bounds:
As there are pairs of clusters to consider. If all clusters are the same distance apart, then we have exactly .
To bring this to a score within we consider our reference point to be since -means will often produce a zero distance. In light of having a potential zero distance we shall scale our score by . Finally, we scale this by . This tells us that we will take the distance at “face-value.” There is some importance to distance, but having a large intercluster distance does not overcome an algorithm’s deficiencies in other scores.
In the current context, we shall take covolume to mean the amount of space which is not filled by our clusters. This is consistent with the physical chemistry definition rather than the number theoretic definition. Thus the scoring is rather straightforward, while the calculation may take a little effort.
Consider a dataset which is a subset of . We calculate the volume as the volume of the convex hull of . Suppose now that is the set of clusters in which we enumerate as
The volume of each cluster is the volume of its convex hull. Thus the covolume of is defined as
We shall score this on as a percentage. The higher the covolume the tighter the cluster bounds which gives us a score of type:
In this case we choose to be our reference point since we have with rare exception a covolume of zero from k-means clustering. That is to say, it does not produce tightly bound clusters. For our purposes we choose since the theoretical maximum is near 1. This will give us a maximum raw score of 100. We recognize that having tightly bound clusters is not of utmost importance and thus we give this a smaller weight, which reduces the percentage.
We then reveal our score for covolume as
The idea of “shape” is complex enough that we only give an overview in this section, for a more complete treatment see 
Once one has properly identified clusters, the metric of scoring how well an algorithm can classify an arbitrary shape arises. This turns out to be amongst the most complicated metrics to score. For example, many survey papers (cf give some references from sofya’s data clustering book) say that an algorithm picks up an arbitrary shape or it does not. We now ask the question, “how arbitrary?” Can we quantify such a statement? It is the belief of the authors that this too, can be computed in a consistent way.
Our strategy is as follows:
Define the boundary points of a cluster
Fit a “smooth enough” curve or surface (or volume) to the boundary.
Compute the integral of curvature squared over the surface.
Defining the clusters is the work of whichever algorithm we are implementing. After this we need to extract the boundary of the cluster. Our initial thought was to compute the Minimal Spanning Tree of the cluster (in the mutual reachability metric) and define the boundary to be all vertices of degree one. A moment of reflection, however will reveal many counterexamples, for example, a long, sparsely connected cluster with two endpoints near each other and all other vertices internal. (include a picture here). A second though was to compute the minimal spanning tree of the entire graph and compute the boundary in the style of Cheeger (give reference)
and picking out the vertices and as boundary points. Again, this fails when considering a graph such as
In the example above we have only two boundary points. This can’t uniquely define a “shape.”
In order to compute shape consistently we shall take the approach of polygonization of the boundary undertaken in . This approach uses the DeLauney Triangulation of the cluster and extracts boundary points by oriented the edges in the triangulation. This is similar to the approach of computing the shared nearest neighbor graph as in , but not with the intent of building a triangulation. One may also wish to consider a barycentric division in higher dimensions. Our algorithm will be to build a triangulation in 2 or 3 dimensions and extract the boundary in the style of .
Once we have extracted the boundary, we need to build a surface which satisfies several criteria. First, we would like the surface to encode some of the more exotic features of the data. That is, we would like to build a curve or surface which is not simply a collection of line segments or planar regions hastily glued together to give a manifold with corners reminiscent of the early 1990s “3D” video games. We will approach this task by the technique of Bzier curves and surfaces .
Once we have a surface with some curvature to it, and a reasonably good fit, we shall calculate the complexity of the shape of our cluster by computing the integral of its curvature squared. This can be done efficiently for a curve . For a surface we present the following calculation.
Let be our surface parametrization
The Gaussian () and mean () curvatures are defined as follow:
Then our curvature is given by the sum squared of principal curvatures
Now we integrate
Once we have scored each cluster we give the overall score by scaling as such
We wish our overall shape score to be as high as possible. A high shape score signifies that an algorithm picks up an arbitrary shape with high precision. For this reason, we shift our score back against the simplest possible shape to pick up, which is a circle (or sphere) of unit radius. In two dimensions the curvature is constant and thus a constant positive unit. Integrating over the circumference we get . For a sphere we have two principal curvatures each of which is identically one thus we are integrating the constant function 2. Integrating over the surface area we get . Since most of our calculations will be over surfaces, we shall set . This would say that an algorithm picking up a single cluster which is a uniform circular object scores as low as possible. There is no practical limit to how high an individual cluster can score, so we scale by dividing by our maximum cluster score. . Finally, we regard this feature with utmost importance as to an algorithm’s overall accuracy, and thus we shall weight it with . If necessary, a researcher should scale this back if the highest score is arbitrarily high.
9 Our Scores
-  Benchmarking Performance of Clustering Algorithms in Python: http://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html
-  A Fresh Graduate’s Guide to Software Development Tools and Technique:Chapter 6, Scalability, Khare, Huang, Doan, Kanwal, http://www.comp.nus.edu.sg/seer/book/2e/Ch06.%20Scalability.pdf
A New graph-Theoretic Approach to Clustering and Segmentation,
M. Pavan, M. Pelillo,
Learning on Statistical Manifolds for Clustering and Visualization
K. Carter, R. Raich, A.O. Hero
Fisher Information Distance: A Geometrical Reading
S. Costa, S. Santos, J. Strapason
Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection
R. Campello, D. Moulavi, A. Zimek, J. Sander
ACM Transactions on Knowledge Discovery from Data. 10 (1): 1–51. doi:10.1145/2733381.
Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data2003
L. Ertösz, M. Steinbach, V. Kumar
Polygonization of Point Clusters through Cluster Boundary Extraction for Geographical Data Mining I. Lee, V. Estivill-Castro
Efficient Squared Curvature C. Nieuwhuis, E. Toeppe, L. Gorelick, O. Vekslert, Yu. Boykov
The Shape Metric for Clustering Algorithms
C. Alexander, S. Akhmametyeva, 2017