Clustering is a fundamental primitive for the unsupervised analysis of datasets, and finds applications in a number of areas including pattern recognition, bioinformatics and biomedicine, data management. In its more general definition, clustering requires to identify groups of elements where each group exhibits high similarity among its members while elements in different groups are dissimilar. Starting from this common definition, several algorithms have been proposed to identify clusters in a dataset , often formalizing clustering as an optimization problem (based on a cost function). The resulting optimization problems are usually computationally hard to solve, and algorithms providing rigorous approximations are often sought in such cases. More recently the focus has been on developing efficient methods that scale to the massive size of modern datasets [3, 17, 4, 7, 18], while still providing rigorous guarantees on the quality of the solution.
A common step after clustering has been performed is clustering evaluation
(sometimes called cluster validation). Cluster validation usually employs an evaluation measure capturing the goodness of a clustering structure. Evaluation measures are classified intounsupervised or internal measures, which do not rely on external information, and supervised or external measures, which assess how well a cluster structure matches the structure defined by external information . While external measures are useful only when additional external knowledge regarding the cluster structure of the data is available, internal measures find application in every scenario.
The most commonly used internal measure for clustering evaluation is the silhouette coefficient  (for brevity, called silhouette in this paper). The silhouette of a clustering is the average silhouette of all elements in the clusters, and, in turn, the silhouette of an element in some cluster is defined as the ratio , where is the average distance of from the other elements of , and is the minimum average distance of from the elements of another cluster . In other words, provides and indication to what extent is closer (on average) to elements in its cluster than to elements in the “closest” cluster . The naïve computation of the silhouette for a clustering of elements requires distance calculations, which is unfeasible for massive datasets. Surprisingly, while several methods have been proposed to efficiently cluster large datasets with rigorous guarantees on the quality of the solution, there are no methods to efficiently approximate the silhouette featuring provably high accuracy.
1.1 Related Work
While the silhouette is one of the most popular internal measures for clustering evaluation [19, 25, 23], the quadratic complexity of the naïve exact calculation makes its use impractical for clusterings of very large datasets. For this reason, some attempts have been made to propose variants that are faster to compute, or to simplify its calculation in some special cases.
Hruschka et al.  present the simplified silhouette for the evaluation of clusterings obtained on the basis of objective functions involving squared Euclidean distances (e.g., -means clusterings ). The simplified silhouette is a variant of the silhouette, where for each element in a cluster, the quantity (resp., ) in the original definition is redefined as the (squared) distance between and the centroid of its cluster (resp., the closest centroid of another cluster). In this fashion, the complexity of the whole computation reduces to , improving substantially over the naïve calculation of the exact silhouette for reasonable values of . While the case studies considered by Hruschka et al.  and Wang et al.  provide empirical evidence that the simplified silhouette can be an effective evaluation measure for clusterings returned by Lloyd’s algorithm , there is no evidence of its effectiveness for other types of clusterings (e.g, clusterings based on other distance functions). In fact, even for clusterings based on the -means function, the analysis of Wang et al.  shows that the discrepancy between the original silhouette and the simplified silhouette can grow very large.
A heuristic trick for speeding-up the computation of the silhouette for clusterings based on Euclidean distances was proposed by Frahling and Sohler . For each element of a cluster , while the term is computed according to its definition, in an attempt to reduce the operations needed to compute the term , the heuristic first determines the average distance between and the elements of the cluster , whose centroid is closest to , and then sets in case the distance between and the centroid of any other cluster is larger than or equal to , since in this case there is no need to compute any other average distance . However, when this is not the case, must be computed according to its definition. Despite the practicality of the heuristic, its worst case complexity remains clearly quadratic.
In Apache Spark111https://spark.apache.org/, one of the most popular programming frameworks for large-scale data processing, optimized methods are provided for computing the silhouette of clusterings under -dimensional squared Euclidean and one formulation of cosine distance. Indeed, in these specific cases, simple algebra suffices to show that precomputing, for each of the clusters, a limited number of values dependent on the coordinates of the cluster’s points, is sufficient to yield a fully parallelizable algorithm featuring work. We wish to observe that computing the silhouette by making use of squared distance measures tends to amplify distances and, consequently, differences between distances. The result is that the module of the silhouette is pushed closer to than what would be obtained with linear distances, thus amplifying positive and negative scores
1.2 Our Contributions
In this work, we target the problem of the efficient computation of an accurate estimate of the silhouette of a given clustering under general metric distances. In this regard, our contribution is fourfold:
We develop the first efficient, sampling-based algorithm for estimating the silhouette with provable approximation guarantees. In our algorithm, we employ a Probability Proportional to Size (PPS) sampling scheme, pioneered in the context of distance query processing . For any fixed , our algorithm approximates the silhouette of a -clustering of elements within an additive error with probability at least , using only distance computations, which constitutes a dramatic improvement over the distance computations required by the naïve exact algorithm.
We generalize our algorithm to compute rigorous approximations of other internal measures, such as cohesion and separation.
We perform an extensive suite of experiments on real and synthetic datasets to assess the effectiveness and efficiency of our algorithm, and to compare it with known approaches for fast silhouette computation/approximation. The experiments show that, in practice, our algorithm provides silhouette estimates featuring very low absolute error (less than
in most cases) with small variance () using a very small fraction of the distance computations required by the exact calculation. Moreover, we demonstrate that the returned estimates are far superior in quality to those returned by a naïve yet natural heuristics based on uniform sampling, or by a generalization of the method in  to general distances, which are both affected by errors that can be so large to render the returned estimates of little use in the most typical scenario of application of the silhouette coefficient, that is, the choice of the correct cluster granularity. We also provide evidence that the MapReduce implementation of our algorithm enables the estimation of the silhouette for clusterings of massive datasets (e.g., 1 billion elements) for which the exact computation is out of reach, and show that our general (unoptimized) distributed implementation is also competitive with the specialized (optimized) Spark routine returning the exact silhouette value only in the case of squared Euclidean distances.
It is important to remark that while previously known approaches to efficiently compute or approximate the silhouette have been developed for specific distance functions, namely squared Euclidean and cosine distances, our algorithm provides provably accurate silhouette estimations for clusterings based on any metric distance.
1.3 Organization of the paper
The rest of the paper is structured as follows. Section 2 contains the description of our proposed strategy for silhouette estimation (Subsection 2.1), its accuracy and performance analysis (Subsection 2.2), the generalization of the approach to other internal evaluation measures (Subsection 2.3), and the MapReduce implementation (Subsection 2.4). Section 3 reports the results of an extensive suite of experiments performed to evaluate the effectiveness and scalability of our silhouette estimation algorithm on synthetic and real datasets. Section 4 concludes the paper with some final remarks.
Consider a metric space with distance function , and let be a dataset of elements. Let also be a -clustering of , that is, a partition of into disjoint non-empty subsets called clusters. A popular measure to assess clustering quality was introduced by Rousseeuw in 1987 . Specifically, the silhouette of an element belonging to some cluster , is defined as
denote, respectively, the average distance of from the other elements in its cluster, and the minimum average distance of from the elements in some other cluster. The silhouette of the clustering is the average silhouette of all the elements of , namely:
From the above definitions it immediately follows that the values , with , and range in . A positive value for implies that has been assigned to an appropriate cluster, while a negative value implies that there might be a better cluster where could have been placed. Therefore, can be interpreted as a measure of the quality of the clustering from the perspective of element . In turn, provides a global measure of the quality of the whole clustering, where a value closer to 1 indicates higher quality. The exact computation of requires distance calculations, which is prohibitive when dealing with large datasets. In the following subsection, we present a randomized strategy to yield an estimate of , which is accurate within a provable error bound, with sufficiently high probability, and amenable to an efficient distributed computation.
2.1 A Fast Algorithm for Silhouette Estimation
Consider the estimation of the silhouette for a -clustering of a set of elements from a metric space . For each and , define
Let belong to a cluster . It is easy to see that the quantities and in the definition of the silhouette can we rewritten as and . Building on this observation, our approach to approximating is based on estimating the ’s by exploiting the Probability Proportional to Size (PPS) sampling strategy inspired by the work of Chechik et al. . In particular, we determine a collection of small samples, one for each cluster of , which guarantees that can be estimated within a user-bounded error, with high probability. The details are given in what follows.
Consider a fixed error tolerance threshold and a probability . Our algorithm, dubbed PPS-Silhouette (see Algorithm 1 for the pseudocode), consists of two steps. In Step 1, for each cluster , the algorithm computes a sample of expected size for a suitably chosen constant , while in Step 2 the ’s are used to approximate the ’s. In Step 1, each cluster is processed independently. If , then is set to , otherwise, the following actions are performed.
An initial sample is obtained by performing Poisson sampling over where each is included in independently with probability . (This initial sample will contain an element relatively close to a majority of the elements of , with sufficiently high probability, a property which is necessary to enforce the quality of the final sampling.)
For each the value is computed. Then, for each , a PPS coefficient is computed as the maximum between (which corresponds to uniform sampling) and the maximum of the values , with , which represent the relative contributions of to the ’s of the sample points.
Poisson sampling is performed over , where each element is included in the sample independently with probability .
In Step 2, for each and each cluster , the sample is used to compute the value
which is an accurate estimator of , as will be shown by the analysis. Once all these values have been computed, then, for each belonging to a cluster we compute the estimates
which are in turn used to estimate the silhouette as
Finally, we estimate the silhouette of the whole clustering as
In this section we show that, with probability , the value computed by PPS-Silhouette approximates the true silhouette within a small error bound, expressed in terms of . The key ingredient towards this goal, stated in the following theorem, is a probabilistic upper bound on the relative error of the estimate with respect to true value .
There is a suitable choice of the constant in the definition of the expected sample size used by PPS-Silhouette, which ensures that, with probability at least , for every element and every cluster , the estimate is such that
The proof mimics the argument devised in . Recall that . Consider an arbitrary cluster with more than elements (in the case the theorem follows trivially). For an element , let denote the median of the distances from to all other elements of . Element is called well positioned if . It is easy to see that at least half of the elements of are well positioned. Hence, the initial random sample will contain a well positioned element with probability at least . An easy adaptation of the proof of [8, Lemma 12], shows that if contains a well positioned element and is a suitable constant, then the Possion sample computed with the probabilities derived from is such that
with probability at least . By the union bound, it follows that the probability that there exists a cluster such that the initial sample does not contain a well positioned element is at most . Also, by conditioning on the fact that for all clusters the initial sample contains a well positioned node, by using again the union bound we obtain that the probability that there exists an element and a cluster for which is at most , which concludes the proof. ∎
From now on, we assume that the relative error bound stated in Theorem 1 holds for every element and every cluster , an event which we will refer to as event . Consider an arbitrary element belonging to some cluster and let , , and be the values computed by PPS-Silhouette.
If event holds, then
The proof follows immediately from the definition of and the relative error bound assumed for .
If event holds, then
Consider an arbitrary cluster and let and . As in the previous lemma we can immediately see that
Recall that and that . Suppose that and , for some, possibly different, clusters and . We have:
and the lemma follows. ∎
Now we establish a bound on the relative error for the term appearing in the denominator of the formula defining . Define and . We have:
If event holds, then
Suppose that , hence (the case can be dealt with similarly). If , the bound follows directly from Lemma 1. Instead, if , hence , the bound follows since
As a last technical step, the following lemma establishes an error bound on the estimation of the silhouette .
If event holds, then
An upper bound to the absolute error incurred when estimating through the value computed by PPS-Silhouette, is established in the following theorem, whose proof is an immediate consequence of the definition of the two quantities (Equations 1 and 2) and of Theorem 1 and Lemma 4.
Let be a dataset of elements, and let be a -clustering of . Let be the estimate of the silhouette of the clustering computed by PPS-Silhouette for given parameters and , with , and for a suitable choice of constant in the definition of the sample size. Then,
with probability at least .
We now analyze the running time of PPS-Silhouette, assuming that the distance between two points can be computed in constant time. In Step 1, the running time is dominated by the computation of the distances between the points of each sufficiently large cluster and the points that form the initial sample . A simple application of the Chernoff bound shows that, with high probability, from each such cluster , points are included in . Thus,
distance computations are performed, altogether. For what concerns Step 2, its running time is dominated by the computation of the distances between all points of and the points of the union of all samples extracted from the clusters. A simple adaptation of the proof of [8, Corollary 11] and a straightforward application of the Chernoff bound shows that there are sample points overall, with high probability, where
is the expected sample size for each cluster. Therefore, Step 2 performs
distance computations overall. As a consequence, the running time of the algorithm is which, for reasonable values of , and , is a substantial improvement compared to the quadratic complexity of the exact computation.
2.3 Generalization To Other Measures
The PPS-based sampling strategy adopted for approximating the silhouette of a clustering can be applied to other measures used for internal clustering evaluation, which are based on sums of distances between points of the dataset. This is the case, for instance, of measures that compare the cohesion (i.e., the average intracluster distance) and the separation (i.e., the average intercluster distance) of the clustering .
More precisely, consider a -clustering and define the cohesion and separation of as
respectively. (These measures have been also employed to assess the average cluster reliability for a clustering of a network where distances correspond to connection probabilities [15, 6].) We can rewrite the above measures in terms of the sums defined in the previous section, as follows
Clearly, approximations of the ’s with low relative errors immediately yield approximations with low relative error for and . Specifically, define and as the respective approximations to and obtained by substituting each occurring in the above equations with the value computed within PPS-Silhouette. The following theorem is an immediate consequence of Theorem 1.
Let be a dataset of elements, and let be a -clustering of . Let and be the respective approximations to and based on the values computed within PPS-Silhouette for given parameters and , with , and for a suitable choice of constant in the definition of the sample size. Then,
with probability at least .
2.4 Map-Reduce Implementation
A MapReduce algorithm [9, 13, 20, 14] executes in a sequence of parallel rounds. In a round, a multiset of key-value pairs is first transformed into a new multiset of key-value pairs by applying a given map function independently to each pair, and then into a final multiset of pairs by applying a given reduce function independently to each subset of pairs of having the same key. The application of a reduce function to a set of pairs with same key is referred to as a reducer . The model features two parameters, , the local memory available for each application of a map/reduce function, and , the aggregate memory ever required over the entire course of the algorithm. In what follows, we describe a 4-round MapReduce implementation of PPS-Silhouette, analyzing the memory requirements of each round222We remark that the MapReduce algorithms presented in this paper also afford an immediate implementation and similar analysis in the Massively Parallel Computation (MPC) model , which is popular in the database community..
Throughout the algorithm, a key-value pair will be written as , where is an -dimensional value. Consider a -clustering of a dataset of elements. We assume that, initially, the clustering is represented by the following set of key-value pairs:
We also assume that the values , , used in PPS-Silhouette, as well as all ’s, are given to the program. In order to exploit parallelism, our implementation partitions the points of into subsets of size each, to be processed in parallel, where is a design parameter which may be used to exercise suitable parallelism-memory tradeoffs. The four rounds are described in detail below.
Map: Map each pair into the pair . Also, with probability select to be part of the initial Poisson sample , and produce the additional pairs , with .
Reduce: Let be the set of elements for which there is a pair . (Observe that the ’s form a balanced partition of into subsets.) For each pair compute the sum of the distances from to all elements of and produce the pair . Instead, for each pair produce the pair .
By the analysis of the previous subsection, we know that, with high probability, points have been selected to be part of the initial samples ’s (a copy of all these points, represented by pairs with 1 at the end, exists for each key ). Therefore, this round requires local memory , with high probability, and aggregate memory .
Map: Map each pair into itself, and each pair into the pairs , with .
Reduce: Each reducer, corresponding to some key , now contains all the information needed to compute the values for each cluster and element . This in turn implies that the reducer can also compute all sampling probabilities for , by executing the operations specified in PPS-Silhouette. The reducer computes these probabilities and produces the pairs , one for each .
By the observations made at the end of Round 1, this round requires local memory , with high probability, and aggregate memory .
Map: Map each pair into the pair . Also, with probability select to be part of the Poisson sample , and produce the additional pairs , with .
Reduce Phase: Each reducer, corresponding to some key , now contains all the information needed to compute the approximate sihouette values for each element in . The reducer computes the sum of these values, denoted by , and produces the single pair .
By the analysis of the previous subsection, we know that, with high probability, the size of union of the Poisson samples ’s is , with . Therefore, this round requires local memory , with high probability, and aggregate memory .
Map: identity map.
Reduce: The reducer corresponding to key 0 computes and returns .
This round requires .
Overall, the above 4-round MapReduce algorithm requires local memory , with high probability, and aggregate memory . It is easy to see that for fixed values of and , and for , by choosing , we obtain local memory and linear aggregate memory, with high probability. We remark, that for reasonable values of , the required local memory is substantially sublinear, which is the “holy grail” of MapReduce algorithms .
3 Experimental Evaluation
We ran an extensive experimental evaluation of our PPS-Silhouette algorithm. The goals of our experimental evaluation are threefold: first, to evaluate the accuracy of the approximation of the silhouette provided by our algorithm and to compare it against known heuristic baselines and exact (specialized) methods; second, to assess the impact of using the approximation provided by our algorithm in place of the exact value of the silhouette in a typical scenario (i.e., to identify the most suitable number of clusters); third, to evaluate experimentally the scalability of our MapReduce implementation on a distributed platform. The rest of this section is organized as follows: Subsection 3.1 describes the heuristic baselines considered in the evaluation of our algorithm; Subsection 3.2 provides details on the implementation and on the software/hardware environment used; Subsection 3.3 describes the datasets considered in our evaluation and the parameters used for their analysis; Subsection 3.4 discusses the results on the quality of the approximation provided by our algorithm; and, finally, Subsection 3.5 shows the scalability results.
We gauge the performance of our PPS-Silhouette algorithm against two heuristics: one based on uniform sampling, and the other based on the simplified silhouette introduced in  for the evaluation of -means clusterings, generalized to apply to arbitrary clusterings and metric spaces. More precisely, the first heuristics consists essentially in the execution of Step 2 of PPS-Silhouette, where the samples are chosen via a uniform Poisson sampling, using the same probability for each . The comparison with this heuristics aims at highlighting the added value of PPS vs straightforward uniform sampling. We refer to this heuristics as the uniform sampling algorithm. The second heuristics adapts the simplified silhouette in  by substituting the use of the centroid of each cluster (which makes sense only when squared Euclidean distances are used) with that of the medoid , defined as
The simplified silhouette of a point in cluster is then estimated as , where and . The simplified silhouette of the clustering is the arithmetic average of the simplified silhouette of its elements.
3.2 Implementation and Environment
For the assessment of the accuracy of our approach, we devised Java-based sequential implementations of PPS-Silhouette, the two heuristic baselines described in the previous subsection, and the exact computation of the silhouette (based on the definition). Also, for the assessment of scalability, we devised a MapReduce implementation333All our implementations are available at https://github.com/CalebTheGame/AppxSilhouette of PPS-Silhouette using the Apache Spark framework (with Java)444https://spark.apache.org. In the implementations of PPS-Silhouette, the size of the initial samples is obtained by fixing . Moreover, rather than fixing error tolerance thresholds , we varied the accuracy by setting different values for the expected size of samples .
The scalability experiments were run on CloudVeneto555http://cloudveneto.it, an institutional provider of cloud computing services, that granted us access to 9 PowerEdge M620 nodes, each with octa-core Intel Xeon E5-2670v2 2.50 GHz and 16 GB of RAM, running Ubuntu 16.04.3 LTS with Hadoop 2.7.4 and Apache Spark 2.4.4.
3.3 Datasets and Parameters
In our experimental evaluation we considered both synthetic datasets and real datasets. Synthetic datasets have been chosen so to contain a few outlier points, with the intent of making the accurate estimation of the silhouette more challenging. Specifically, for different values of(ranging from tens of thousands to one billion), we generated points uniformly at random within the sphere of unit radius, centered at the origin of the 3-dimensional Euclidean space, and random points on the surface of the concentric sphere of radius . For what concerns real datasets, we used reduced versions of “Covertype" and “HIGGS" datasets666https://archive.ics.uci.edu/ml/datasets/. The former contains 100000 55-dimensional points, corresponding to the tree observations from four areas of the Roosevelt National Forest in Colorado; the latter contains 500000 7-dimensional points and is used to train learning algorithms for high-energy Physics experiments (as in previous works [7, 17], only the 7 summary attributes of the original 28 have been retained). For all datasets, the clusterings used to test the algorithms have been obtained by applying the
-medoids clustering algorithm implemented in the Java Machine Learning Library777https://github.com/AbeelLab/javaml , using the Euclidean distance.
3.4 Quality of Approximation
In the first experiment, aimed at demonstrating the superiority of PPS over uniform sampling, we compared the accuracy of the silhouette estimations returned by our PPS-Silhouette algorithm and by the uniform sampling algorithm, using synthetic data. In order to make the computation of the exact value feasible, we considered instances of the synthetic dataset with a relatively small number of elements () and -clusterings with . We ran both PPS-Silhouette and the uniform sampling algorithm with values of the (expected) sample size . We evaluated accuracy of an estimation thorugh the absolute error . Table 1 reports maximum, average, and variance of the absolute error over 100 runs of the two algorithms, for each configuration of parameters and .