Hierarchical agglomerative clustering involves successively grouping items within a dataset together, based on similarity of the items. The algorithm finishes once all items have been linked, resulting in a hierarchical group similarity structure. Given that two items are merged together, we must determine how similar that merged group is to the remaining items (or groups of items). In other words, we have to recalculate the dissimilarity between any merged points. This dissimilarity between groups can be defined in many ways, and these are known as linkage methods. Standard, established linkage methods include single, complete, average and centroid linkage. Minimax linkage, which was first introduced in (Cheung et al., 2004) and formally analyzed in (Bien and Tibshirani, 2011), will be the subject of our evaluation. We describe hierarchical agglomerative clustering and the linkage methods precisely as follows.
Given items x_1, …, x_n, dissimilarities d(x_i, x_j) between each pair x_i and x_j, and dissimilarities d(G, H) between groups G and H, hierarchical agglomerative clustering starts with each item in its own group, and repeatedly merges the pair of groups G and H with the smallest dissimilarity d(G, H). The group dissimilarity d(G, H) is determined by the linkage method, defined as follows:
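The merging procedure above can be sketched as a naive loop over candidate pairs of groups. This is an illustrative sketch only (the `agglomerate` helper, the `single` linkage lambda, and the five-point demo data are ours; practical implementations use far more efficient algorithms):

```python
import numpy as np

def agglomerate(D, linkage_fn, k):
    """Naive agglomerative clustering down to k groups.

    D is an n x n pairwise dissimilarity matrix; linkage_fn(D, G, H)
    returns the dissimilarity between two groups of item indices.
    """
    groups = [[i] for i in range(len(D))]
    while len(groups) > k:
        # find the pair of groups with the smallest linkage value
        i, j = min(
            ((a, b) for a in range(len(groups)) for b in range(a + 1, len(groups))),
            key=lambda p: linkage_fn(D, groups[p[0]], groups[p[1]]),
        )
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups

# demo with single linkage on five points on a line
single = lambda D, G, H: min(D[x, y] for x in G for y in H)
pts = np.array([0.0, 1.0, 2.0, 10.0, 11.0])
D = np.abs(pts[:, None] - pts[None, :])
print(agglomerate(D, single, 2))  # → [[0, 1, 2], [3, 4]]
```

Any of the linkage methods defined below can be plugged in as `linkage_fn`.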
Single linkage: d(G, H) = min_{x ∈ G, y ∈ H} d(x, y). The distance between two clusters is defined as the distance between the closest pair of points across the clusters.
Complete linkage: d(G, H) = max_{x ∈ G, y ∈ H} d(x, y). The distance between two clusters is defined as the distance between the farthest pair of points across the clusters.
Average linkage: d(G, H) = (1/(|G||H|)) Σ_{x ∈ G} Σ_{y ∈ H} d(x, y). The distance between two clusters is defined as the average of the distances over all pairs of points across the clusters.
Centroid linkage: d(G, H) = d(x̄_G, x̄_H), where x̄_G = (1/|G|) Σ_{x ∈ G} x and x̄_H = (1/|H|) Σ_{y ∈ H} y. The distance between two clusters is defined as the distance between the centroid (mean) of the points within the first cluster and the centroid (mean) of the points in the second cluster. Often, the centroids have no intuitive interpretation (e.g., when items are text or images).
Minimax linkage: d(G, H) = min_{x ∈ G ∪ H} r(x, G ∪ H), where r(x, G) = max_{y ∈ G} d(x, y), the radius of a group of nodes G around x. Informally, each point x belongs to a cluster whose center (prototype) c satisfies d(x, c) ≤ d(G, H).
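The minimax linkage dissimilarity can be computed directly from a pairwise distance matrix. The sketch below is our own illustration of the definition (the helper names `radius` and `minimax_linkage` and the demo data are ours):

```python
import numpy as np

def radius(D, x, G):
    """r(x, G): distance from point x to the farthest point in group G."""
    return max(D[x, y] for y in G)

def minimax_linkage(D, G, H):
    """d(G, H) = min over x in G ∪ H of r(x, G ∪ H)."""
    GH = list(G) + list(H)
    return min(radius(D, x, GH) for x in GH)

# demo: five points on a line, absolute-difference distances
pts = np.array([0.0, 1.0, 2.0, 10.0, 11.0])
D = np.abs(pts[:, None] - pts[None, :])
print(minimax_linkage(D, [0, 1], [2]))  # → 1.0 (point 1 is the prototype)
```

The point achieving the minimum is the prototype of the merged cluster.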
Bien and Tibshirani (Bien and Tibshirani, 2011) expand upon Cheung et al. (Cheung et al., 2004) by providing a more comprehensive evaluation of minimax linkage. In particular, they compare minimax linkage to the standard linkage methods using five data sets and two different evaluation metrics. Additionally (although not the focus of the current paper), the authors prove several theoretical properties, for example that dendrograms produced by minimax linkage cannot have inversions and are robust to some data perturbations. They also perform additional evaluations, compare prototypes to centroids, and benchmark computational speed.
The comparisons of minimax linkage to standard linkage methods in (Bien and Tibshirani, 2011) are summarized in Table 1. For the colon and prostate cancer data sets, distance to prototype was calculated for minimax linkage, but not for the other linkage methods, since those two data sets were used to compare prototypes to centroids, rather than compare the different linkage methods. More details on the data sets and metrics used are in Sections 2 and 3.
|Data set||Distance to prototype||Misclassification rate|
|Colon Cancer||Not quite||No|
|Prostate Cancer||Not quite||No|
“Benchmarking in cluster analysis: A white paper” (Mechelen et al., 2018) makes multiple recommendations for analyses of clustering methods. We focus on those for data sets and evaluation metrics.
The first recommendation in (Mechelen et al., 2018) with respect to choosing data sets is to “make a suitable choice of data sets and give an explicit justification of the choices made.” This was not done thoroughly in the original Bien and Tibshirani paper. It was not explained why the particular data sets were chosen for the different evaluations, and features of the data sets were not fully described. In our study, we both add additional data sets and justify existing ones (which include both synthetic and empirical data) in Section 2.2.
With respect to evaluation metrics, (Mechelen et al., 2018) recommends that we think carefully about criteria used and justify our choices. They also recommend that we “consider multiple criteria if appropriate.” Additionally, criteria should be applied across all data sets, and this is one of our main critiques of the existing evaluation, where not all of the data sets used were evaluated on all the criteria suggested (in Table 1, all cells should be “Yes”).
The distance-to-prototype metric was well justified (it is the crux of minimax linkage), but the misclassification rate was not. While interpretable cluster representatives are important, a researcher may also care about how accurately the algorithm classifies the items in the data set. That being said, when there are a large number of small clusters, the misclassification rate might not be the best measure of performance. In such cases, when working with pairwise comparisons, there is often a large class imbalance problem: most pairs of items do not truly match. A method could achieve a very low misclassification rate simply by predicting all pairs to be non-matches. We therefore chose to include precision and recall as additional metrics to evaluate clustering quality.
Finally, a suggestion of (Mechelen et al., 2018) is to fully disclose data and code. Unlike the original paper, we supply the code and data that accompany this paper, for full reproducibility. We have also written an R package, clusterTruster, available on GitHub (https://github.com/xhtai/clusterTruster), which allows additional evaluations to be performed on user-supplied data.
This paper is designed to be a neutral benchmark study of minimax linkage, and the specific contributions are:
An evaluation of all data sets on all of the criteria in (Bien and Tibshirani, 2011)
A better assessment of performance through the use of precision and recall
An evaluation on additional (diverse) data sets not in (Bien and Tibshirani, 2011)
Providing publicly available code and an R package that allow for full reproducibility and transparency, while simplifying the process of making additional evaluations on user-supplied data.
2. Benchmark Study
In this benchmark study we both introduce new evaluation metrics (which we apply to every data set) and add new data sets, providing a more comprehensive analysis. These are detailed as follows.
2.1. Evaluation metrics
Our first improvement to (Bien and Tibshirani, 2011) is to utilize all evaluation metrics provided in the paper on all of the data sets (as opposed to some metrics on some data sets). In other words, any instance of “Not quite” or “No” within Table 1 should be changed to “Yes.” Additionally, we introduce precision and recall as additional evaluation metrics. The evaluation metrics used are described as follows.
Distance to prototype
The distance to prototype is measured by the maximum minimax radius. The radius of a group of nodes G around a point x was defined in Section 1 as r(x, G) = max_{y ∈ G} d(x, y). This is the distance of the farthest point in cluster G from point x.
The prototype is selected to be the point in G with the minimum radius, and this radius is known as the minimax radius, r(G) = min_{x ∈ G} r(x, G).
(Using this notation, minimax linkage between two clusters G and H can also be written as d(G, H) = r(G ∪ H).)
Now, for a clustering with k clusters G_1, …, G_k, each of the clusters is associated with a minimax radius, r(G_1), …, r(G_k). We consider the maximum minimax radius, max_{j = 1, …, k} r(G_j), in other words the “worst” minimax radius across all clusters. In this sense, the maximum minimax radius can be thought of as a measure of the tightness of clusters around their prototypes. A small value indicates that points within each cluster are close to their prototype, meaning that the prototype is an accurate representation of points within the cluster.
Minimax linkage successively merges the pair of clusters that produces the smallest minimax radius for the resulting merged cluster, so we would expect minimax linkage to perform best among the linkage methods in terms of producing small maximum minimax radii.
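Given a pairwise distance matrix and a vector of cluster labels, the maximum minimax radius defined above can be computed as follows (an illustrative sketch; the function name and demo data are ours):

```python
import numpy as np

def max_minimax_radius(D, labels):
    """max_j r(G_j), where r(G) = min_{x in G} max_{y in G} d(x, y)."""
    radii = []
    for g in np.unique(labels):
        idx = np.where(labels == g)[0]
        sub = D[np.ix_(idx, idx)]            # within-cluster distances
        radii.append(sub.max(axis=1).min())  # minimax radius of cluster g
    return max(radii)

# demo: two clusters on a line, absolute-difference distances
pts = np.array([0.0, 1.0, 2.0, 10.0, 11.0])
D = np.abs(pts[:, None] - pts[None, :])
labels = np.array([0, 0, 0, 1, 1])
print(max_minimax_radius(D, labels))  # → 1.0
```
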
Misclassification rate
The misclassification rate is defined as the proportion of misclassified examples out of all the examples.
In the clustering context, the misclassification rate is defined on pairs of items. Specifically, we consider each of the n(n − 1)/2 pairs, where n is the number of individual items, and the outcome of interest is whether the pair is predicted to be in the same cluster or not. A pair is misclassified if the clustering method predicts that the pair is in the same cluster when the true clustering says they are not, or vice versa.
A low misclassification rate typically indicates high accuracy (a good classifier). But, in cases with a large class imbalance (typically many non-matches and few matches) we need to be careful with using misclassification rates because simply classifying all items as non-matches produces a very low misclassification rate.
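The pairwise misclassification rate can be sketched as follows (illustrative code; the helper name and toy labels are ours):

```python
from itertools import combinations

def pairwise_misclassification(pred, truth):
    """Fraction of the n(n-1)/2 pairs whose same-cluster status disagrees
    between the predicted and the true clustering."""
    n = len(pred)
    wrong = sum(
        (pred[i] == pred[j]) != (truth[i] == truth[j])
        for i, j in combinations(range(n), 2)
    )
    return wrong / (n * (n - 1) / 2)

truth = [0, 0, 1, 1]
pred  = [0, 0, 0, 1]  # item 2 placed in the wrong cluster
print(pairwise_misclassification(pred, truth))  # → 0.5 (3 of 6 pairs wrong)
```
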
Precision and recall
To take into account class imbalance, we use the evaluation metrics precision and recall. A typical confusion matrix for pairs is below.
| ||Predicted match||Predicted non-match|
|True match||True positive (TP)||False negative (FN)|
|True non-match||False positive (FP)||True negative (TN)|
Precision = TP/(TP + FP) and recall = TP/(TP + FN). Neither precision nor recall includes the true negative cell in its calculation, and both therefore produce fairer estimates of accuracy on class-imbalanced data sets, which are common in clustering. The maximum value for precision and recall is 1, and a good classifier should have high precision and recall.
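Pairwise precision and recall can be computed from the same enumeration of pairs (again an illustrative sketch with hypothetical helper names; it assumes at least one predicted match and one true match, so the denominators are nonzero):

```python
from itertools import combinations

def pairwise_precision_recall(pred, truth):
    """Precision = TP/(TP + FP), recall = TP/(TP + FN), over all pairs."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred, same_true = pred[i] == pred[j], truth[i] == truth[j]
        tp += same_pred and same_true       # correctly predicted match
        fp += same_pred and not same_true   # predicted match, truly not
        fn += same_true and not same_pred   # true match, predicted not
    return tp / (tp + fp), tp / (tp + fn)

truth = [0, 0, 1, 1]
pred  = [0, 0, 0, 1]
prec, rec = pairwise_precision_recall(pred, truth)
print(prec, rec)  # → 0.3333... 0.5
```

Note that the true negative count never enters either formula, which is exactly why these metrics remain informative under class imbalance.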
All k, best k vs. true k
Again, define k as the number of clusters in the clustering. In (Bien and Tibshirani, 2011), evaluation for distance from prototype was conducted over all possible values of k (specifically, in a data set of n items, k = 1, …, n). Misclassification rate, however, was reported for the best k, meaning the k with the lowest misclassification rate over all k, and the true k, where the ground truth clustering is known.
In this paper we evaluate all metrics using all k, and also report the metrics for the true k. It is possible to derive measures for the best k, but due to the large number of data sets and evaluation metrics used, this became somewhat intractable; it was not pursued further but can be a subject of future work.
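Evaluating over all k amounts to cutting the dendrogram at every possible number of clusters. The sketch below uses SciPy; note that SciPy's `linkage` does not implement minimax linkage, so average linkage serves as a stand-in, and the two-blob toy data are our own:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree
from scipy.spatial.distance import pdist

# two well-separated Gaussian blobs (toy data for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# SciPy has no minimax method, so average linkage is a stand-in here
Z = linkage(pdist(X), method="average")

# cut the dendrogram at every possible k and evaluate metrics at each cut
for k in range(1, len(X) + 1):
    labels = cut_tree(Z, n_clusters=k).ravel()
    # ...compute maximum minimax radius, misclassification, etc. for this k...

print(len(np.unique(cut_tree(Z, n_clusters=2).ravel())))  # → 2
```
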
2.2. Data sets
In terms of the data sets considered, we use all of the data used in (Bien and Tibshirani, 2011) (except for Grolier Encyclopedia), and introduce additional data sets that exhibit a wider range of data attributes. These additional data sets were also included to ensure that those used in (Bien and Tibshirani, 2011) were not deliberately selected to produce desired results. The Grolier Encyclopedia data set does not include true clusters and was therefore not included in the current paper. Brief descriptions are as follows, and more details for many of the data sets can be found in (Bien and Tibshirani, 2011). A summary of the data is in Table 2.
Olivetti Faces This data contains 400 images of 64 × 64 pixels. There are 10 images each from 40 people. The pairwise distance measure used is L2 distance. Here we use the data from the RnavGraphImageData package in R.
Colon Cancer The Colon Cancer data set contains gene expression levels for 2000 genes for 62 patients, 40 with cancer and 22 healthy. The pairwise distance measure used is correlation. Here we use the data from the HiDimDA package in R.
Prostate Cancer The Prostate Cancer data contains gene expression levels for 6033 genes for 102 patients, 52 with cancer and 50 healthy. The pairwise distance measure used is correlation. There are multiple versions of the data available online and in R packages. The version we use is from https://stat.ethz.ch/~dettling/bagboost.html, and our results match the resulting plots produced in (Bien and Tibshirani, 2011).
Simulations We repeat the simulations done in (Bien and Tibshirani, 2011). These involve three sets of data: spherical, elliptical, and outliers. Each data set has 3 clusters of 100 points each in R^10. Both L1 and L2 distances are used as pairwise distance measures. In (Bien and Tibshirani, 2011) simulations were run 50 times each, but here we only ran each once. In future analyses it is possible to perform more runs.
Iris The iris data set (Anderson, 1936; Fisher, 1936) is pre-loaded in R and has been used extensively as an example data set in various applications, including clustering. It contains 50 flowers from each of 3 species. There are four features for each observation: sepal length and width, and petal length and width. Here we simply scale and center the features and use L2 distance as a pairwise distance measure.
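The iris preprocessing described above might look as follows (a sketch assuming scikit-learn's copy of the iris data and Euclidean distance; the paper itself uses R):

```python
import numpy as np
from sklearn.datasets import load_iris
from scipy.spatial.distance import pdist, squareform

X = load_iris().data                          # 150 flowers, 4 features
X = (X - X.mean(axis=0)) / X.std(axis=0)      # center and scale each feature
D = squareform(pdist(X, metric="euclidean"))  # pairwise L2 distances
print(D.shape)  # → (150, 150)
```

The resulting matrix D can then be fed to any of the linkage methods above.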
NBIDE and FBI S&W The National Institute of Standards and Technology (NIST) maintains the Ballistics Toolmark Research Database (https://tsapps.nist.gov/NRBTD), containing images of cartridge cases from test fires of various firearms. These are 3D topographies, meaning that surface depth is recorded at each pixel location. Each image is approximately 1200 × 1200 pixels. We use data from two different data sets, NIST Ballistics Imaging Database Evaluation (NBIDE) (Vorburger et al., 2007) and FBI Smith & Wesson. The former contains 12 images each from 12 different firearms, and the latter contains 2 images each from 69 different firearms. We have pre-processed and aligned these images using the R package cartridges3D (available at https://github.com/xhtai/cartridges3D), and extracted a correlation between each pair of images. The resulting pairwise comparison data are available in the clusterTruster package.
|Data set||Source||In (Bien and Tibshirani, 2011)?||Size||Notes|
|Olivetti Faces||(Roweis)||Yes||n = 400, p = 4096||image data of human faces|
|Colon Cancer||(Alon et al., 1999)||Yes||n = 62, p = 2000||high-dimensional data|
|Prostate Cancer||(Singh et al., 2002)||Yes||n = 102, p = 6033||high-dimensional data|
|Spherical||(Bien and Tibshirani, 2011)||Yes||n = 300, p = 10||L1 and L2 distance used|
|Elliptical||(Bien and Tibshirani, 2011)||Yes||n = 300, p = 10||L1 and L2 distance used|
|Outliers||(Bien and Tibshirani, 2011)||Yes||n = 300, p = 10||spherical shape with outliers; L1 and L2 distance used|
|Iris||(Anderson, 1936; Fisher, 1936)||No||n = 150, p = 4|| |
|NBIDE||(Vorburger et al., 2007)||No||n = 144, p = 144,000||image data of cartridge cases|
|FBI S&W|| ||No||n = 138, p = 144,000||large number of small clusters|
3. Evaluation Results
We report our evaluation results for all data sets and evaluation metrics, using all k and the true k. In Tables 4 and 5, we report the values for the maximum minimax radius, misclassification rate, precision, and recall for all linkage types (single, complete, average, centroid, and minimax linkage) for the case where k equals the true number of clusters for each data set. In Figures 4 through 14, we show the distribution of maximum minimax radius, misclassification, and precision-recall across all possible values of k.
3.1. Results for true k
It is important to understand how our evaluation metrics change for multiple values of k, especially because k is often unknown. That being said, it is common to know a plausible range of values, and therefore results in out-of-range regions may be irrelevant. We present the clustering results for the true value of k for each data set in Tables 4 and 5 in the Appendix. As an example, we reproduce the results for the Olivetti Faces data set in Table 3.
|Linkage type||Max minimax radius||Misclassification||Precision||Recall|
In Table 3, we find that hierarchical clustering with minimax linkage produces the smallest maximum minimax radius, indicating that the images within each cluster are close to the cluster’s prototype. This means that the prototype is a good representation of the cluster. Using a prototype is especially useful for interpreting cluster results in cases when averages of the data do not make practical sense (e.g., images, text). Minimax linkage does not produce the lowest misclassification rate, although the rate is comparable to both average and centroid linkage. Because average and centroid linkage result in uninterpretable cluster “representatives,” minimax linkage could holistically be considered the best performer. However, minimax linkage does not produce the highest precision and recall which we argue in Section 2 should also be reported when determining linkage quality.
In Tables 4 and 5, we examine maximum minimax radius (for the true k) and find that minimax linkage does not always produce the best results. In the Colon Cancer, Prostate Cancer, Spherical-L2, Spherical-L1, and Elliptical-L2 data sets, other linkage methods produce the smallest maximum minimax radius. We also note that the maximum minimax radius produced by minimax linkage is often (but not always) close to the radius of another linkage method. Bien and Tibshirani (Bien and Tibshirani, 2011) claim that “minimax linkage indeed does consistently better than the other methods in producing clusterings in which every point is close to a prototype,” which we do see across multiple values of k (Figures 4 through 14; more details in Section 3.2). However, when we look specifically at the true k case, which is more relevant in practice, our results dispute this claim.
In terms of the misclassification rate, minimax linkage performs well. In the Elliptical-L2, Spherical-L2, Colon Cancer, and Olivetti Faces datasets, other linkage methods produce smaller misclassification rates, but in each case the minimax linkage rate was close to (within 0.03 of) the best rate. This finding is consistent with the claims in (Bien and Tibshirani, 2011).
In terms of precision, minimax linkage performs worse than other linkage methods in the Olivetti Faces, Colon Cancer, Prostate Cancer, Elliptical-L2, and FBI S&W data sets. For recall, minimax linkage performs worse than other methods in all data sets except NBIDE.
In summary, we find that for the true k, minimax linkage does not consistently perform best in terms of smallest maximum minimax radius, highest precision, or highest recall. It does consistently perform well in terms of misclassification. One of the core claims in (Bien and Tibshirani, 2011) is that minimax linkage consistently performs best in terms of producing a low maximum minimax radius, but we find that this is not consistently the case in the true k scenario.
3.2. Results across all
In this section, we look at how the performance metrics change across different values of k, as opposed to the true value of k in Section 3.1. This could still be relevant in practice where k is unknown, or if we want an overall sense of the performance of the method across all possible clusterings.
The full results are in Figures 4 through 14 in the Appendix (Section 5). Here we again use the Olivetti Faces data set to illustrate the results. We found that for the true k (40), minimax linkage performed best in terms of maximum minimax radius, but not in terms of misclassification, precision, or recall. The results for all k are in Figures 1 through 3.
In Figure 1 we find that the line for minimax linkage (smooth dark red line) is consistently lower than the other linkage methods (single, complete, average, and centroid).
In Figure 2, we see that minimax linkage (smooth dark red line) consistently performs similarly to complete linkage (smooth dark purple line), and both methods do better than single and centroid linkage. Average linkage performs similarly to complete and minimax linkage for large values of k.
In Figure 3 we plot the precision versus recall graph. For each value of k we plot both precision and recall, and connect the values in order of increasing k. The area under the curve has maximum value 1, and we want this to be as large as possible. In the Olivetti Faces data set in Figure 3, we see that centroid linkage performs poorly for most values of k. Average linkage appears to perform best for most values of k, although minimax and single linkage also perform well for certain values of k.
As mentioned, the full results for all data sets are in Figures 4 through 14 in the Appendix (Section 5). Across all data sets, hierarchical clustering with minimax linkage performs best across most values of k in terms of lowest maximum minimax radius. No linkage type stands out as the best for misclassification performance across the different data sets. Similarly for precision and recall, there is no best performing linkage method.
4. Discussion and Conclusion
Bien and Tibshirani analyzed minimax linkage in 2011 (Bien and Tibshirani, 2011), and performed comparisons to standard linkage methods. We expand on this evaluation, taking into account guidelines recommended in “Benchmarking in cluster analysis: A white paper” (Mechelen et al., 2018): we justify choices of data sets and evaluation metrics, apply criteria across all data sets, and fully disclose data and code. Compared to (Bien and Tibshirani, 2011), we use additional data sets, include precision and recall as additional evaluation metrics, and use all metrics for all data sets. We evaluate on all possible numbers of clusters k, as well as the true number of clusters. We highlight results for the latter case, since metrics for this value of k are often more relevant than results across all values of k.
One of the main claims of (Bien and Tibshirani, 2011) is that minimax linkage consistently performs best in terms of producing results with a low maximum minimax radius, but we find that this is not consistently the case in the true k scenario. We do find (similarly to (Bien and Tibshirani, 2011)) that minimax linkage often produces the smallest maximum minimax radius (compared to other linkage methods) across all possible values of k. This means that overall, minimax linkage does produce clusters where objects in a cluster are tightly clustered around their prototype. Prototypes are a good representation of their cluster and have good interpretability. We find that minimax linkage performs well in terms of misclassification across all data sets, but it does not always produce high precision and recall (which we suggest should also be reported due to the common class imbalance problem in cluster analysis).
In our comprehensive analysis of minimax linkage we came to two main conclusions.
For true k: Minimax linkage does not consistently perform best in terms of smallest maximum minimax radius, highest precision, or highest recall. It does consistently perform well in terms of misclassification.
Across all k: Minimax linkage performs best across most values of k in terms of lowest maximum minimax radius. No linkage type stands out as the best for misclassification performance, precision, or recall across all data sets.
Future work will include an increased focus on simulations. The priority of this paper was evaluating performance on real clustering applications, and we included one run of the simulations that were done in (Bien and Tibshirani, 2011). More work will be done on this in the future, to properly quantify standard errors associated with the metrics. We also noted that it is possible to derive and report measures for the best k as opposed to the true k, but due to the large number of data sets and evaluation metrics used, this was not pursued and is left for future work. Another question of interest, but out of the scope of this paper, is analyzing cases where the true number of clusters is not known. For example, we might be interested in whether it is easier to discover the true number of clusters using minimax linkage as opposed to other methods; this is an interesting question that to our knowledge has not been explored. Finally, one issue that was evaluated briefly in (Bien and Tibshirani, 2011), but that we have not focused on, is computational complexity. This will be the subject of future work.
- Anderson (1936) Edgar Anderson. 1936. The Species Problem in Iris. Annals of the Missouri Botanical Garden 23, 3 (1936), 457–509. http://www.jstor.org/stable/2394164
- Bien and Tibshirani (2011) J. Bien and R. Tibshirani. 2011. Hierarchical Clustering With Prototypes via Minimax Linkage. J. Am. Stat. Assoc. 106 495 (2011), 1075–1084.
- Cheung et al. (2004) David Cheung, Ian Melhado, Kevin Yip, Michael Ng, Pak C. Sham, Pui-Yee Fong, and S. I. Ao. 2004. CLUSTAG: hierarchical clustering and graph methods for selecting tag SNPs. Bioinformatics 21, 8 (2004), 1735–1736. https://doi.org/10.1093/bioinformatics/bti201
- Fisher (1936) R. A. Fisher. 1936. The use of multiple measurements in taxonomic problems. Annals of Eugenics 7, 2 (1936), 179–188. https://doi.org/10.1111/j.1469-1809.1936.tb02137.x
- Mechelen et al. (2018) Iven Van Mechelen, Anne-Laure Boulesteix, Rainer Dangl, Nema Dean, Isabelle Guyon, Christian Hennig, Friedrich Leisch, and Douglas Steinley. 2018. Benchmarking in cluster analysis: A white paper. arXiv:arXiv:1809.10496
- Vorburger et al. (2007) T. Vorburger, J. Yen, B. Bachrach, T. Renegar, J. Filliben, L. Ma, H. Rhee, A. Zheng, J. Song, M. Riley, C. Foreman, and S. Ballou. 2007. Surface topography analysis for a feasibility assessment of a National Ballistics Imaging Database. Technical Report NISTIR 7362. National Institute of Standards and Technology, Gaithersburg, MD.
|Data set (k = truth)||Linkage type||Max minimax radius||Misclassification||Precision||Recall|
|Data set||Linkage type||Max minimax radius||Misclassification||Precision||Recall|