In a generic clustering problem we are given a collection of objects $X = \{x_1, \ldots, x_n\}$ and a function $d(x_i, x_j)$ measuring the dissimilarity between objects $x_i$ and $x_j$. The goal is to partition $X$ into subsets (“clusters”) such that observations in the same cluster are similar to each other and dissimilar from observations in other clusters.
1.1 Hierarchical Clustering
Clustering methods come in two varieties, flat and hierarchical. Flat methods require the user to provide a target number of clusters $k$ and will then generate a partition $\mathcal{P} = \{C_1, \ldots, C_k\}$ of $X$. Hierarchical methods differ from flat methods in that they generate a hierarchy of partitions $\mathcal{P}_1, \ldots, \mathcal{P}_n$. A sequence of partitions is called hierarchical if each cluster in $\mathcal{P}_{i+1}$ is the union of clusters in $\mathcal{P}_i$.
Hierarchical methods can be agglomerative or divisive. Hierarchical agglomerative clustering (HAC) methods generate partitions by iterative merging. Initially, every object forms a cluster. Then we repeatedly merge the two clusters with the minimum distance (dissimilarity) until only one cluster is left. Agglomerative methods differ in the definition of the distance $D(C_i, C_j)$ between clusters. We will focus on single linkage clustering, where $D(C_i, C_j)$ is defined as the minimum distance between an object in $C_i$ and an object in $C_j$, i.e. $D(C_i, C_j) = \min d(x, y)$ with $x \in C_i$ and $y \in C_j$.
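As a minimal sketch of the merging procedure described above (not an efficient implementation), the following Python snippet computes the single linkage distance between two clusters and runs the naive agglomeration loop. The function names `single_linkage_distance` and `naive_single_linkage` and the one-dimensional sample data are illustrative assumptions, not part of the original text.

```python
from itertools import product

def single_linkage_distance(cluster_a, cluster_b, d):
    """Minimum dissimilarity between an object in cluster_a and one in cluster_b."""
    return min(d(x, y) for x, y in product(cluster_a, cluster_b))

def naive_single_linkage(objects, d):
    """Merge the two closest clusters until one is left; return the merge distances."""
    clusters = [[obj] for obj in objects]   # initially every object forms a cluster
    merge_distances = []
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: single_linkage_distance(clusters[ij[0]], clusters[ij[1]], d))
        merge_distances.append(single_linkage_distance(clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]   # j > i, so deleting j keeps i valid
        del clusters[j]
    return merge_distances

# Hypothetical one-dimensional data with absolute difference as dissimilarity.
abs_diff = lambda x, y: abs(x - y)
gap = single_linkage_distance([0.0, 1.0], [4.0, 9.0], abs_diff)
heights = naive_single_linkage([0.0, 1.0, 4.0, 9.0], abs_diff)
```

The loop recomputes cluster distances from scratch in every iteration, which is only meant to mirror the definition; it is far too slow for real data.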
The result of the merge process can be represented as a binary tree with $n$ leaves. Each node of the tree represents a subset of the observations, called the node members. Each leaf represents an individual observation. Each internal node represents the union of the members of its daughter nodes and is associated with a merge distance, the distance between the two clusters being merged.
A layout of this tree where the root is at the top, the leaves are at the bottom, and the vertical coordinate of an interior node is the merge distance, is called a dendrogram.
Figure 1 shows a dataset (a), the binary tree generated by single linkage clustering (b), the corresponding single linkage dendrogram (c), and the partition of the data into two clusters (d).
1.3 Extracting Partitions
Any subtree of a dendrogram defines a partition of $X$; the members of the leaves are the clusters. The most commonly used pruning method is dendrogram cutting: choose a distance threshold $t$ and eliminate all nodes with merge distance less than $t$. Figure 1(d) shows the partition of our sample data set obtained by dendrogram cutting at one such threshold. There are alternative pruning methods; see for example Stuetzle (2003).
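For single linkage, dendrogram cutting can be illustrated without building the tree at all. The sketch below assumes that cutting at threshold `t` groups together exactly those objects that are linked by a chain of dissimilarities smaller than `t`. The function name `cut_clusters` and the sample data are hypothetical, and the pairwise loop is only suitable for small inputs.

```python
def cut_clusters(objects, d, t):
    """Group objects that are connected by chains of links shorter than t."""
    parent = list(range(len(objects)))      # union-find forest over object indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(objects)):
        for j in range(i + 1, len(objects)):
            if d(objects[i], objects[j]) < t:
                parent[find(i)] = find(j)   # merge the two groups

    groups = {}
    for i, obj in enumerate(objects):
        groups.setdefault(find(i), []).append(obj)
    return sorted(groups.values())

# Hypothetical one-dimensional data with absolute difference as dissimilarity.
abs_diff = lambda x, y: abs(x - y)
data = [0.0, 1.0, 4.0, 9.0]
two_clusters = cut_clusters(data, abs_diff, 4.0)
three_clusters = cut_clusters(data, abs_diff, 2.0)
```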
1.4 Computing the Single Linkage Dendrogram
The single linkage dendrogram could in principle be computed using the iterative merging algorithm sketched in Section 1.1. In practice, however, this is not an attractive option because it requires storing the interpoint distance matrix. An alternative was suggested by Gower and Ross (1969). They proposed to first compute the minimal spanning tree (MST) $T$ of the data and then obtain the single linkage dendrogram by recursive partitioning: break the longest edge of $T$, thereby splitting $T$ into two subtrees $T_1$ and $T_2$, and then apply the splitting operation recursively to the two subtrees. The key advantage of this approach is that the MST can be computed using Prim’s algorithm (Prim, 1957) without ever storing the interpoint distance matrix. Prim’s algorithm produces a list of MST edges. The remaining problem is to determine the edges of $T_1$ and $T_2$, and thereby the node members of the corresponding dendrogram nodes. We propose a simple and efficient method of identifying the vertices (and edges) of $T_1$ and $T_2$ using information generated by Prim’s algorithm.
2 Prim’s Algorithm and Prim’s Order
2.1 Prim’s Algorithm
Prim’s algorithm finds a minimal spanning tree of a weighted connected graph $G = (V, E)$. (In the application of the MST to single linkage clustering, $G$ is typically the complete graph over some set of points in Euclidean space and the edge weights are the Euclidean distances.) The algorithm starts a tree fragment by choosing an arbitrary seed vertex and then progressively connects the out-vertices (vertices that have not yet been connected) to the fragment. Below is an outline of the algorithm.
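Initialization: Choose an arbitrary vertex $v_0 \in V$ and set $V_T = \{v_0\}$, $E_T = \emptyset$.
Find an edge $(u, v)$ such that $u \in V_T$, $v \notin V_T$, and the weight $w(u, v)$ is minimized.
Add $v$ into $V_T$, $(u, v)$ into $E_T$. Repeat from the previous step until $V_T = V$.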
The MST is unique if all edges have distinct weights.
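The outline above can be sketched in Python for points in Euclidean space. This is an illustrative variant, not the authors’ code: it computes distances on demand and keeps only arrays of length $n$, so the distance matrix is never stored. The names `prim`, `best`, and `link` and the sample points are assumptions made for the example.

```python
import math

def prim(points, dist=math.dist, seed=0):
    """Return (order, edges). order[k] is the index of the k-th vertex added to the
    fragment; edges[k - 1] = (u, v, w) is the MST edge that attached order[k]."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n    # cheapest known link from the fragment to each out-vertex
    link = [-1] * n          # fragment endpoint of that cheapest link
    order, edges = [], []
    current = seed
    for _ in range(n):
        in_tree[current] = True
        order.append(current)
        nxt = -1
        for v in range(n):
            if in_tree[v]:
                continue
            w = dist(points[current], points[v])   # distance computed as needed
            if w < best[v]:
                best[v], link[v] = w, current
            if nxt == -1 or best[v] < best[nxt]:
                nxt = v                            # closest out-vertex so far
        if nxt != -1:
            edges.append((link[nxt], nxt, best[nxt]))
            current = nxt
    return order, edges

# Hypothetical sample points.
pts = [(0, 0), (1, 0), (5, 0), (6, 0), (6, 1)]
order, edges = prim(pts, seed=0)
order_from_4, edges_from_4 = prim(pts, seed=4)
```

Each of the $n$ iterations scans all out-vertices once, so the sketch runs in $O(n^2)$ time with $O(n)$ memory. Among tied candidates it picks the one with the lowest index, which is an arbitrary choice.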
2.2 Prim’s Order
When we apply Prim’s algorithm to $G$, each iteration adds one vertex to the fragment. This defines an order for the vertices which we call Prim’s order (of $G$). Let $\pi(v)$ denote the position of $v$ in Prim’s order. The order depends on the choice of the seed vertex $v_0$. The seed vertex is arbitrary unless otherwise noted. By default, we define $\pi(v_0) = 1$. Prim’s order of vertices also induces an order of the MST edges: for an MST edge $e = (u, v)$, define $\pi(e) = \max(\pi(u), \pi(v))$.
If there are no tied edge weights in $G$, then the MST and Prim’s order of $G$ are unique. If there are tied edge weights, then there might be more than one MST as well as more than one Prim’s order for a given seed vertex.
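To make the two definitions concrete, here is a small sketch that assigns 1-based positions to vertices and takes the Prim’s order of an MST edge to be the larger of the positions of its two endpoints. The vertex labels and the helper names `prim_positions` and `edge_order` are hypothetical.

```python
def prim_positions(order):
    """Map each vertex to its 1-based position in Prim's order (the seed gets 1)."""
    return {v: k for k, v in enumerate(order, start=1)}

def edge_order(edge, pi):
    """Prim's order of an MST edge: the position of its later-added endpoint."""
    u, v = edge
    return max(pi[u], pi[v])

# Hypothetical run: seed "c", then "a", "d", "b" are attached in that sequence.
order = ["c", "a", "d", "b"]
mst_edges = [("c", "a"), ("a", "d"), ("c", "b")]   # listed in the sequence they were added
pi = prim_positions(order)
edge_orders = [edge_order(e, pi) for e in mst_edges]
```

Because every MST edge attaches exactly one new vertex, the $k$-th edge added by the algorithm has Prim’s order $k + 1$.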
3 Finding Connected Components after Breaking the Longest MST Edge
Based on Prim’s algorithm and Prim’s order, we will introduce a method which can efficiently find the two connected components obtained by breaking the longest MST edge. For the sake of simplicity, let us assume for the moment that the graph $G$ has no tied edge weights. We will treat the general case in Section 4.
Consider Figure 2. Panels (a) and (c) show the sample data from Figure 1 and the MST. The numbers next to the vertices indicate their Prim’s orders. The orders in panels (a) and (c) are different because the seed vertices are different. The longest edge in (a) (marked in blue) connects vertices with Prim’s order 1 and 4 while the same longest edge in (c) connects vertices with Prim’s order 5 and 3.
When we break the longest edge, we obtain two subtrees which are shown in panels (b) and (d). We notice that the vertices in the two subtrees share a common characteristic. In (b), the vertices in one subtree all have Prim’s order less than 4 and the vertices in the other subtree all have Prim’s order greater than or equal to 4. Similarly, in (d), the vertices in one subtree all have Prim’s order less than 5 and the vertices in the other subtree all have Prim’s order greater than or equal to 5. This suggests that the vertex sets and edge sets of the two subtrees can be determined simply from their Prim’s orders. Notice that 4 and 5 are the Prim’s orders of the longest edge in the respective cases. This leads to the following proposition.
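Proposition 1.
Let $G$ be a connected edge-weighted graph with distinct edge weights. Applying Prim’s algorithm to $G$ will result in a unique MST $T$ with edge set $E_T$ and a unique Prim’s order $\pi$ (for some arbitrary seed vertex). Let $e^*$ be the longest MST edge. Breaking $e^*$ splits $T$ into two subtrees $T_1$ and $T_2$ with vertex sets $V_1, V_2$ and edge sets $E_1, E_2$. Then, labeling the subtrees so that $T_1$ contains the seed vertex,
$$V_1 = \{v \in V : \pi(v) < \pi(e^*)\}, \qquad V_2 = \{v \in V : \pi(v) \ge \pi(e^*)\},$$
$$E_1 = \{e \in E_T : \pi(e) < \pi(e^*)\}, \qquad E_2 = \{e \in E_T : \pi(e) > \pi(e^*)\}.$$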
To prove the proposition, we use the following lemma.
Lemma 1.
Let $G$ be a connected edge-weighted graph with distinct edge weights. Let $T$ be the minimal spanning tree of $G$. If Prim’s algorithm is applied to $G$ and $T$ with the same seed vertex, then the Prim’s orders of $G$ and $T$ are the same.
Proof.
Let $E_G$ be the set of edges in $G$ and $E_T$ be the set of edges in $T$. Applying Prim’s algorithm to $G$ will result in the MST $T$ and the unique Prim’s order $\pi$. Let $e$ be an edge in $E_G \setminus E_T$. Since $G$ has distinct edge weights, after removing $e$ from $G$, $T$ will still be the MST of $G$ and $\pi$ will still be the Prim’s order of $G$. Repeatedly remove edges from $E_G \setminus E_T$ until $E_G = E_T$. This shows that $G$ and $T$ have the same Prim’s order. ∎
Proof of Proposition 1.
As shown in Lemma 1, it is sufficient to prove Proposition 1 for the MST $T$ rather than the original graph $G$. First, apply Prim’s algorithm to $T$. Without loss of generality, suppose the seed vertex is in $T_1$. Since all edges in $T_1$ are shorter than the longest edge $e^*$ and the only edge between $T_1$ and $T_2$ is $e^*$, all edges in $T_1$ must be joined to the fragment before $e^*$. Therefore, all remaining edges, namely those in $T_2$, must be joined after $e^*$. This also implies that all vertices in $V_1$ must be connected before any vertices in $V_2$. ∎
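The statement of Proposition 1 can be checked on a small example. The sketch below assumes distinct edge weights and assumes that Prim’s algorithm returned the vertices in Prim’s order together with the MST edges in the sequence they were added. It splits both lists at the longest edge and compares the result with an ordinary graph search. The tree, the weights, and the function names are made up for illustration.

```python
def split_at_longest_edge(order, edges):
    """order: vertices in Prim's order. edges[k] = (u, v, weight) is the MST edge
    that attached order[k + 1], so its 1-based Prim's order is k + 2.
    Assumes distinct edge weights."""
    k = max(range(len(edges)), key=lambda i: edges[i][2])   # index of the longest edge
    cut = k + 1                  # 0-based position of the first vertex beyond that edge
    return (order[:cut], edges[:k]), (order[cut:], edges[k + 1:])

def reachable(start, edges):
    """Vertices connected to start by the given edges (plain graph search, for checking)."""
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for u, v, _ in edges:
            for a, b in ((u, v), (v, u)):
                if a == x and b not in seen:
                    seen.add(b)
                    stack.append(b)
    return seen

# Hypothetical Prim output on a five-vertex tree with seed vertex 0.
order = [0, 1, 2, 3, 4]
edges = [(0, 1, 1.0), (0, 2, 1.5), (1, 3, 4.0), (3, 4, 2.0)]
(v1, e1), (v2, e2) = split_at_longest_edge(order, edges)
```

Here the longest edge `(1, 3, 4.0)` has Prim’s order 4, so the first three vertices in Prim’s order form one subtree and the remaining two form the other. No graph search is needed to find the split; `reachable` is used only to confirm it.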
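While we have proved that Proposition 1 holds for breaking the longest edge once, it remains to be shown that the method for finding $T_1$ and $T_2$ stated in Proposition 1 can be applied recursively to $T_1$ and $T_2$. Note that $T_1$ and $T_2$ are themselves MSTs for their respective vertex sets. Let $\pi$ denote Prim’s order of $T$ with seed vertex $v_0 \in V_1$. Let $\pi_1$ be Prim’s order of $T_1$ with seed vertex $v_0$ and let $\pi_2$ be Prim’s order of $T_2$ with seed vertex $v^*$, the endpoint of the broken edge $e^*$ that lies in $V_2$. Then
$$\pi_1(v) = \pi(v) \ \text{for } v \in V_1, \qquad \pi_2(v) = \pi(v) - |V_1| \ \text{for } v \in V_2.$$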
In other words, on their respective subtrees, $\pi_1$ and $\pi_2$ are “equivalent” to $\pi$. This implies that Proposition 1 can be applied recursively until every vertex is an isolated vertex. Hence, we only need to compute the MST for $G$ and then use its Prim’s order to identify the connected components after every split. This allows us to construct the single linkage dendrogram efficiently. Instead of storing all the members of each node, it is sufficient to store the range of their Prim’s orders (Figure 4).
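The recursive splitting with ranges of Prim’s orders can be sketched as follows. The sketch assumes that the MST edge weights are already listed in the sequence in which Prim’s algorithm added the edges, and it uses 0-based positions, so `weights[k]` belongs to the edge that attached the vertex at position `k + 1`. The tuple layout of a node and the merge distance `0.0` for leaves are illustrative choices. Ties are broken in favour of the largest Prim’s order, which is the rule discussed in Section 4.

```python
def build_dendrogram(weights, lo=0, hi=None):
    """Return a node (lo, hi, merge_distance, left, right) covering the vertices at
    0-based Prim positions lo..hi inclusive. Leaves have lo == hi and no children."""
    if hi is None:
        hi = len(weights)            # n - 1, the last Prim position
    if lo == hi:
        return (lo, hi, 0.0, None, None)
    # Edges inside the range attach positions lo + 1 .. hi, i.e. indices lo .. hi - 1.
    k = max(range(lo, hi), key=lambda i: (weights[i], i))
    # Breaking that edge separates positions lo..k from positions k + 1..hi.
    return (lo, hi, weights[k],
            build_dendrogram(weights, lo, k),
            build_dendrogram(weights, k + 1, hi))

def members(node, order):
    """Recover the members of a node from its stored range of Prim positions."""
    lo, hi = node[0], node[1]
    return order[lo:hi + 1]

# Hypothetical Prim output: vertex labels in Prim's order and the attaching edge weights.
order = ["p", "q", "r", "s", "t"]
weights = [1.0, 4.0, 1.0, 1.0]
root = build_dendrogram(weights)
left, right = root[3], root[4]
```

Each node stores only two integers for its membership. The recursion depth can reach $n$ for chain-like trees, so a production version would likely use an explicit stack.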
4 The Case of Tied Edge Weights
Previously we have assumed that there are no tied edge weights in the graph and therefore the MST and Prim’s orders of $G$ (and $T$) are unique. We will now remove this restriction and show that Proposition 1 still holds, with one alteration.
Figure 5 illustrates the problem. Suppose $T$ has two longest edges $e_1$ and $e_2$. Breaking both of them would divide $T$ into three subtrees $T_1$, $T_2$, and $T_3$, with $e_1$ joining $T_1$ to $T_2$ and $e_2$ joining $T_1$ to $T_3$. Let $n_1, n_2, n_3$ be the numbers of vertices in the three subtrees. The fact that there are two longest edges causes ambiguities in tree growing and tree cutting.
Suppose we started Prim’s algorithm from a seed vertex in $T_1$. Eventually we would have to decide whether to add $e_1$ or $e_2$ next, which would result in different Prim’s orders. Let’s assume we picked $e_1$ first. Then Prim’s order would be as in Figure 5.
Now consider the process of tree cutting. Since edges $e_1$ and $e_2$ have the same length, we need to decide which one to break first. If we broke $e_1$ first, the two connected components would be $T_2$ and $T_1 \cup T_3$. If Proposition 1 were true, one of the components should have vertices with Prim’s order less than $\pi(e_1)$, and the other one should have vertices with order greater than or equal to $\pi(e_1)$. However, this is not the case since the Prim’s orders of $T_2$ are $\{n_1 + 1, \ldots, n_1 + n_2\}$ and the Prim’s orders of $T_1 \cup T_3$ are $\{1, \ldots, n_1\} \cup \{n_1 + n_2 + 1, \ldots, n_1 + n_2 + n_3\}$. If, on the other hand, we broke $e_2$ first, then the two connected components would be $T_1 \cup T_2$ and $T_3$, and the corresponding Prim’s orders of the vertices would be $\{1, \ldots, n_1 + n_2\}$ and $\{n_1 + n_2 + 1, \ldots, n_1 + n_2 + n_3\}$. Proposition 1 holds in this case. Notice that $\pi(e_2) > \pi(e_1)$. This suggests the following proposition:
Proposition 2 (Generalized version of Proposition 1).
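Let $G$ be a connected edge-weighted graph. Applying Prim’s algorithm to $G$ will result in an MST $T$ with edge set $E_T$ and a Prim’s order $\pi$. Let $E^*$ be the set of edges of $T$ with tied longest edge weight. Break the edge $e^* \in E^*$ with the largest Prim’s order, thereby creating subtrees $T_1$ and $T_2$ with vertex sets $V_1, V_2$ and edge sets $E_1, E_2$. Then, labeling the subtrees so that $T_1$ contains the seed vertex,
$$V_1 = \{v \in V : \pi(v) < \pi(e^*)\}, \qquad V_2 = \{v \in V : \pi(v) \ge \pi(e^*)\},$$
$$E_1 = \{e \in E_T : \pi(e) < \pi(e^*)\}, \qquad E_2 = \{e \in E_T : \pi(e) > \pi(e^*)\}.$$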
We first prove a generalized version of Lemma 1.
Lemma 2 (Generalized version of Lemma 1).
Let $G$ be a connected edge-weighted graph and $T$ be a minimal spanning tree of $G$. If Prim’s algorithm is applied to $G$ and $T$ with the same seed vertex, then every Prim’s order of $T$ is a Prim’s order of $G$ and every Prim’s order of $G$ is a Prim’s order of $T$.
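Proof.
Let $E_G$ be the set of edges in $G$ and $E_T$ be the set of edges in $T$. The proof has two directions.
Applying Prim’s algorithm to $G$ will result in an MST $T$ and a Prim’s order $\pi$ of $G$. Let $e$ be an edge in $E_G \setminus E_T$. After removing $e$ from $G$, $T$ will still be an MST of $G$ and $\pi$ will still be a Prim’s order of $G$. Repeatedly remove edges from $E_G \setminus E_T$ until $E_G = E_T$. Then $\pi$ is also a Prim’s order of $T$.
Applying Prim’s algorithm to $T$ will define a Prim’s order $\pi$ of $T$. Let $e$ be an edge in $E_G \setminus E_T$. After adding $e$ to $T$ to form a graph $T'$, $T$ will still be an MST of $T'$ and $\pi$ will still be a Prim’s order of $T'$. Repeatedly add edges from $E_G \setminus E_T$ until the edge set equals $E_G$. Then $\pi$ is also a Prim’s order of $G$. ∎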
Proof of Proposition 2.
Based on Lemma 2, we know any Prim’s order of $T$ is a Prim’s order of $G$ and any Prim’s order of $G$ is a Prim’s order of $T$. Therefore, it suffices to prove Proposition 2 for $T$ rather than the original graph $G$. Without loss of generality, the seed vertex is in $T_1$. Let $E^*$ be the set of tied longest edges and $e^*$ the one that is broken. We claim that any $e \in E^*$ such that $e \neq e^*$ must be in $E_1$. Suppose $e \in E_2$; then $e$ must have a larger Prim’s order than $e^*$, since we must come through $e^*$ before joining any edges in $E_2$. However, by hypothesis $\pi(e^*) > \pi(e)$. This is a contradiction, so $e$ must be in $E_1$. For the other edges in $E_1$, since $e^*$ is a longest edge, $e^*$ won’t be chosen until all the edges in $E_1$ have been chosen. Therefore, the same conclusion is drawn as in Proposition 1. ∎
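The role of the tie-breaking rule can be seen on the smallest possible example, sketched below under the assumption of a three-vertex star whose two edges have the same weight. The helper names and the data are hypothetical. The check breaks each tied edge in turn and tests whether the component containing the seed is a prefix of Prim’s order.

```python
def side_of(start, edges):
    """Vertices reachable from start using the given (u, v, weight) tree edges."""
    reach, frontier = {start}, [start]
    while frontier:
        x = frontier.pop()
        for u, v, _ in edges:
            for a, b in ((u, v), (v, u)):
                if a == x and b not in reach:
                    reach.add(b)
                    frontier.append(b)
    return reach

def splits_prim_order(order, edges, k):
    """True if breaking edges[k] separates order into a prefix and a suffix."""
    rest = edges[:k] + edges[k + 1:]
    seed_side = side_of(order[0], rest)
    return seed_side == set(order[:len(seed_side)])

# Hypothetical star: seed "a", then "b" and "c" attached by two edges of equal weight.
order = ["a", "b", "c"]
edges = [("a", "b", 5.0), ("a", "c", 5.0)]      # edges[k] has Prim's order k + 2
longest = max(w for _, _, w in edges)
tied = [k for k, e in enumerate(edges) if e[2] == longest]
results = [splits_prim_order(order, edges, k) for k in tied]
```

Breaking the tied edge with the smaller Prim’s order leaves the seed component `{"a", "c"}`, which is not a prefix of the order, while breaking the last tied edge (`tied[-1]`) yields the prefix `{"a", "b"}`.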
We address the problem of computing the single linkage dendrogram. A possible approach is to:
Form an edge-weighted graph $G$ over the data, with edge weights reflecting dissimilarities.
Calculate the MST $T$ of $G$.
Break the longest edge of $T$, thereby splitting it into subtrees $T_1$, $T_2$.
Apply the splitting process recursively to the subtrees.
This approach has the attractive feature that Prim’s algorithm for MST construction calculates distances as needed, and hence there is no need to ever store the inter-point distance matrix.
The recursive partitioning algorithm requires us to determine the vertices (and edges) of $T_1$ and $T_2$. We have shown how this can be done easily and efficiently using Prim’s order generated by Prim’s algorithm without any additional computational cost.
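One consequence of applying the splitting rule recursively is that a flat partition can be read directly off the Prim output. The sketch below is an illustration rather than part of the original method. It assumes that the MST edge weights are listed in the sequence in which Prim’s algorithm added the edges, and that cutting at threshold `t` breaks every MST edge of weight at least `t`. Under these assumptions each cluster is a run of consecutive vertices in Prim’s order. The function name and the data are hypothetical.

```python
def cut_by_prim_order(order, weights, t):
    """order: vertices in Prim's order; weights[k]: weight of the MST edge that
    attached order[k + 1]. Start a new cluster at every edge of weight >= t."""
    clusters, current = [], [order[0]]
    for vertex, w in zip(order[1:], weights):
        if w >= t:
            clusters.append(current)    # the attaching edge is broken: close the run
            current = [vertex]
        else:
            current.append(vertex)      # the attaching edge survives: extend the run
    clusters.append(current)
    return clusters

# Hypothetical Prim output for five points.
order = [0, 1, 2, 3, 4]
weights = [1.0, 4.0, 1.0, 1.0]
two = cut_by_prim_order(order, weights, 2.0)
singletons = cut_by_prim_order(order, weights, 0.5)
one = cut_by_prim_order(order, weights, 10.0)
```

The pass is linear in the number of vertices and needs no information beyond what Prim’s algorithm already produced.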
References
- Gower, J. and Ross, G. (1969). Minimum spanning trees and single linkage cluster analysis. Applied Statistics, 18, 54-64.
- Prim, R. (1957). Shortest connection networks and some generalizations. Bell System Technical Journal, 36, 1389-1401.
- Stuetzle, W. (2003). Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. Journal of Classification, 20(5), 25-47.