Hierarchical clustering based on pairwise similarities arises routinely in a wide variety of engineering and scientific problems. These problems include inferring gene behavior from microarray data , Internet topology discovery , detecting community structure in social networks , advertising , and database management [5, 6]. Often there is a significant cost associated with obtaining each similarity value. This cost can range from computation time to calculate each pairwise similarity (e.g., phylogenetic tree reconstruction using amino acid sequences ) to measurement load on the system under consideration (e.g., Internet topology discovery using tomographic probes ). In addition, situations where the similarities require an expert human to perform the comparisons result in a significant cost in terms of the time and patience of the user (e.g., human perception experiments in ).
, but many proposed techniques are heuristic in nature and do not provide any theoretical guarantees. Derived bounds are available in , which robustly finds clusters down to . The main limitation of this approach is that it is applicable only to problems where the user has control over the specific pairs of items to query and where the pairwise similarities are acquired in an online fashion (i.e., one at a time, using past information to inform future samples). In many situations, either this control is not available to the user, or a subset of similarities is acquired in a batch setting (e.g., recommender systems problems ). This motivates resolving the hierarchical clustering given a selected number of pairwise similarities observed at-random, where adaptive control is not available. Specifically, we look to answer the following question: How many similarities observed at-random are required to reconstruct the hierarchical clustering? The work in  developed a novel clustering technique to approximately resolve clusters down to size
from random observations; in contrast, we look to reconstruct the clustering hierarchy (to some pruning) exactly with high probability and also examine discovering clusters of size significantly less than.
While results in  indicate that resolving the entire hierarchical clustering using a sampling at-random regime requires effectively all the pairwise similarities, we find that a significant fraction of the clustering hierarchy can be resolved accurately. Specifically, we resolve the similarity sampling rate required given a desired level of clustering resolution. The only restriction we will place on the observed pairwise similarities is that they satisfy the Tight Clustering (TC) condition, which states that intracluster similarity values are greater than intercluster similarity values. This sufficient condition is required for any minimum-linkage clustering procedure, and can commonly be found underlying branching processes where the similarity between items is a monotonically increasing function of the distance from the branching root to their nearest common branch point (such as clustering resources in the Internet ), or when similarity is defined by density-based distance metrics .
Our results show that the hierarchy down to clusters sized a constant fraction of the total number of items (i.e., ) can be resolved with only pairwise similarities observed at-random on average. To find smaller clusters of size , where , we derive bounds which show that only similarities are required in expectation.
2 Hierarchical Clustering and Notation
Let be a collection of items which has an underlying hierarchical clustering denoted as .
A cluster is defined as any subset of . A collection of clusters is called a hierarchical clustering if and for any , only one of the following is true (i) , (ii) , (iii) .
Without loss of generality, we will consider as a complete binary tree, with leaf nodes and where, for every that is not a leaf of the tree, there exist proper subsets and of , such that , and .
Our measurements will be from the collection of all pairwise similarities between the items in , with denoting the similarity between and and assuming . The similarities must conform to the hierarchy of through the following sufficient condition.
The triple satisfies the Tight Clustering (TC) Condition if for every set of three items such that and , for some , the pairwise similarities satisfy, .
In words, the TC condition implies that the similarity between all pairs within a cluster is greater than the similarity with respect to any item outside the cluster. Under the TC condition, the tree found by agglomerative clustering will match the true clustering hierarchy, . Minimum-linkage agglomerative clustering  is a recursive process that begins with singleton clusters (i.e., the individual items to be clustered). At each step of the algorithm, the pair of most similar clusters, i.e., the pair associated with the largest observed pairwise similarity, is merged. The process is repeated until all items are merged into a single cluster. The main drawback of this technique is that it requires knowledge of all pairwise similarity values (i.e., all values must be known to find the maximum). This methodology is therefore infeasible for problems where is large, or where there is a significant cost to obtaining each similarity.
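The in-words statement of the TC condition can be checked mechanically on a small example. The sketch below is illustrative only: the function name `satisfies_tc`, the list-of-lists similarity matrix, and the representation of the hierarchy as a list of index sets are assumptions made here, not notation from the paper. It tests, for every cluster, that each within-cluster similarity exceeds the similarity of either item to any item outside the cluster.

```python
from itertools import combinations


def satisfies_tc(sim, clusters):
    """Return True if every within-cluster similarity beats every outside one.

    sim      -- symmetric n x n list of lists; sim[i][j] is the similarity of i and j
    clusters -- iterable of sets of item indices (hypothetical encoding of the hierarchy)
    """
    n = len(sim)
    for cluster in clusters:
        outside = [k for k in range(n) if k not in cluster]
        for i, j in combinations(sorted(cluster), 2):
            for k in outside:
                # The pair (i, j) inside the cluster must be strictly more
                # similar than either item is to the outside item k.
                if sim[i][j] <= max(sim[i][k], sim[j][k]):
                    return False
    return True


# Four items with the hierarchy {0,1}, {2,3}, {0,1,2,3}.
hierarchy = [{0, 1}, {2, 3}, {0, 1, 2, 3}]
good = [[0, 9, 1, 2],
        [9, 0, 2, 1],
        [1, 2, 0, 8],
        [2, 1, 8, 0]]
bad = [row[:] for row in good]
bad[0][2] = bad[2][0] = 10  # a cross-cluster pair now beats the within-cluster pairs

print(satisfies_tc(good, hierarchy), satisfies_tc(bad, hierarchy))
```

The check is cubic in the number of items per cluster, which is fine for a sanity test but not meant for large inputs.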
To reduce the measurement cost, we consider an incomplete observation of pairwise similarities. For our specific model, we define the indicator matrix of similarity observations, , such that if the pairwise similarity has been observed and if the pairwise similarity is not observed (i.e., unknown). The pairwise similarities are observed uniformly at-random, defining the similarity observation matrix as,
For some probability, .
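As a minimal sketch of this observation model, the code below draws a symmetric 0/1 indicator matrix in which each unordered pair is observed independently with a fixed probability. The names `sample_observation_mask` and `expected_observations`, the zero diagonal, and the symmetry of the mask are assumptions made for illustration.

```python
import random


def sample_observation_mask(n, p, seed=None):
    """Draw a symmetric n x n 0/1 mask; each pair i < j is observed with probability p."""
    rng = random.Random(seed)
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                # Assumption: observing (i, j) also reveals (j, i).
                mask[i][j] = mask[j][i] = 1
    return mask


def expected_observations(n, p):
    """Expected number of observed pairs: p times the n(n-1)/2 unordered pairs."""
    return p * n * (n - 1) / 2


mask = sample_observation_mask(6, 0.5, seed=1)
observed = sum(mask[i][j] for i in range(6) for j in range(i + 1, 6))
print(observed, expected_observations(6, 0.5))
```

The second function makes explicit how a sampling probability translates into an average number of measured similarities.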
To reconstruct the clustering from incomplete measurements, this paper focuses on a slightly modified version of minimum-linkage agglomerative clustering. This process can be considered off-the-shelf agglomerative clustering where the pairwise similarities not observed are simply ignored (i.e., unobserved similarities are zero-filled). The methodology is described in Algorithm 1.
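Algorithm 1 is not reproduced in this text, so the following is only a sketch of one way the described procedure could look, not the authors' listing. It assumes a symmetric similarity matrix and a 0/1 observation mask, both as lists of lists, and the name `cluster_observed` is hypothetical. Observed pairs are visited from most to least similar, and the two clusters containing a pair are merged whenever they differ. Unobserved pairs are skipped rather than zero-filled, so the sketch stops when the observed pairs run out.

```python
def cluster_observed(sim, mask):
    """Agglomerate using observed similarities only; return merged clusters in order."""
    n = len(sim)
    parent = list(range(n))
    members = {i: frozenset([i]) for i in range(n)}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Keep only the pairs whose similarity was actually observed.
    observed = [(sim[i][j], i, j)
                for i in range(n) for j in range(i + 1, n) if mask[i][j]]
    observed.sort(reverse=True)  # largest similarity first

    merges = []
    for _, i, j in observed:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # both items already share a cluster
        parent[rj] = ri
        members[ri] = members[ri] | members.pop(rj)
        merges.append(members[ri])
    return merges


sim = [[0, 9, 1, 2],
       [9, 0, 2, 1],
       [1, 2, 0, 8],
       [2, 1, 8, 0]]
full_mask = [[int(i != j) for j in range(4)] for i in range(4)]
partial_mask = [row[:] for row in full_mask]
partial_mask[0][1] = partial_mask[1][0] = 0  # hide the similarity between items 0 and 1

print(cluster_observed(sim, full_mask))
print(cluster_observed(sim, partial_mask))
```

With every similarity observed, the cluster {0, 1} appears among the merges. Once the single similarity between items 0 and 1 is hidden, those two items are absorbed one at a time into a larger cluster and {0, 1} is never formed.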
2.1 Canonical Subproblem
The intuition for why an incomplete subset of pairwise similarities is useful can be seen when considering the canonical subproblem where a single cluster is split into two subclusters (where and ). In order to properly resolve the two clusters, it is necessary for the agglomerative clustering algorithm to have enough pairwise similarities to make an informed decision as to which items belong to which subcluster. We define a sampling graph, , resolved from the pairwise similarity observation matrix .
Consider the pairwise similarity observation matrix , then the sampling graph is defined as the graph with nodes where the edge if (i.e., the pairwise similarity was observed) and otherwise (i.e., the pairwise similarity was not observed).
In the context of the sampling graph, , we can state the following proposition with respect to resolving the canonical subproblem.
Consider a cluster consisting of two subclusters, and (such that and ). Then, the agglomerative clustering algorithm will resolve the two subclusters if and only if the sampling subgraphs associated with each subcluster (i.e., for cluster and for cluster ) are both connected.
The intuition behind this proposition is as follows. Any clustering procedure requires enough information to associate each item with the cluster it belongs to. For example, minimum-linkage agglomerative clustering would require that each item observe at least a single similarity with another item in that cluster. But this is not enough, as we also require that one of these two items have an observed similarity with another item in the remainder of the cluster (i.e., one of the items in the cluster, not including the two items paired together), and so on until all the items can be clustered. In terms of the sampling graph, this is the requirement that a path can be found between items in the same cluster (as all of these pairwise similarities will be greater than any other item outside of this cluster, as stated using the TC condition). It follows that a cluster of items will only be returned if the sampling graph is connected between those items. An example of this clustering can be found in Figure 1.
Alternatively, if the sampling graph of a cluster of items is disconnected into two connected components (where and ), then the clustering procedure will not have enough information to merge the items into a single cluster. An example of this incorrect clustering can be found in Figure 2.
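The connectivity requirement in the proposition can be tested directly on an observation mask. The sketch below is a plain breadth-first search restricted to the items of one subcluster; the function names and the list-of-lists mask are assumptions made here for illustration.

```python
from collections import deque


def is_connected_within(mask, items):
    """True if the sampling graph induced on `items` is connected."""
    items = set(items)
    if not items:
        return True
    start = next(iter(items))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in items:
            # Only edges between two items of the same subcluster count.
            if v not in seen and mask[u][v]:
                seen.add(v)
                queue.append(v)
    return seen == items


def subproblem_resolvable(mask, left, right):
    """Connectivity test from the proposition: both induced subgraphs must be connected."""
    return is_connected_within(mask, left) and is_connected_within(mask, right)


# Observed pairs: (0,1), (1,2), (2,3); everything else is unobserved.
mask = [[0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0]]
print(subproblem_resolvable(mask, {0, 1}, {2, 3}))
print(subproblem_resolvable(mask, {0, 2}, {1, 3}))
```

In the second call neither subcluster has an observed similarity between its own two items, so the test fails even though the graph on all four items is connected.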
3 Main Results
When pairwise similarities are observed uniformly at-random, such that each pairwise similarity is observed with probability , the resulting sampling graph can be considered a Bernoulli random graph (where each edge exists with probability ). Using Proposition 1 and prior work on random graph theory , we can state the following theorem.
Consider the quadruple , where the Tight Clustering (TC) condition is satisfied, and is a complete (possibly unbalanced) binary tree that is unknown. Then, the agglomerative clustering algorithm recovers all clusters of size of with probability for given sampling satisfies,
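To build intuition for how the connectivity of a Bernoulli random graph depends on the edge probability, the probability of connectivity can be estimated by simulation. This is a rough Monte Carlo sketch, not the analytical bound used in the paper; the function names, the trial count, and the fixed seed are arbitrary choices made here.

```python
import random


def graph_is_connected(n, p, rng):
    """Sample a Bernoulli random graph on n nodes and report whether it is connected."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:  # edge (i, j) exists with probability p
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri
                    components -= 1
    return components == 1


def estimate_connectivity(n, p, trials=2000, seed=0):
    """Fraction of sampled graphs that are connected."""
    rng = random.Random(seed)
    hits = sum(graph_is_connected(n, p, rng) for _ in range(trials))
    return hits / trials


for p in (0.05, 0.2, 0.5):
    print(p, estimate_connectivity(20, p, trials=500))
```

Running the loop for a few values of the edge probability shows the estimate rising from near zero to near one as the probability grows.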
The first component of Theorem 3.1 requires that the sampling probability is large enough that a path in the sampling graph can be found with high probability for any collection of items of size .
Given a set of items, then the agglomerative clustering algorithm will recover all clusters of size of with probability given the sampling probability satisfies,
We prove this proposition using prior work on random graph theory in the Appendix (Section 5.2). ∎
While Proposition 2 ensures that there are enough pairwise similarities to determine each leaf cluster (down to size ), to resolve the entire tree structure down to clusters of size we additionally require enough similarities to determine the connectivity between these clusters.
Consider the quadruple , where the Tight Clustering (TC) condition is satisfied, and is a complete (possibly unbalanced) binary tree that is unknown. Then, given a set of clusters of size , the clustering structure of pruned to cluster size will be resolved with probability given sampling satisfies,
Consider the clustering structure of and a single cluster of size . At most, there will be other clusters in that must be compared against to construct the clustering hierarchy. Given sampling rate , then at least one item (out of ) must observe at least one pairwise similarity with at least one item in the other cluster. Therefore, to ensure that every cluster satisfies this with probability , using the union bound we require the sampling rate to satisfy,
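The shape of this union-bound argument can be illustrated numerically. The sketch below does not reproduce the paper's expression, which is not shown in this text; it uses only the elementary fact that two clusters with `size_a` and `size_b` items share `size_a * size_b` cross pairs, each observed independently with probability `p`. All function and parameter names (`delta` for the allowed failure probability, `num_pairs` for the number of cluster pairs to be compared) are hypothetical.

```python
def miss_probability(p, size_a, size_b):
    """Probability that no similarity between the two clusters is observed."""
    return (1.0 - p) ** (size_a * size_b)


def union_bound_failure(p, size_a, size_b, num_pairs):
    """Union bound on the chance that some cluster pair has no observed cross similarity."""
    return min(1.0, num_pairs * miss_probability(p, size_a, size_b))


def min_rate_for_pairs(delta, size_a, size_b, num_pairs):
    """Smallest p for which the union bound above is at most delta."""
    return 1.0 - (delta / num_pairs) ** (1.0 / (size_a * size_b))


p = min_rate_for_pairs(delta=0.05, size_a=10, size_b=10, num_pairs=20)
print(p, union_bound_failure(p, 10, 10, 20))
```

Because the number of cross pairs grows with the product of the cluster sizes, the required rate in this toy calculation falls quickly as the clusters get larger.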
Combining the results of Propositions 2 and 3, we find the sampling probability rate necessary to ensure with high probability (i.e., ) that all the clusters of size will be resolved, and the clustering hierarchy between these clusters can be reconstructed. This is shown in Equation 2.
3.1 Sampling Rate Required for Given Cluster Sizes
Using the results from Theorem 3.1, we can state the expected number of pairwise similarity measurements needed to observe clusters down to a specified level. For clusters of size , where and , we find the following,
Consider the quadruple , where the Tight Clustering (TC) condition is satisfied, and is a complete (possibly unbalanced) binary tree that is unknown. Then, agglomerative clustering recovers all clusters of size (where and ) of with probability given that the total number of items and the pairwise similarities are sampled at-random with probability,
For a constant oversampling factor .
The proof of this theorem can be found in the Appendix (Section 5.2). ∎
To see the improvement offered by these bounds, consider the following simple example. We want to reconstruct the hierarchical clustering containing items, specifically recovering all clusters of size (using , , and ) with probability . Given oversampling factor and using the results of Theorem 3.2, we find that resolving the clustering at this resolution requires only a similarity sampling rate of , on average observing pairwise similarities. This is a significant saving over standard techniques that require the entire set of similarities.
And finally we consider the ability to find large clusters of size .
Consider the quadruple , where the Tight Clustering (TC) condition is satisfied, and is a complete (possibly unbalanced) binary tree that is unknown. Then, agglomerative clustering recovers all clusters of size (where ) of with probability given that the number of items and the pairwise similarities are sampled at-random with probability,
For a constant oversampling factor .
The proof follows from Theorem 3.2. ∎
Therefore, we find that to resolve clusters down to size (for ) requires only randomly chosen pairwise similarities in expectation. For example, to resolve the hierarchical clustering down to clusters of size from a set of items requires (on average) less than of the complete set of pairwise similarities to be observed at-random (given ).
Hierarchical clustering from pairwise similarities is found in disparate problems ranging from Internet topology discovery to bioinformatics. Often these applications are limited by a significant cost required to obtain each pairwise similarity. Prior work on efficient clustering required an adaptive regime where targeted measurements were acquired one-at-a-time. In this paper, we consider the more general problem of clustering from a set of incomplete similarities taken at-random. We present provable bounds demonstrating that large clusters can be resolved with only
similarities on average. Future work in this area includes developing efficient clustering techniques that are also robust to outlier measurements and considering alternative sampling methodologies.
5.1 Lemma 1
Given and ,
Consider some constant value , then we want to show that,
Rearranging both sides and given that for ,
Further rearranging and bounding the log function,
If , then this inequality is satisfied as the left-hand term is always negative and the right-hand term is always positive.
Taking the log of both sides,
If , then . Therefore,
If , then , therefore,
Solving numerically, we find that this is satisfied for all if . This proves the result.
5.2 Proof of Proposition 2
We begin by determining the sampling probability, , necessary to ensure that a single set of items can be clustered. This is equivalent to a Bernoulli random graph () of size and probability being connected. From , we bound this probability as,
Where . Simplifying in terms of , and using (for all ),
Considering the entire set of items, there will be at most leaf clusters to resolve (where each cluster has items). Therefore, we bound the probability that is disconnected as,
Where, using the union bound, we state that all clusters containing items will be connected with probability .
Bounding , with
being a dummy variable, then
Therefore, we can solve with respect to the probability of observation, , and dummy variable that,
Then rearranging these terms with respect to ,
In terms of the probability of pairwise similarity observation (, where ),
We note that for any choice of , the first term is monotonically decreasing in , while the second term is monotonically increasing in . To simplify the analysis, we take,
Plugging this value of C into Equation 9 gives us the result.
5.3 Proof of Theorem 3.2
Using Theorem 3.1, we find the required probability of similarity observation for a given . Given that , we can state,
We now show that the specified sampling probability rate, , is greater than all terms inside this bound.
5.3.1 First Term Derivation
The first term of Equation 11 is satisfied if,
Bounding the logarithm function, we can state that . Therefore,
Therefore, this bound holds if .
5.3.2 Second Term Derivation
The second term,
Again, bounding the log function,
We can then lower bound the term . Given that for any ,
And rearranging this term,
Given that the right hand side is always greater than zero, we can then find that the second term holds if,
Therefore, the second term holds if and .
5.3.3 Third Term Derivation
And finally, the third term can be bounded as,
Bounding the log function,
Then we find that this term is satisfied if
By combining all three bounds, we find that sampling with probability will resolve a pruning of the hierarchical clustering down to clusters of size , if and .
-  H. Yu and M. Gerstein, “Genomic Analysis of the Hierarchical Structure of Regulatory Networks,” in Proceedings of the National Academy of Sciences, vol. 103, 2006, pp. 14,724–14,731.
-  J. Ni, H. Xie, S. Tatikonda, and Y. R. Yang, “Efficient and Dynamic Routing Topology Inference from End-to-End Measurements,” in IEEE/ACM Transactions on Networking, vol. 18, February 2010, pp. 123–135.
-  M. Girvan and M. Newman, “Community Structure in Social and Biological Networks,” in Proceedings of the National Academy of Sciences, vol. 99, pp. 7821–7826.
-  R. K. Srivastava, R. P. Leone, and A. D. Shocker, “Market Structure Analysis: Hierarchical Clustering of Products Based on Substitution-in-Use,” in The Journal of Marketing, vol. 45, pp. 38–48.
-  S. Chaudhuri, A. Sarma, V. Ganti, and R. Kaushik, “Leveraging aggregate constraints for deduplication,” in Proceedings of SIGMOD Conference 2007, pp. 437–448.
-  A. Arasu, C. Ré, and D. Suciu, “Large-scale deduplication with constraints using dedupalog,” in Proceedings of ICDE 2009, pp. 952–963.
-  W. Fitch and E. Margoliash, “Construction of Phylogenetic Trees,” in Science, vol. 155, pp. 279–284.
-  T. Hofmann and J. M. Buhmann, “Active Data Clustering,” in Advances in Neural Information Processing Systems (NIPS), 1998, pp. 528–534.
-  N. Grira, M. Crucianu, and N. Boujemaa, “Active Semi-Supervised Fuzzy Clustering,” in Pattern Recognition, vol. 41, May 2008, pp. 1851–1861.
-  B. Eriksson, G. Dasarathy, A. Singh, and R. Nowak, “Active Clustering: Robust and Efficient Hierarchical Clustering using Adaptively Selected Similarities,” in Proceedings of AISTATS 2011, April 2011.
-  J. Bennett and S. Lanning, “The Netflix Prize,” in KDD Cup and Workshop, 2007.
-  M. Balcan and P. Gupta, “Robust Hierarchical Clustering,” in Proceedings of the Conference on Learning Theory (COLT), July 2010.
-  R. Ramasubramanian, D. Malkhi, F. Kuhn, M. Balakrishnan, and A. Akella, “On The Treeness of Internet Latency and Bandwidth,” in Proceedings of ACM SIGMETRICS Conference, Seattle, WA, 2009.
-  Sajama and A. Orlitsky, “Estimating and Computing Density-Based Distance Metrics,” in Proceedings of the 22nd International Conference on Machine Learning (ICML), 2005, pp. 760–767.
-  T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer, 2001.
-  E. Gilbert, “Random Graphs,” in Annals of Mathematical Statistics, vol. 30, no. 4, 1959, pp. 1141–1144.