Clustering is one of the most fundamental and popular methods for data classification. In this paper we initiate a rigorous theoretical study of clustering with the help of an oracle, a model that has recently seen a surge of popular heuristic algorithms.
Footnote: A prior version of this work appeared previously on arXiv; see https://arxiv.org/abs/1604.01839. This paper contains a new efficient Monte Carlo algorithm that has not appeared before, and a stronger lower bound. Some proofs have been rewritten for clarity.
Suppose we are given a set of $n$ points that need to be clustered into $k$ clusters, where $k$ is unknown to us. Suppose there is an oracle that either knows the true underlying clustering or can compute the best clustering under some optimization constraints. We are allowed to query the oracle whether any two points belong to the same cluster or not. What is the minimum number of such queries needed to perform the clustering exactly? The motivation for this problem lies at the heart of modern machine learning applications where the goal is to facilitate more accurate learning from less data by interactively asking for labeled data, e.g., active learning and crowdsourcing. Specifically, automated clustering algorithms that rely only on a similarity matrix often return inaccurate results, whereas obtaining a few labeled data points adaptively can significantly improve their accuracy. Coupled with this observation, clustering with an oracle has generated tremendous interest in the last few years, with an increasing number of heuristics developed for this purpose [21, 37, 13, 39, 40, 18, 36, 12, 20, 28]. The number of queries is a natural measure of “efficiency” here, as it directly relates to the amount of labeled data or the cost of using crowd workers; however, theoretical guarantees on query complexity are lacking in the literature.
On the theoretical side, query complexity, or decision tree complexity, is a classical model of computation that has been extensively studied for different problems [16, 4, 8]. For the clustering problem with $n$ elements and $k$ clusters, one can easily obtain an upper bound of $O(nk)$ on the query complexity, and it is achievable even when $k$ is unknown [37, 13]: to cluster an element at any stage of the algorithm, ask one query per existing cluster with this element (this is sufficient due to transitivity), and start a new cluster if all queries are negative. It turns out that $\Omega(nk)$ is also a lower bound, even for randomized algorithms (see, e.g., ). In contrast, the heuristics developed in practice often ask significantly fewer queries than $nk$. What could be a possible reason for this deviation between theory and practice?
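The baseline procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `cluster_with_oracle`, the callback `same_cluster`, and the toy `labels` list used to simulate the oracle are all assumptions introduced here.

```python
from typing import Callable, List, Tuple


def cluster_with_oracle(
    n: int, same_cluster: Callable[[int, int], bool]
) -> Tuple[List[List[int]], int]:
    """Cluster elements 0..n-1 using only pairwise same-cluster queries.

    Each new element is compared with one representative of every existing
    cluster; by transitivity a single representative per cluster suffices.
    Returns the clusters and the number of oracle queries made.
    """
    clusters: List[List[int]] = []
    queries = 0
    for u in range(n):
        for cluster in clusters:
            queries += 1
            if same_cluster(u, cluster[0]):
                cluster.append(u)
                break
        else:
            # All queries were negative: u starts a new cluster.
            clusters.append([u])
    return clusters, queries


# Hypothetical ground truth, used only to simulate the oracle.
labels = [0, 1, 0, 2, 1, 0]
clusters, queries = cluster_with_oracle(
    len(labels), lambda u, v: labels[u] == labels[v]
)
```

Each element triggers at most one query per existing cluster, so the total never exceeds the number of elements times the number of clusters.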
Before delving into this question, let us look at a motivating application that drives this work.
A Motivating Application: Entity Resolution. Entity resolution (ER, also known as record linkage) is a fundamental problem in data mining and has been studied since 1969. The goal of ER is to identify and link/group different manifestations of the same real-world object, e.g., different ways of addressing the same person (names, email addresses, Facebook accounts), Web pages with different descriptions of the same business, different photos of the same object, etc. (see the excellent survey by Getoor and Machanavajjhala). However, the lack of an ideal similarity function to compare objects makes ER an extremely challenging task. For example, DBLP, the popular computer science bibliography dataset, is filled with ER errors. It is common for DBLP to merge publication records of different persons if they share similar attributes (e.g., the same name), or to split the publication record of a single person due to a slight difference in representation (e.g., Marcus Weldon vs. Marcus K. Weldon). In recent years, a popular trend to improve ER accuracy has been to incorporate human wisdom. The works of [39, 40, 37] (and many subsequent works) use a computer-generated similarity matrix to come up with a collection of pairwise questions that are asked interactively to a crowd. The goal is to minimize the number of queries to the crowd while maximizing the accuracy. This is analogous to our interactive clustering framework. But intriguingly, as shown by extensive experiments on various real datasets, these heuristics use far fewer queries than the theoretical lower bound would suggest [39, 40, 37]. On close scrutiny, we find that all of these heuristics use some computer-generated similarity matrix to guide the selection of queries. Could these similarity matrices, aka side information, be the reason behind the deviation and the significant reduction in query complexity?
Let us call this clustering using side information, where the clustering algorithm has access to a similarity matrix. This can be generated directly from the raw data (e.g., by applying Jaccard similarity on the attributes), or by using a crude classifier trained on a very small set of labelled samples. Let us assume the following generative model of side information: a noisy weighted upper-triangular similarity matrix $W = \{w_{i,j}\}$, where $w_{i,j}$ is drawn from a probability distribution $f_+$ if $i, j$ belong to the same cluster, and from $f_-$ otherwise. However, the algorithm designer is given only the similarity matrix, without any information on $f_+$ and $f_-$. In this work, one of our major contributions is to show the separation in query complexity of clustering with and without such side information. Indeed, the recent works of [18, 32] analyze the popular heuristic algorithms of [37, 40], with probability distributions obtained from real datasets, and show that these heuristics are significantly suboptimal even for very simple distributions. To the best of our knowledge, before this work there existed no algorithm that works for arbitrary unknown distributions with near-optimal performance. We develop a generic framework for proving information-theoretic lower bounds for interactive clustering using side information, and design efficient algorithms for arbitrary $f_+$ and $f_-$ that nearly match the lower bound. Moreover, our algorithms are parameter free, that is, they work without any knowledge of $f_+$, $f_-$, or the number of clusters $k$.
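To make the generative model concrete, the following sketch samples an upper-triangular similarity matrix from a known clustering. It is illustrative only: the names `sample_side_information`, `f_plus`, and `f_minus`, the dictionary representation of the matrix, and the toy inputs are assumptions made for this example. In the setting of the paper the algorithm sees only the sampled values, never the two distributions.

```python
import random
from typing import Dict, List, Sequence, Tuple


def sample_side_information(
    labels: Sequence[int],
    support: Sequence[float],
    f_plus: Sequence[float],
    f_minus: Sequence[float],
    seed: int = 0,
) -> Dict[Tuple[int, int], float]:
    """Sample w[(i, j)] for i < j from f_plus or f_minus.

    f_plus and f_minus are pmfs over `support`; the first is used for pairs
    in the same cluster, the second for pairs in different clusters.
    """
    rng = random.Random(seed)
    n = len(labels)
    w: Dict[Tuple[int, int], float] = {}
    for i in range(n):
        for j in range(i + 1, n):
            pmf = f_plus if labels[i] == labels[j] else f_minus
            w[(i, j)] = rng.choices(support, weights=pmf, k=1)[0]
    return w


# Toy instance: two clusters, binary similarity values (as in a block model).
toy_labels: List[int] = [0, 0, 0, 1, 1]
toy_w = sample_side_information(
    toy_labels, support=[0, 1], f_plus=[0.3, 0.7], f_minus=[0.8, 0.2]
)
```

With a two-point support this reduces to a stochastic-block-model style graph; richer supports model graded similarity scores.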
Connection to popular community detection models. The model of side information considered in this paper is a direct and significant generalization of the planted partition model, also known as the stochastic block model (SBM) [27, 15, 14, 2, 1, 24, 23, 11, 33]. The stochastic block model is an extremely well-studied model of random graphs used for modeling communities in the real world, and is a special case of the similarity matrix we consider. In SBM, two vertices within the same community share an edge with probability $p$, and two vertices in different communities share an edge with probability $q$; that is, the same-cluster distribution $f_+$ is Bernoulli($p$) and the different-cluster distribution $f_-$ is Bernoulli($q$). It is often assumed that $k$, the number of communities, is a constant (e.g., $k=2$ is known as the planted bisection model and is studied extensively [1, 33, 15]) or a slowly growing function of the number of vertices (e.g. ). The points are assigned to clusters according to a probability distribution indicating the relative sizes of the clusters. In contrast, not only can $f_+$ and $f_-$ be arbitrary probability mass functions (pmfs) in our model, but we also do not have to make any assumption on $k$ or the cluster size distribution, and can allow for any partitioning of the set of elements (i.e., an adversarial setting). Moreover, $f_+$ and $f_-$ are unknown. For SBM, parameter-free algorithms have become known only relatively recently, for a constant number of linear-sized clusters [3, 23].
There is an extensive literature that characterizes the threshold phenomenon in SBM in terms of the within-community and across-community edge probabilities, for exact and approximate recovery of clusters when relative cluster sizes are known and nearly balanced (e.g., see  and the references therein). For two equal-sized clusters, sharp thresholds are derived in [1, 33] for a specific sparse region of these probabilities. (Footnote: Most recent works consider the region of interest as and for some .) In a more general setting, the vertices in the th and the th communities are connected with probability , and threshold results for the sparse region have been derived in ; our model can accommodate this as a special case when we have pmfs denoting the distributions of the corresponding random variables. If an oracle gives us some of the pairwise binary relations between elements (whether they belong to the same cluster or not), the threshold of SBM must also change. But by what amount? This connection to SBM could be of independent interest for studying the query complexity of interactive clustering with side information, and our work opens up many possibilities for future directions.
Developing lower bounds in the interactive setting appears to be significantly challenging, as algorithms may choose to obtain any deterministic information adaptively by querying, and standard lower bounding techniques based on Fano-type inequalities [9, 30] do not apply. One of our major contributions in this paper is to provide a general framework for proving information-theoretic lower bounds for interactive clustering algorithms, which holds even for randomized algorithms, and even with full knowledge of and . In contrast, our algorithms are computationally efficient and parameter free (they work without knowing and ). The technique that we introduce for our upper bounds could be useful for designing further parameter-free algorithms, which are extremely important in practice.
Other Related works. The interactive framework of clustering has been studied before in a model where the oracle is given the entire clustering and can answer whether a cluster needs to be split or two clusters must be merged [7, 6]. Here we restrict our attention to pairwise queries, as in all practical applications that motivate this work [39, 40, 21, 37]. In most cases, an expert human or a crowd serves as the oracle. Due to the scale of the data, it is often not possible for such an oracle to answer queries on a large number of input data points. Only recently, some heuristic algorithms with -wise queries for small values of but have been proposed in , and a non-interactive algorithm that selects random triangle queries has been analyzed in . Perhaps conceptually closest to ours is a recent work by Ashtiani et al. , which was done independently of ours and appeared subsequent to a previous version of this work . In , pairwise queries for clustering are considered. However, their setting is very different. They consider the specific NP-hard $k$-means objective with a distance matrix that must be a metric and must satisfy a deterministic separation property. Their lower bounds are computational and not information theoretic; moreover, their algorithm must know the parameters. There exists a significant gap between their lower and upper bounds: vs , and it would be interesting to see whether our techniques can be applied to improve this.
Here we have assumed that the oracle always returns the correct answer. To deal with the possibility that the crowdsourced oracle may give wrong answers, there are simple majority voting mechanisms or more complicated techniques [36, 12, 20, 28, 10, 38] to handle such errors. If we assume the errors are independent, since answers are collected from independent crowd workers, then we can simply ask each query times and take the majority vote as the correct answer, with correctness following from the Chernoff bound. Here our main objective is to study the power of side information, and we do not consider the more complex scenarios of handling erroneous oracle answers.
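The majority-voting idea mentioned above can be illustrated as follows. This is a sketch under stated assumptions: the helper names `majority_vote_query` and `make_noisy_oracle`, the independent-flip error model, and the specific repetition count are hypothetical choices for the example and are not taken from the paper.

```python
import random
from typing import Callable, Sequence


def majority_vote_query(
    noisy_oracle: Callable[[int, int], bool], u: int, v: int, repetitions: int
) -> bool:
    """Ask the same pairwise query several times and return the majority answer."""
    if repetitions < 1 or repetitions % 2 == 0:
        raise ValueError("repetitions must be a positive odd integer")
    yes_votes = sum(1 for _ in range(repetitions) if noisy_oracle(u, v))
    return yes_votes > repetitions // 2


def make_noisy_oracle(
    labels: Sequence[int], error_prob: float, seed: int = 0
) -> Callable[[int, int], bool]:
    """Simulated oracle whose answer is flipped independently with error_prob."""
    rng = random.Random(seed)

    def oracle(u: int, v: int) -> bool:
        truth = labels[u] == labels[v]
        return (not truth) if rng.random() < error_prob else truth

    return oracle


# Hypothetical usage: each answer is wrong with probability 0.2, asked 15 times.
noisy = make_noisy_oracle([0, 0, 1], error_prob=0.2)
answer = majority_vote_query(noisy, 0, 1, repetitions=15)
```

When individual errors are independent and occur with probability below one half, the chance that the majority is wrong decays exponentially in the number of repetitions, which is the role the Chernoff bound plays in the argument.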
Contributions. Formally the problem we study in this paper can be described as follows.
Problem 1 (Query-Cluster with an Oracle).
Consider a set of elements with latent clusters , , where is unknown. There is an oracle that, when queried with a pair of elements , returns iff and belong to the same cluster, and iff and belong to different clusters. The queries can be done adaptively. Consider the side information , where the th entry of , is a random variable drawn from a discrete probability distribution if belong to the same cluster, and is drawn from a discrete probability distribution if belong to different clusters. The parameters and are unknown. Given and , find such that is minimum, and from the oracle answers and it is possible to recover , .
Footnotes: Our lower bound holds for continuous distributions as well. For simplicity of expression, we treat the sample space as being of constant size; however, all our results extend to any finite sample space, scaling linearly with its size.
Without side information, as noted earlier, it is easy to see an algorithm with query complexity for Query-Cluster. When no side information is available, it is also not difficult to have a lower bound of on the query complexity. Our main contributions are to develop strong information theoretic lower bounds as well as nearly matching upper bounds when side information is available, and characterize the effect of side information on query complexity precisely.
Upper Bound (Algorithms). We show that with side information , a drastic reduction in the query complexity of clustering is possible, even with unknown parameters , , and . We propose a Monte Carlo randomized algorithm that reduces the number of queries from to , where is the Hellinger divergence between the probability distributions and , and recovers the clusters accurately with high probability (with success probability ) without knowing , or (see Theorem 1). Depending on the value of , this could be highly sublinear in . Note that the squared Hellinger divergence between two pmfs and is defined to be,
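For reference, one common form of the squared Hellinger divergence between two pmfs $p$ and $q$ on a finite support is $\frac{1}{2}\sum_i \bigl(\sqrt{p(i)} - \sqrt{q(i)}\bigr)^2$, which lies in $[0, 1]$. The factor $\frac{1}{2}$ is an assumption here, since conventions differ on the normalization. A minimal Python sketch with the hypothetical helper name `squared_hellinger`:

```python
import math
from typing import Sequence


def squared_hellinger(p: Sequence[float], q: Sequence[float]) -> float:
    """Squared Hellinger divergence between two pmfs on the same finite support.

    Uses the 1/2 normalization (an assumption), so the value lies in [0, 1]:
    0 for identical pmfs and 1 for pmfs with disjoint supports.
    """
    if len(p) != len(q):
        raise ValueError("pmfs must be defined over the same support")
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


# Hypothetical same-cluster and different-cluster pmfs over three values.
h2 = squared_hellinger([0.2, 0.3, 0.5], [0.5, 0.3, 0.2])
```

The closer the two distributions, the smaller this quantity, and the more queries one should expect to need.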
We also develop a Las Vegas algorithm, that is, one which recovers the clusters with probability (and not just with high probability), with query complexity . Since and can be arbitrary, not knowing the distributions poses a major challenge, and we believe our recipe could be fruitful for designing further parameter-free algorithms. We note that all our algorithms are computationally efficient; in fact, the time required is bounded by the size of the side information matrix, i.e., .
Theorem 1.
Let the number of clusters be unknown, and let and be unknown discrete distributions with fixed cardinality of support. There exists an efficient (polynomial-time) Monte Carlo algorithm for Query-Cluster that has query complexity and recovers all the clusters accurately with probability . In addition, there exists an efficient Las Vegas algorithm that with probability has query complexity .
Lower Bound. Our main lower bound result is information theoretic, and can be summarized in the following theorem. Note especially that for the lower bound we can assume knowledge of , in contrast to the upper bounds, which makes the result stronger. In addition, and can be discrete or continuous distributions. Note that when is close to , e.g., when the side information is perfect, no queries are required. However, that is not the case in practice, and we are interested in the region where and are “close”, that is, is small.
Theorem 2.
Assume . Any (possibly randomized) algorithm with the knowledge of and the number of clusters , that does not perform expected number of queries, will be unable to return the correct clustering with probability at least . And to recover the clusters with probability , the number of queries must be .
The lower bound therefore matches the query complexity upper bound within a logarithmic factor.
Note that when no querying is allowed, this turns out to be exactly the setting of the stochastic block model, though with much more general distributions. We have analyzed this case in Appendix A. To see how the probability of error must scale, we have used a generalized version of Fano’s inequality (e.g., ). However, when the number of queries is greater than zero, and when queries can be adaptive, any such standard technique fails. Hence, significant effort has to be put forth to construct a setting where information-theoretic minimax bounds can be applied. This lower bound could be of independent interest, and provides a general framework for deriving lower bounds for fundamental problems of classification, hypothesis testing, distribution testing, etc. in the interactive learning setting. It may also lead to new lower bound proving techniques in the related multi-round communication complexity model, where information again gets revealed adaptively.
Organization. The proof of the lower bound is provided in Section 2, and the algorithms are presented in Section 3. Section 3.1 contains the Monte Carlo algorithm. The Las Vegas algorithm is presented in Section 3.3. A generalization of the stochastic block model and exciting future directions are discussed in Appendices A and B.
2. Lower Bound (Proof of Theorem 2)
In this section, we develop our information theoretic lower bounds. We prove a more general result from which Theorem 2 follows easily.
Lemma 1.
Consider the case when we have equally sized clusters of size each (that is, the total number of elements is ). Suppose we are allowed to make at most adaptive queries to the oracle. The probability of error for any algorithm for Query-Cluster is at least,
The main high-level technique to prove Lemma 1 is the following. Suppose a node is to be assigned to a cluster. This situation is akin to a multi-hypothesis testing problem with one hypothesis per cluster, and we want to use a lower bound on the probability of error. The side information and the query answers constitute a random vector whose distributions (among the possible) must be far apart for us to successfully identify the clustering. But the main challenge comes from the interactive nature of the algorithm, since it reveals deterministic information, and from characterizing the set of elements that are not queried much by the algorithm.
Proof of Lemma 1.
Since the total number of queries is , the average number of queries per element is at most . Therefore there exist at least elements that are queried at most times. Let be one such element. We consider only the problem of assigning to a cluster (assume that otherwise the clustering is done), and show that any algorithm will make a wrong assignment with positive probability.
Step 1: Setting up the hypotheses. Note that the side information matrix is provided, where the s are independent random variables. Now assume the scenario in which we use an algorithm ALG to assign to one of the clusters, . Therefore, given , ALG takes as input the random variables s where , makes some queries involving , and outputs a cluster index, which is an assignment for . Based on the observations s, the task of ALG is thus a multi-hypothesis test among hypotheses. Let denote the different hypotheses . And let denote the joint probability distributions of the random matrix when . In short, for any event , . Going forward, the subscript of probabilities or expectations will denote the appropriate conditional distribution.
Step 2: Finding “weak” clusters. There must exist such that,
We now find a subset of clusters, that are “weak,” i.e., not queried enough if were true. Consider the set where . We must have, which implies,
Now, to output a cluster without using the side information, ALG has to either make a query to the actual cluster the element is from, or query at least times. In any other case, ALG must use the side information (in addition to using queries) to output a cluster. Let denote the event that ALG outputs cluster by using the side information. Let Since, we must have, This means, contains more than elements. Since there are elements that are queried at most times, these two sets must have nonzero intersection. Hence, we can assume that, for some , i.e., let be the true hypothesis. Now we characterize the error events of the algorithm ALG in assignment of .
Step 3: Characterizing error events for “”. We now consider the following two events.
Note that if the algorithm ALG can correctly assign to a cluster without using the side information, then one of or must happen. Recall that denotes the event that ALG outputs cluster using the side information. Now consider the event The probability of correct assignment is at most We now bound this probability of correct recovery from above.
Step 4: Bounding probability of correct recovery via Hellinger distance. We have,
where denotes the total variation distance between two probability distributions and , and in the last step we have used the relationship between total variation distance and the Hellinger divergence (see, for example, [35, Eq. (3)]). Now, recall that and are the joint distributions of the independent random variables. We use the fact that the squared Hellinger divergence between product distributions of independent random variables is at most the sum of the squared Hellinger divergences between the individual distributions. We also note that the divergence between identically distributed random variables is $0$. We obtain
This is true because the only times when differs under and under is when or As a result we have, Now, using Markov’s inequality Therefore,
Therefore, putting the value of we get, which proves the lemma. ∎
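The two Hellinger facts used in Step 4 can be checked numerically on small examples. The sketch below assumes the $\frac{1}{2}$-normalized squared Hellinger divergence, under which the total variation distance is at most $\sqrt{2}$ times the Hellinger distance, and the squared divergence of a product distribution is at most the sum over coordinates. Whether these are exactly the forms used in the cited reference is an assumption, and the pmfs `ps` and `qs` are made-up test data.

```python
import itertools
import math
from typing import List, Sequence


def squared_hellinger(p: Sequence[float], q: Sequence[float]) -> float:
    """Squared Hellinger divergence with the 1/2 normalization (assumed)."""
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two pmfs on the same support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def product_pmf(pmfs: Sequence[Sequence[float]]) -> List[float]:
    """Joint pmf of independent coordinates, enumerated in a fixed order."""
    return [math.prod(combo) for combo in itertools.product(*pmfs)]


# Made-up coordinate distributions; the middle coordinate is identical in both.
ps = [[0.7, 0.3], [0.5, 0.5], [0.9, 0.1]]
qs = [[0.4, 0.6], [0.5, 0.5], [0.8, 0.2]]

h2_terms = [squared_hellinger(p, q) for p, q in zip(ps, qs)]
h2_sum = sum(h2_terms)
h2_joint = squared_hellinger(product_pmf(ps), product_pmf(qs))
tv_joint = total_variation(product_pmf(ps), product_pmf(qs))

subadditive = h2_joint <= h2_sum + 1e-12
tv_bounded = tv_joint <= math.sqrt(2.0 * h2_joint) + 1e-12
```

Identical coordinates contribute nothing to the sum, which is why only the entries whose distribution actually changes between two hypotheses matter in the bound.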
Proof of Theorem 2.
Suppose . Then , since . Also, we can take , since otherwise the theorem is already proved from the lower bound. Consider the situation when we are already given a complete cluster with elements, the remaining clusters each have 1 element, and the rest of the elements are evenly distributed (but yet to be assigned) to the clusters. Now we are exactly in the situation of Lemma 1 with playing the role of . If we have , the probability of error is at least , where is a term that goes to with . Therefore must be . Note that in this proof we have not tried to optimize the constants.
If we want to recover the clusters with probability , then is a trivial lower bound. Hence, coupled with the above we get a lower bound of in that case. ∎
We propose two algorithms (Monte Carlo and Las Vegas), both of which are completely parameter free, that is, they work without any knowledge of and , and meet the respective lower bounds within an factor. We first present the Monte Carlo algorithm, which drastically reduces the number of queries from (no side information) to and recovers the clusters exactly with probability at least . Next, we present our Las Vegas algorithm.
Our algorithm uses a subroutine called Membership that takes as input an element and a subset of elements Assume that are discrete distributions over a fixed set of points ; that is, takes value in the set Define the empirical “inter” distribution for Also compute the “intra” distribution for Then we use Membership() = as the affinity of vertex to , where denotes the Hellinger divergence between distributions. Note that since the membership is always negative, a higher membership implies that the ‘inter’ and ‘intra’ distributions are closer in terms of Hellinger distance.
Designing a parameter-free Monte Carlo algorithm seems to be highly challenging, as here the number of queries depends only logarithmically on . Intuitively, if an element has the highest membership in some cluster , then should be queried with first. Also, an estimate from side information is reliable only when the cluster already has enough members. Unfortunately, we neither know whether the current cluster size is reliable, nor are we allowed to make even one query per element.
To overcome this bottleneck, we propose an iterative-update algorithm which we believe will find more uses in developing parameter-free algorithms. We start by querying a few points so that there is at least one cluster with points. Now, based on these queried memberships, we learn two empirical distributions: from intra-cluster similarity values, and from inter-cluster similarity values. Given an element which has not been clustered yet, and a cluster with the highest number of current members, we would like to consider the submatrix of side information pertaining to and all and determine whether that side information is generated from or . We know that if the statistical distance between and is small, then we would need more members in to successfully do this test. Since we do not know and , we compute the squared Hellinger divergence between and , and use that to compute a threshold on the size of . If crosses this size threshold, we just use the side information to determine whether should belong to . Otherwise, we query further until there is one cluster with size , and re-estimate the empirical distributions and . Again, we recompute a threshold , and stop if the cluster under consideration crosses this new threshold. If not, we continue. Interestingly, we can show that when the process converges, we have a very good estimate of , and moreover that it converges fast.
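One way to read the Membership subroutine is sketched below: build the empirical distribution of similarity values between the element and the cluster ("inter"), build the empirical distribution of similarity values inside the cluster ("intra"), and return the negated divergence between them. The exact normalizations, and whether the squared or unsquared Hellinger quantity is intended, are assumptions of this sketch; the names `membership`, `empirical_pmf`, the dictionary `w`, and the toy data are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def squared_hellinger(p: Sequence[float], q: Sequence[float]) -> float:
    """Squared Hellinger divergence with the 1/2 normalization (assumed)."""
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


def empirical_pmf(values: Sequence[int], support: Sequence[int]) -> List[float]:
    """Relative frequency of each support point among the observed values."""
    counts = Counter(values)
    return [counts[a] / len(values) for a in support]


def membership(
    u: int,
    cluster: Sequence[int],
    w: Dict[Tuple[int, int], int],
    support: Sequence[int],
) -> float:
    """Negated divergence between the 'inter' and 'intra' empirical pmfs.

    w maps pairs (i, j) with i < j to similarity values. Values closer to 0
    mean that u looks more like a member of the cluster.
    """
    if u in cluster or len(cluster) < 2:
        raise ValueError("need an outside element and a cluster of size >= 2")
    inter = empirical_pmf([w[(min(u, v), max(u, v))] for v in cluster], support)
    intra = empirical_pmf(
        [w[pair] for pair in combinations(sorted(cluster), 2)], support
    )
    return -squared_hellinger(inter, intra)


# Toy data: elements 0, 1, 2 form a cluster; 3 resembles it, 4 does not.
w = {(0, 1): 1, (0, 2): 1, (1, 2): 1,
     (0, 3): 1, (1, 3): 1, (2, 3): 1,
     (0, 4): 0, (1, 4): 0, (2, 4): 0, (3, 4): 0}
score_similar = membership(3, [0, 1, 2], w, support=[0, 1])
score_dissimilar = membership(4, [0, 1, 2], w, support=[0, 1])
```

In this toy instance the element whose similarity values match the cluster's internal values receives the larger (less negative) score.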
3.1. Monte Carlo Algorithm
The algorithm has several phases.
Phase 1. Initialization. We initialize the algorithm by selecting any vertex and creating a singleton cluster . We then keep selecting, uniformly at random, new vertices that have not yet been clustered, and query the oracle with each of them by choosing exactly one vertex from each of the clusters formed so far. If the oracle returns to any of these queries then we include the vertex in the corresponding cluster; otherwise we create a new singleton cluster with it. We continue this process until at least one cluster has grown to a size of , where is an appropriately chosen constant depending on (the precise value of can be deduced from the proof).
The number of queries made in Phase 1 is at most .
We stop the process as soon as a cluster has grown to size of . Therefore, we may have clustered at most vertices at this stage, each of which may have required queries to the oracle, one for every cluster. ∎
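Phase 1 can be sketched as the baseline query procedure with an early stopping rule. The size at which the paper stops is not reproduced here; the parameter `target_size` below is a hypothetical stand-in for it, and `phase_one`, `same_cluster`, and the toy labels are likewise names introduced only for this illustration.

```python
import random
from typing import Callable, List, Tuple


def phase_one(
    n: int,
    same_cluster: Callable[[int, int], bool],
    target_size: int,
    seed: int = 0,
) -> Tuple[List[List[int]], int]:
    """Query randomly ordered vertices until some cluster reaches target_size.

    Each selected vertex is queried against one representative per existing
    cluster. Returns the partial clustering and the number of queries used.
    """
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)  # uniformly random order over unclustered vertices
    clusters: List[List[int]] = []
    queries = 0
    for u in order:
        for cluster in clusters:
            queries += 1
            if same_cluster(u, cluster[0]):
                cluster.append(u)
                break
        else:
            clusters.append([u])
        if max(len(c) for c in clusters) >= target_size:
            break  # stop as soon as one cluster has grown enough
    return clusters, queries


# Hypothetical instance: 30 elements in 3 equal clusters, stop at size 4.
toy_labels = [i % 3 for i in range(30)]
partial, used = phase_one(
    30, lambda u, v: toy_labels[u] == toy_labels[v], target_size=4
)
```

The query count is bounded by the number of vertices processed times the number of clusters discovered so far, mirroring the counting argument in the proof above.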
Phase 2. Iterative Update. Let be the set of clusters formed after the th iteration for some , where we consider Phase as the -th iteration. We estimate
If there is no cluster of size at least formed so far, we select a new vertex yet to be clustered and query it exactly once with the existing clusters (that is by selecting one arbitrary point from every cluster and querying the oracle with the new vertex and the selected one), and include it in an existing cluster or create a new cluster with it based on the query answer. We then set and move to the next iteration to get updated estimates of and .
Else if there is a cluster of size at least , we stop and move to the next phase.
Phase 3. Processing the grown clusters. Once Phase 2 has converged, let and be the final estimates. For every cluster of size , we call it grown and do the following.
(3A.) For every unclustered vertex , if , then we include in without querying.
(3B.) We create a new list , initially empty. If
then we include in . For every vertex in , we query the oracle with it by choosing exactly one vertex from each of the clusters formed so far starting with . If oracle returns answer “yes” to any of these queries then we include the vertex in that cluster, else we create a new singleton cluster with it. We continue this until is exhausted.
We then call completely grown, remove it from further consideration, and move to the next grown cluster. If there is no other grown cluster, then we move back to Phase 2.
One of the important tools that will be used in this section is Sanov’s theorem from the large-deviation theory.
Lemma 2 (Sanov’s theorem).
Let be i.i.d. random variables with a finite sample space and distribution . Let denote their joint distribution. Let be a set of probability distributions on . The empirical distribution gives probability to any event . Then,
A continuous version of Sanov’s theorem is also possible, especially when the set is convex (as a matter of fact, the polynomial term in front of the right-hand side can be omitted in certain cases), but we omit it here for clarity. Sanov’s theorem states that if we have an empirical distribution and a set of all distributions satisfying a certain property , then the probability decreases exponentially with the minimum KL divergence of with any distribution in . Note that the KL divergence in the exponent of Sanov’s theorem naturally indicates an upper bound in terms of KL divergence. However, a major difficulty in dealing with KL divergence is that it is not a distance and does not satisfy the triangle inequality. We overcome this by working with the Hellinger distance instead.
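For reference, the standard finite-alphabet form of Sanov’s bound, as found in information theory textbooks, can be written as follows. The notation is an assumption made here ($X_1, \dots, X_n$ i.i.d. with distribution $P$ on a finite set $\mathcal{X}$, empirical distribution $\hat{P}_n$, a set of distributions $\mathcal{E}$, and KL divergence $D$ measured in nats) and may differ from the symbols used in the lemma above.

```latex
% Standard finite-alphabet Sanov bound; notation assumed, not taken from the source.
\[
  P^{n}\bigl(\hat{P}_n \in \mathcal{E}\bigr)
  \;\le\;
  (n+1)^{|\mathcal{X}|}\,
  \exp\Bigl(-\,n \inf_{P^{*} \in \mathcal{E}} D\bigl(P^{*} \,\|\, P\bigr)\Bigr).
\]
```

The factor $(n+1)^{|\mathcal{X}|}$ is the polynomial term referred to above, and the exponent is governed by the distribution in $\mathcal{E}$ that is closest to $P$ in KL divergence.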
There are two parts to the analysis, showing the clusters are correct with high probability and determining the query complexity.
With probability at least , all of the following hold for an appropriately chosen constant
Let be a cluster which, according to the updated estimates of and , has crossed the updated threshold. Since , is estimated based on at least edges. We assume the largest cluster size in the input instance is at most . (Footnote: We could also have assumed the largest cluster size is at most for some constant , adjusting the constants appropriately.) Suppose the total number of vertices selected in Phase 1 and Phase 2 before grew to is strictly less than . Then the expected number of vertices selected from is at most . Then, by the Chernoff bound, the probability that the number of vertices selected from is is at most . Taking , we get that with probability at least , the number of vertices chosen from outside is at least . Thus, is estimated based on at least edges.
Here, we use the following version of the Chernoff bound. (Footnote: This version of the Chernoff bound also holds for sampling without replacement, which is the case here.)
Lemma 4 (The Chernoff Bound).
Let be independent random variables taking values in with . Let , and . Then the following holds
(a) Let . Now, select , where is a constant that ensures , also .
Here in the last step we have used Sanov’s theorem (see, Lemma 2). Using the relationship between KL-divergence and Hellinger distance, we get
where in the last step we used the optimization condition under the Sanov’s theorem. Setting , , we get . Let us take , and , we have and we get
(b) Following a similar argument as above, we get
Hence, by union bound all of (a), (b) and (c) hold with probability at least . ∎
Let be a cluster considered in Phase of size at least ; then the following holds with probability at least .
(a) If then is in
(b) If then
Suppose . Then for any , we have
Setting and noting the value of , we get
Therefore, with probability at least (by applying the union bound over all ), the following hold: (i) if then , and (ii) if then .
(a) We have , that is . Suppose if possible . Then, we have
Then we have,
This contradicts that .
(b) Now assume but