Clustering is one of the most central problems in unsupervised learning. A clustering problem is typically represented by a set of elements together with a notion of similarity (or dissimilarity) between them. When the elements are points in a metric space, dissimilarity can be measured via a distance function. In more general settings, when the elements to be clustered are members of an abstract set, similarity is defined by an arbitrary symmetric function defined on pairs of distinct elements in . Correlation Clustering (CC) bansal2004correlation is a well-known special case where is a -valued function establishing whether any two distinct elements of are similar or not. The objective of CC is to cluster the points in so as to maximize the correlation with . More precisely, CC seeks a clustering minimizing the number of errors, where an error is any pair of elements that are dissimilar yet belong to the same cluster, or similar yet belong to different clusters. Importantly, there are no a priori limitations on the number of clusters or their sizes: all partitions of including the trivial ones are valid clusterings. Given and , the error achieved by an optimal clustering is known as the Correlation Clustering index, denoted by . A convenient way of representing is through a graph where iff . Note that is equivalent to a perfectly clusterable graph (i.e., is the union of disjoint cliques). Since its introduction, CC has attracted a lot of interest in the machine learning community, and has found numerous applications in entity resolution getoor2012entity, image analysis kim2011higher, and social media analysis tang2016survey. Known problems in data integration cohen2002learning and biology ben1999clustering can be cast into the framework of CC Sammut:2010.
From a machine learning viewpoint, we are interested in settings where the binary similarity function defining a CC instance is not available beforehand, and a learning algorithm can query the value of on arbitrary pairs in . This can be viewed as an active learning protocol, where the learner’s goal is to trade off the clustering error against the number of queries. This setting is motivated by scenarios in which the similarity information is costly to obtain. For example, deciding on the content similarity between two documents may require a complex computation, and possibly interaction with human experts.
|Running time||Expected clustering error||Reference|
|Q + LP solver + rounding||cesa2012correlation|
|Q||Theorem 1 (see also bonchi2013local)|
|Exponential ()||Theorem 7|
|Unrestricted ()||Theorem 8|
|Unrestricted ()||Theorem 9|
In this work we characterize the trade-off between the number of queries and the clustering error on points —see Table 1 for a summary of our results in the context of previous work. Recall that minimizing the correlation clustering error is APX-hard charikar2005clustering, and the best efficient algorithm found so far achieves Chawla:2015. This almost matches the best possible approximation factor achievable by LP methods charikar2005clustering. A very simple and elegant query-based algorithm for approximating CC is KwikCluster Ailon2008. In each round , the algorithm draws a random pivot from and queries the similarities between and every other . Then, a cluster is created containing the pivot and all the points with positive similarity with the pivot, . The algorithm is then recursively invoked on . On any instance of CC, KwikCluster achieves an expected error bounded by . However, it is easy to see that the number of queries made by KwikCluster is in expectation, where and is the expected number of clusters found, which is in the worst case (e.g., if is the constant function and thus ).
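The pivot-based recursion of KwikCluster described above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function name `kwik_cluster` and the oracle `similar(u, v)` (returning `True` iff the pair has positive similarity) are hypothetical names introduced here, and the recursion is unrolled into a loop.

```python
import random

def kwik_cluster(nodes, similar, rng=None):
    """Sketch of KwikCluster. `similar` is a hypothetical similarity oracle.

    Returns the list of clusters and the number of oracle queries made.
    """
    rng = rng or random.Random()
    remaining = list(nodes)
    clusters = []
    n_queries = 0
    while remaining:
        pivot = rng.choice(remaining)        # random pivot from the residual set
        cluster, rest = [pivot], []
        for v in remaining:
            if v == pivot:
                continue
            n_queries += 1                   # one query per remaining node
            (cluster if similar(pivot, v) else rest).append(v)
        clusters.append(cluster)
        remaining = rest                     # recurse on the nodes not clustered
    return clusters, n_queries
```

On a perfectly clusterable instance (a disjoint union of cliques) every pivot recovers its whole clique, so the output does not depend on the random choices; only the number of queries does.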
Our first contribution is a variant of KwikCluster, which we call , with an expected clustering error of , where is a deterministic bound on the number of queries. When , reduces to KwikCluster, and our analysis recovers KwikCluster’s bound on the expected clustering error. Representing as a graph with edges between similar pairs, we also prove that natively yields low error on a per-cluster basis, for all clusters that are -knit; that is, all clusters that are cliques except for a constant fraction of spurious edges (internal to the clique or leaving the clique). In particular, for any -knit cluster there is a cluster in the clustering output by such that , where denotes symmetric difference. This means one can use as a cluster-recovery algorithm even against adversarial perturbations of the input. Under stronger conditions on , we also show that via independent executions of one can recover exactly all large enough clusters with high probability. Next, we show a variant of that guarantees the desired number of queries only in expectation as opposed to deterministically. Our variant has the same expected clustering error as but makes significantly fewer queries than on some graphs. For example, when and there are similar pairs, the expected number of queries made by is only the square root of the queries made by .
We then move on to the study of trade-offs between queries and clustering error that ignore computational efficiency. Using VC theory, for all we prove that the strategy of minimizing disagreements on a random subset of pairs achieves, with high probability, clustering error bounded by , which reduces to when . We complement these results with two information-theoretic lower bounds showing that any algorithm issuing queries, possibly chosen in an adaptive way, must suffer an expected clustering error of at least , and at least when . Note that the upper bound matches the lower bound for . When , instead, there is still a gap between upper and lower bounds.
The VC theory approach can also be applied to any efficient approximation algorithm. The catch is that the approximation algorithm cannot ask the similarity of arbitrary pairs, but only of pairs included in the random sample of edges. The best known approximation factor in this case is demaine2006correlation, which gives a clustering error bound of with high probability. This was already observed in cesa2012correlation albeit in a slightly different context.
2 Related work
The closest work to ours is bonchi2013local, where they propose a different variant of KwikCluster. Their variant works by running KwikCluster on a random subset of nodes and storing the set of resulting pivots. Then, each node is assigned to the cluster identified by the pivot with smallest index and such that . If no such pivot is found, then becomes a singleton cluster. According to (bonchi2013local, Lemma 4.1), the expected clustering error for this variant is , which can be compared to our bound for by setting . On the other hand our algorithms are much simpler and significantly easier to analyze. This allows us to prove a set of additional important properties that our algorithms exhibit, such as cluster recovery and instance-dependent bounds on the expected number of queries. It is unclear whether these results are obtainable with the techniques of bonchi2013local.
The work mazumdar2017clustering considers the case in which there is a latent clustering with —see also tsourakakis2017predicting for the case where the latent clustering has two clusters only. The algorithm can issue pairwise binary queries to know whether and belong to the same cluster, for all pairs . However, the oracle is noisy: each query is answered incorrectly with some probability (which can depend on the correct answer), and the noise is persistent (repeated queries give the same noisy answer). Our setting is strictly harder because our oracle has a budget of adversarially incorrect answers.
The above setting is closely related to the stochastic block model (SBM), which is a well-studied model for cluster recovery abbe2015community; massoulie2014community; mossel2018proof. However, only a few works investigate SBMs with pairwise queries chen2016community. A more general model, including queries on triplets of points, is considered in vinayak2016crowdsourced.
A different model is edge classification also known as signed edge prediction. Here the algorithm is given a graph with hidden binary labels on the edges. The task is to predict the sign of all edges by querying as few labels as possible cesa2012correlation; chen2014clustering; chiang2014prediction. As before, the oracle can have a budget of incorrect answers, or a latent clustering with is assumed and the oracle’s answers are affected by persistent noise. Unlike correlation clustering, in edge classification the algorithm is not constrained to predict in agreement with a partition of the nodes. On the other hand, the algorithm cannot query arbitrary pairs of nodes in , but only those that form an edge in .
Preliminaries and notation.
is the initial set of nodes. A clustering is a partition of into disjoint clusters . An assignment of labels to pairs of nodes is specified by a function , where is the set of all pairs of distinct nodes in . Given a clustering and a labeling , the set of mistaken edges contains all pairs such that and belong to the same cluster of and all pairs such that and belong to different clusters of . The cost of a clustering is . The correlation clustering index is then , where the minimum is over all clusterings of . We often view as a graph where is an edge if and only if . In this case, for any subset we let be the subgraph of induced by , and for any we let be the neighbor set of .
Given a labeling and three distinct nodes , we say that is a bad triangle if and only if the labels on the three pairs are (the order is irrelevant). We denote by the set of all bad triangles in . Note that is at least the number of edge-disjoint bad triangles. If is a triangle and is one of its edges, we write and .
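To make the two definitions above concrete, the following sketch computes the cost of a clustering and enumerates bad triangles by brute force. It assumes the standard convention that a triangle is bad when two of its pairs are labeled +1 and the third is labeled -1; the names `clustering_cost`, `bad_triangles`, and the callable `label(u, v)` are hypothetical and introduced only for illustration.

```python
from itertools import combinations

def clustering_cost(nodes, clusters, label):
    """Number of mistaken pairs: +1 pairs split apart or -1 pairs put together."""
    cluster_of = {u: i for i, c in enumerate(clusters) for u in c}
    cost = 0
    for u, v in combinations(nodes, 2):
        together = cluster_of[u] == cluster_of[v]
        if (label(u, v) == 1) != together:
            cost += 1
    return cost

def bad_triangles(nodes, label):
    """All triples whose three labels are (+1, +1, -1) in some order (assumed convention)."""
    bad = []
    for u, v, w in combinations(nodes, 3):
        signs = sorted([label(u, v), label(v, w), label(u, w)])
        if signs == [-1, 1, 1]:
            bad.append((u, v, w))
    return bad
```

For instance, on three nodes with positive pairs (0, 1) and (1, 2) and a negative pair (0, 2), the single triangle is bad and every clustering makes at least one mistake, in line with the remark that the optimal cost is at least the number of edge-disjoint bad triangles.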
Due to space limitations, here most of our results are stated without proof, or with a concise proof sketch; the full proofs can be found in the supplementary material.
3 The algorithm
We introduce our active learning algorithm ACC (Active Correlation Clustering).
has the same recursive structure as KwikCluster. First, it starts with the full instance . Then, for each round it selects a random pivot , queries the similarities between and a subset of , removes and possibly other points from , and proceeds on the remaining residual subset . However, while KwikCluster queries for all , queries only other nodes (lines 6–7), where . Thus, while KwikCluster always finds all positive labels involving the pivot , can find them or not, with a probability that depends on . The function is called query rate function and dictates the tradeoff between the clustering cost and the number of queries , as we prove below. Now, if any of the aforementioned queries returns a positive label (line 8), then all the labels between and the remaining are queried and the algorithm operates as KwikCluster until the end of the recursive call; otherwise, the pivot becomes a singleton cluster which is removed from the set of nodes. Another important difference is that deterministically stops after recursive calls (line 2), declaring all remaining points as singleton clusters. The intuition is that with good probability the clusters not found within rounds are small enough to be safely disregarded. Since the choice of is delicate, we shall avoid trivialities by assuming is positive, integral, and smooth enough. Formally:
is a query rate function if and for all . This implies for all .
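The query-limited variant of KwikCluster described above can be sketched as follows. The sketch rests on assumptions that are not spelled out in the surviving text: the pivot samples f(m) of the m other residual nodes, and the round cap is f(n - 1) for n initial nodes. The names `acc`, `similar`, and `f` are hypothetical, and the recursion is written as a loop.

```python
import random

def acc(nodes, similar, f, rng=None):
    """Sketch of a query-limited pivot algorithm with query rate function `f`.

    Assumptions (not confirmed by the text): sample size f(m) among the m
    other residual nodes, and at most f(n - 1) rounds overall.
    """
    rng = rng or random.Random()
    remaining = list(nodes)
    n = len(remaining)
    max_rounds = f(n - 1) if n > 1 else 0     # assumed deterministic round cap
    clusters, n_queries, rounds = [], 0, 0
    while remaining and rounds < max_rounds:
        rounds += 1
        pivot = rng.choice(remaining)
        others = [v for v in remaining if v != pivot]
        k = min(len(others), f(len(others))) if others else 0
        sample = rng.sample(others, k)        # first batch of queries
        n_queries += k
        positives = {v for v in sample if similar(pivot, v)}
        if positives:                         # a positive label was found:
            sampled = set(sample)             # behave like KwikCluster this round
            for v in others:
                if v not in sampled:
                    n_queries += 1
                    if similar(pivot, v):
                        positives.add(v)
        # With no positive label in the sample, the pivot becomes a singleton.
        clusters.append([pivot] + [v for v in others if v in positives])
        remaining = [v for v in others if v not in positives]
    clusters.extend([v] for v in remaining)   # leftover nodes become singletons
    return clusters, n_queries
```

With `f = lambda m: m` every round queries all remaining nodes, so the sketch behaves like KwikCluster; smaller query rates trade queries for additional singleton mistakes.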
We can now state formally our bounds for .
For any query rate function and any labeling on nodes, the expected cost of the clustering output by satisfies
The number of queries made by is deterministically bounded as . In the special case for all , reduces to and achieves with .
Note that Theorem 1 gives an upper bound on the error achievable by using queries: since , the expected error is at most .
Look at a generic round , and consider a pair of points . The essence is that can misclassify in one of two ways. First, if , can choose as pivot a node such that . In this case, if the condition on line 8 holds, then will cluster together with and , thus mistaking . If instead , then could mistake by pivoting on a node such that and , and clustering together only and . Crucially, both cases imply the existence of a bad triangle . We charge each such mistake to exactly one bad triangle , so that no triangle is charged twice. The expected number of mistakes can then be bounded by using the packing argument of Ailon2008 for KwikCluster. Second, if then could choose one of them, say , as pivot , and assign it to a singleton cluster. This means the condition on line 8 fails. We can then bound the number of such mistakes as follows. Suppose has positive labels towards for some . Loosely speaking, we show that the check of line 8 fails with probability , in which case mistakes are added. In expectation, this gives mistakes. Over all rounds, this gives an overall . (The actual proof has to take into account that all the quantities involved here are not constants, but random variables.)
3.1 with Early Stopping Strategy
We can refine our algorithm so that, in some cases, it takes advantage of the structure of the input to significantly reduce the expected number of queries. To this end we see the input as a graph with edges corresponding to positive labels (see above). Suppose then contains a sufficiently small number of edges. Since deterministically performs rounds, it could make queries. However, with just queries one could detect that contains edges, and immediately return the trivial clustering formed by all singletons. The expected error would obviously be at most , i.e. the same as in Theorem 1. More generally, at each round with queries one can check if the residual graph contains at least edges; if the test fails, declaring all nodes in as singletons gives expected additional error . The resulting algorithm is a variant of ACC that we call ACCESS (ACC with Early Stopping Strategy). The pseudocode can be found in the supplementary material.
First, we show gives guarantees virtually identical to (only, with in expectation). Formally:
For any query rate function and any labeling on nodes, the expected cost of the clustering output by satisfies
Moreover, the expected number of queries performed by is .
Theorem 2 reassures us that is no worse than . In fact, if most edges of belong to relatively large clusters (namely, all but edges), then we can show uses far fewer queries than (in a nutshell, quickly finds all large clusters and then quits). The following theorem captures the essence. For simplicity we assume , i.e. is a disjoint union of cliques.
Suppose so is a union of disjoint cliques. Let be the cliques of in nondecreasing order of size. Let be the smallest such that , and let . Then makes in expectation queries.
4 Cluster recovery
In the previous section we gave bounds on , the expected total cost of the clustering. However, in applications such as community detection and the like, the primary objective is accurately recovering the latent clusters of the graph, that is, the sets of nodes that are “close” to cliques. This is usually referred to as cluster recovery. For this problem, an algorithm that outputs a good approximation of every latent cluster is preferable to an algorithm that minimizes globally. In this section we show that natively outputs clusters that are close to the latent clusters in the graph, thus acting as a cluster recovery tool. We also show that, for a certain type of latent clusters, one can amplify the accuracy of via independent executions and recover all clusters exactly with high probability.
To capture the notion of “latent cluster”, we introduce the concept of -knit set. As usual, we view as a graph with iff . Let be the edges in the subgraph induced by and be the edges between and .
A subset is -knit if and .
Suppose now we have a cluster as “estimate” of . We quantify the distance between and as the cardinality of their symmetric difference, . The goal is to obtain, for each -knit set in the graph, a cluster with for some small . We prove does exactly this. Clearly, we must accept that if is too small, i.e. , then will miss entirely. But, for , we can prove . We point out that the property of being -knit is rather weak for an algorithm, like , that is completely oblivious to the global topology of the cluster: all that tries to do is to blindly cluster together all the neighbors of the current pivot. In fact, consider a set formed by two disjoint cliques of equal size. This set would be close to -knit, and yet would never produce a single cluster corresponding to . Things can only worsen if we consider also the edges in , which can lead to assigning the nodes of to several different clusters when pivoting on . Hence it is not obvious that a -knit set can be efficiently recovered by .
Note that this task can be seen as an adversarial cluster recovery problem. Initially, we start with a disjoint union of cliques, so that . Then, an adversary flips the signs of some of the edges of the graph. The goal is to retrieve every original clique that has not been perturbed excessively. Note that we put no restriction on how the adversary can flip edges; therefore, this adversarial setting subsumes constrained adversaries. For example, it subsumes the stochastic block model HollandSBM where within-cluster and between-cluster edges are flipped according to some distribution.
We can now state our main cluster recovery bound for .
For every that is -knit, outputs a cluster such that .
The in the bound captures two different regimes: when is very close to , then independently of the size of , but when we need , i.e., must be large enough to be found by .
4.1 Exact cluster recovery via amplification
For certain latent clusters, one can get recovery guarantees significantly stronger than the ones given natively by (see Theorem 4). We start by introducing the notion of strongly -knit set. Recall that is the neighbor set of in the graph induced by the positive labels.
A subset is strongly -knit if, for every , we have and .
We immediately remark that alone does not give better guarantees on strongly -knit subsets than on -knit subsets. Suppose for example that each has . Then is strongly -knit, and yet when pivoting on any will inevitably produce a cluster with , since the pivot has edges to fewer than other nodes of .
Interestingly, we can overcome this limitation by running several times with a simple cluster tagging rule followed by a majority vote. Recall that . Then, we define the id of a cluster as the smallest node of . The min-tagging rule is the following: when forming , use its id to tag all of its nodes. Therefore, if is the id of , we will set for every . Consider now the following algorithm, called ACR (Amplified Cluster Recovery). First, ACR performs independent runs of on input , using the min-tagging rule on each run. In this way, for each we obtain tags , one for each run. Thereafter, for each we select the tag that has received most often, breaking ties arbitrarily. Finally, nodes with the same tag are clustered together. One can prove that, with high probability, this clustering contains all strongly -knit sets. In other words, with high probability ACR recovers all such latent clusters exactly. Formally, we prove:
Let and fix . If is run with , then the following holds with probability at least : for every strongly -knit with , the algorithm outputs a cluster such that .
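The min-tagging rule and the majority vote can be sketched as below. The sketch takes the clusterings produced by the independent runs as given; the function names `min_tags` and `amplified_recovery` are hypothetical, and ties are broken by whichever tag was seen first, which is one arbitrary choice among many.

```python
from collections import Counter, defaultdict

def min_tags(clusters):
    """Min-tagging rule: tag every node with the smallest node of its cluster."""
    return {u: min(c) for c in clusters for u in c}

def amplified_recovery(runs):
    """Majority vote over tags; `runs` is a list of clusterings (lists of clusters)."""
    votes = defaultdict(Counter)
    for clusters in runs:
        for u, tag in min_tags(clusters).items():
            votes[u][tag] += 1
    merged = defaultdict(list)
    for u, counter in votes.items():
        tag, _ = counter.most_common(1)[0]   # most frequent tag of node u
        merged[tag].append(u)                # nodes sharing a tag are clustered together
    return sorted(sorted(c) for c in merged.values())
```

As a small illustration, if node 2 is clustered with nodes 0 and 1 in two runs out of three and left alone in the third, its majority tag is 0 and the vote places it back with 0 and 1.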
It is not immediately clear that one can extend this result by relaxing the notion of strongly -knit set so to allow for edges between and the rest of the graph. We just notice that, in that case, every node could have a neighbor that is smaller than every node of . In this case, when pivoting on would tag with rather than with , disrupting .
5 A fully additive scheme
In this section, we introduce an (inefficient) fully additive approximation algorithm achieving cost with high probability using order of queries. When , suffices. Our algorithm combines uniform sampling with empirical risk minimization and is analyzed using VC theory.
First, note that CC can be formulated as an agnostic binary classification problem with binary classifiers associated with each clustering of (recall that denotes the set of all pairs of distinct elements ), and we assume iff and belong to the same cluster of . Let be the set of all such . The risk of a classifier with respect to the uniform distribution over is where is drawn u.a.r. from . It is easy to see that the risk of any classifier is directly related to , . Hence, in particular, . Now, it is well known (see, e.g., Shalev-Shwartz:2014:UML:2621980, Theorem 6.8) that we can minimize the risk to within an additive term of using the following procedure: query edges drawn u.a.r. from , where is the VC dimension of , and find the clustering such that makes the fewest mistakes on the sample. If there is with zero risk, then random queries suffice. A trivial upper bound on the VC dimension of is . The next result gives the exact value.
The VC dimension of the class of all partitions of elements is .
Let be the VC dimension of . We view an instance of CC as the complete graph with edges labelled by . Let be any spanning tree of . For any labeling , we can find a clustering of such that perfectly classifies the edges of : simply remove the edges with label in and consider the clusters formed by the resulting connected components. Hence because any spanning tree has exactly edges. On the other hand, any set of edges must contain at least one cycle. It is easy to see that no clustering makes consistent with the labeling that gives positive labels to all edges in the cycle but one. Hence . ∎
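The spanning-tree argument in the proof can be checked mechanically on small examples. The sketch below builds the clustering from the connected components of the positively labeled edges (using a small union-find) and tests every labeling of a given edge set; the names `clustering_from_edges` and `is_shattered` are hypothetical. Taking components is without loss of generality: any clustering consistent with the labels must keep each positive component inside one cluster and separate the endpoints of each negative edge.

```python
from itertools import product

def clustering_from_edges(nodes, edge_labels):
    """Clusters = connected components of the edges labeled +1."""
    parent = {u: u for u in nodes}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]    # path halving
            u = parent[u]
        return u

    for (u, v), sign in edge_labels.items():
        if sign == 1:
            parent[find(u)] = find(v)
    comps = {}
    for u in nodes:
        comps.setdefault(find(u), []).append(u)
    return list(comps.values())

def is_shattered(nodes, edges):
    """True iff every +1/-1 labeling of `edges` is realized by some clustering."""
    for signs in product([1, -1], repeat=len(edges)):
        labels = dict(zip(edges, signs))
        clusters = clustering_from_edges(nodes, labels)
        where = {u: i for i, c in enumerate(clusters) for u in c}
        if any((where[u] == where[v]) != (s == 1) for (u, v), s in labels.items()):
            return False
    return True
```

A tree on four nodes passes the check, while a triangle fails it because of the labeling that is positive on two edges and negative on the third.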
An immediate consequence of the above is the following.
There exists a randomized algorithm that, for all , finds a clustering satisfying with high probability while using queries. Moreover, if , then queries are enough to find a clustering satisfying .
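The sampling-plus-empirical-risk-minimization procedure behind this result can be illustrated by brute force on very small instances. The sketch below is only a toy: it enumerates all partitions, which is exponential in the number of nodes, and the names `set_partitions`, `erm_clustering`, and the callable `label(u, v)` are hypothetical. Pairs are sampled uniformly at random with replacement, and each distinct sampled pair is queried once.

```python
import random
from itertools import combinations

def set_partitions(items):
    """Yield every partition of the list `items` as a list of lists."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        for i in range(len(part)):           # put `first` into an existing block
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part               # or into a block of its own

def erm_clustering(nodes, label, n_samples, rng=None):
    """Return a partition with the fewest disagreements on a random sample of pairs."""
    rng = rng or random.Random()
    pairs = list(combinations(nodes, 2))
    sample = [rng.choice(pairs) for _ in range(n_samples)]
    queried = {p: label(*p) for p in set(sample)}    # one query per distinct pair
    best, best_err = None, None
    for part in set_partitions(list(nodes)):
        where = {u: i for i, c in enumerate(part) for u in c}
        err = sum((where[u] == where[v]) != (queried[(u, v)] == 1) for u, v in sample)
        if best_err is None or err < best_err:
            best, best_err = part, err
    return best, best_err
```

When the labeling is perfectly clusterable, the true partition has zero disagreements on any sample, so the returned empirical error is zero; the returned partition may still differ from the truth on pairs that were never sampled.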
6 Lower bounds
In this section we give two lower bounds on the expected clustering error of any (possibly randomized) algorithm. The first bound holds for , and applies to algorithms using a deterministically bounded number of queries. This bound is based on a construction from (cesa2015complexity, Lemma 11) and related to kernel-based learning.
For any such that is an even integer, and for every (possibly randomized) learning algorithm asking fewer than queries with probability , there exists a labeling on nodes such that and the expected cost of the algorithm is at least .
Our second bound relaxes the assumption on . It uses essentially the same construction as (bonchi2013local, Lemma 6.1), giving asymptotically the same guarantees. However, the bound of bonchi2013local applies only to a very restricted class of algorithms: namely, those where the number of queries involving any specific node is deterministically bounded. This rules out a vast class of algorithms, including KwikCluster, , and , where the number of queries involving a node is a function of the random choices of the algorithm. Our lower bound is instead fully general: it holds unconditionally for any randomized algorithm, with no restriction on what or how many pairs of points are queried.
For every such that and for every (possibly randomized) learning algorithm, there exists a labeling on nodes such that the algorithm has expected error whenever its expected number of queries satisfies .
Note that the bound can be put in the form for every by adapting the constants (see the proof). It is then easy to see that and are essentially optimal.
We tested on six datasets from NIPS2017_7161; NIPS2017_7054. Four of these datasets are obtained from real-world data and the remaining two are synthetic. In Figure 1 we show our results for one real-world dataset (cora, with 1879 nodes and 191 clusters) and one synthetic dataset (skew, with 900 nodes and 30 clusters). Similar results for the remaining four datasets can be found in the supplementary material. Every dataset provides a ground-truth partitioning of nodes with . To test the algorithm for , we perturbed the dataset by flipping the label of each edge independently with probability (so the results for refer to the original dataset with ).
Figure 1: Measured clustering cost against the number of queries. The circular outliers mark the performance of KwikCluster.
Figure 1 shows the measured clustering cost against the number of queries performed by . For each value of , each curve in the plot is obtained by setting the query rate to for distinct values of . For each value of we ran fifty times. The curve shows the average value of (standard deviations, which are small, are omitted to avoid cluttering the figure). The circular outlier marker shows the performance of KwikCluster. On both datasets, the error of shows a nice sublinear drop as the number of queries increases, quickly approaching the performance of KwikCluster. Ignoring lower order terms, Theorem 1 gives an expected cost bounded by about for the case (recall that is unknown). Placing this curve in our plots shows that is a factor of two or three better than the theoretical bound (which is not shown in Figure 1 due to scaling issues).
Appendix A Probability bounds
We give Chernoff-type probability bounds, which can be found in, e.g., Dubhashi2009, and which we repeatedly use in our proofs. Let be binary random variables. We say that are non-positively correlated if for all we have:
The following holds:
Let be independent or, more generally, non-positively correlated binary random variables. Let and . Then, for any , we have:
Appendix B Supplementary Material for Section 3
B.1 Pseudocode of
B.2 Proof of Theorem 1
We refer to the pseudocode of (Algorithm 2). We use to denote the set of remaining nodes at the beginning of the -th recursive call. Hence . If the condition in the if statement on line 8 is not true, then is a singleton cluster. We denote by the set of nodes that are output as singleton clusters.
Let be the set of mistaken edges for the clustering output by and let be the cost of this clustering. Note that, in any recursive call, misclassifies an edge if and only if is part of a bad triangle whose third node is chosen as pivot and does not become a singleton cluster, or if and at least one of becomes a singleton cluster. More formally, misclassifies an edge if and only if one of the following three disjoint events holds:
There exists and a bad triangle such that and .
There exists such that with and .
The algorithm stops after calls having removed neither nor , and .
Therefore the indicator variable for the event “ is mistaken” is:
The expected cost of the clustering is therefore:
We proceed to bound the three terms separately.
Fix an arbitrary edge . Note that, if occurs, then is unique, i.e. exactly one bad triangle in satisfies the definition of . Each occurrence of can thus be charged to a single bad triangle . We may thus write
where . Let us then bound . Let . We use the following fact extracted from the proof of (Ailon2008, Theorem 6.1). If is a set of weights on the bad triangles such that for all , then . Given and , let be the event corresponding to being the first triangle in the set such that and for some . Now if holds then holds and no other for holds. Therefore
If holds for some , then it cannot hold for any other because implies that for all we have implying . Hence, given that holds for , if holds too, then it holds for the same by construction. This implies that because chooses the pivot u.a.r. from the nodes in . Thus, for each we can write
Choosing we get .
In the proof of KwikCluster, the condition was ensured by considering events . Indeed, in KwikCluster the events are disjoint, because holds iff is the first and only triangle in whose node opposite to is chosen as pivot. For this is not true because a pivot can become a singleton cluster, which does not necessarily cause to hold.
For any , let . We have that
Taking expectations with respect to the randomization of ,
For any round , let be the sequence of random draws made by the algorithm before round . Then if either , or and . Otherwise,