Correlation clustering is a graph clustering problem where we are given similarity or dissimilarity information for pairs of vertices. The input is a graph G on n vertices. Edges of G are labeled as similar (positive) or dissimilar (negative). The clustering objective is to partition the vertices into clusters such that edges labeled ‘positive’ remain within clusters and ‘negative’ edges go across clusters. However, this similarity/dissimilarity information may be inconsistent with this objective. For example, there may exist vertices u, v, w such that edges (u, v) and (u, w) are labeled ‘positive’ whereas edge (v, w) is labeled ‘negative’. In this case, it is not possible to come up with a clustering of these vertices that would agree with all the edge labels. The objective of correlation clustering is to come up with a clustering that minimises disagreement or maximises agreement with the edge labels given as input. The minimisation version of the problem, known as MinDisAgree, minimises the sum of the number of negative edges present inside clusters and the number of positive edges going across clusters. Similarly, the maximisation version is known as MaxAgree, where the objective is to maximise the sum of the number of positive edges present inside clusters and the number of negative edges going across clusters. Unlike k-means or k-median clustering, in correlation clustering there is no restriction on the number of clusters formed by the optimal clustering. When the number of optimal clusters is given to be at most k, these problems are known as MinDisAgree[k] and MaxAgree[k] respectively.
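The two objectives can be made concrete with a small sketch. The function names and the dictionary-based representation of labels and clusterings below are illustrative assumptions, not notation from the paper. The example uses the inconsistent triangle described above and checks by brute force that no clustering of it agrees with all three labels.

```python
from itertools import product

def count_disagreements(labels, clustering):
    """labels: dict mapping a vertex pair (a, b) to +1 or -1 (hypothetical format).
    clustering: dict mapping each vertex to a cluster id."""
    disagreements = 0
    for (a, b), sign in labels.items():
        same_cluster = clustering[a] == clustering[b]
        # A positive edge disagrees when cut; a negative edge disagrees when kept inside.
        if (sign > 0) != same_cluster:
            disagreements += 1
    return disagreements

def count_agreements(labels, clustering):
    # Every labeled pair is either an agreement or a disagreement.
    return len(labels) - count_disagreements(labels, clustering)

# The inconsistent triangle: (u, v) and (u, w) positive, (v, w) negative.
triangle = {("u", "v"): +1, ("u", "w"): +1, ("v", "w"): -1}
vertices = ["u", "v", "w"]

# Brute force over all assignments of cluster ids (covers every partition).
min_cost = min(
    count_disagreements(triangle, dict(zip(vertices, ids)))
    for ids in product(range(3), repeat=3)
)
print(min_cost)
```

Since the number of agreements and disagreements always sum to the number of labeled pairs, the same clustering is optimal for both objectives, although the approximation guarantees for the two versions differ.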
Bansal et al. [BBC2004] gave a constant-factor approximation algorithm for MinDisAgree and a PTAS for MaxAgree. Subsequently, Charikar et al. [CGW2005] improved the approximation guarantee for MinDisAgree to 4 and showed that MinDisAgree is APX-hard. These results are for correlation clustering on complete graphs, since on general graphs the problem is known to be at least as hard as the minimum multicut problem [CGW2005]. Since MinDisAgree is APX-hard [CGW2005], additional assumptions were introduced to obtain better results. For example, [MS2010, MMV2015] studied MinDisAgree where the input is noisy and comes from a semi-random model. When k is given as part of the input, Giotis and Guruswami [GG2006] gave a PTAS for MinDisAgree[k].
Recently there have been some works [MM2016, Bl2016] with a beyond-worst-case flavour where polynomial time algorithms for NP-hard problems have been designed under some stability assumptions. Ashtiani et al. [AKB2016] considered one such stability assumption called γ-margin. They introduced a semi-supervised active clustering (SSAC) framework and, within this framework, gave a probabilistic polynomial time algorithm for k-means on datasets that satisfy the γ-margin property. More specifically, their SSAC framework involves a query oracle that answers queries of the form “given any two vertices, do they belong to the same optimal cluster?”. The query oracle responds with a Yes/No answer where these answers are assumed to be consistent with some fixed optimal solution. In this framework, they studied the query complexity for polynomial time algorithms for k-means on datasets satisfying the γ-margin property. Ailon et al. [ABJK2017] extended this work to study query complexity bounds for (1+ε)-approximation for k-means in the SSAC framework for any small ε > 0 without any stability assumption on the dataset. They gave almost matching upper and lower bounds on the number of queries for (1+ε)-approximation of the k-means problem in the SSAC framework.
In this work, we study MinDisAgree[k] in the SSAC framework, where the optimal clustering has at most k clusters, and give upper and lower bounds on the number of same-cluster queries for (1+ε)-approximation for correlation clustering for any ε > 0. We also give upper bounds for MaxAgree[k]. Our algorithm is based on the PTAS by Giotis and Guruswami [GG2006] for MinDisAgree[k]. The algorithm by Giotis and Guruswami randomly samples a subset of vertices and considers all possible ways of partitioning this subset into k clusters; for every such k-partitioning, it clusters the rest of the vertices greedily. Every remaining vertex is assigned to a cluster that maximizes its agreement with the edge labels. Their main result was the following.
Theorem 1.1 (Giotis and Guruswami [GG2006])
For every , there is a PTAS for with running time .
Since Giotis and Guruswami considered all possible ways of partitioning subset into clusters, their running time has exponential dependence on . Here, we make the simple observation that within the SSAC framework we can overcome this exponential dependence on by making same-cluster queries to the oracle. The basic idea is to randomly sample a subset of vertices as before and partition it optimally into clusters by making same-cluster queries to the oracle. Note that by making at most same-cluster queries, one can partition optimally into clusters. Once we have subset partitioned as in the optimal clustering (a key step needed in the analysis of Giotis and Guruswami) we follow their algorithm and analysis for -approximation for . Here is our main result for in the SSAC framework. We obtain similar results for .
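The partitioning step can be sketched as follows. This is a minimal illustration rather than the paper's pseudocode: the function `partition_with_queries` and the callable `same_cluster` (standing in for the oracle) are hypothetical names. Each sampled vertex is compared with one representative of every cluster found so far, so if the optimal clustering has at most k clusters, no vertex triggers more than k queries.

```python
def partition_with_queries(sample, same_cluster):
    """Partition `sample` according to the oracle's hidden clustering.

    same_cluster(u, v) is assumed to return True iff u and v lie in the
    same optimal cluster, consistently with one fixed optimal solution.
    Returns the list of clusters and the number of queries made.
    """
    clusters = []   # clusters[i][0] acts as the representative of cluster i
    queries = 0
    for v in sample:
        for cluster in clusters:
            queries += 1
            if same_cluster(cluster[0], v):
                cluster.append(v)
                break
        else:
            # v matched no known representative, so it opens a new cluster.
            clusters.append([v])
    return clusters, queries

# Toy oracle backed by a hidden clustering with k = 3 clusters.
hidden = {0: "a", 1: "b", 2: "a", 3: "c", 4: "b", 5: "a"}
oracle = lambda u, v: hidden[u] == hidden[v]

clusters, queries = partition_with_queries(list(hidden), oracle)
print(clusters, queries)
```

In this toy run the number of queries stays below k times the sample size, which is the worst case of this particular strategy.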
Theorem 1.2 (Main result: Upper bound)
Let and . There is a (randomized) algorithm in the SSAC framework for that uses same-cluster queries, runs in time and outputs a -approximate solution with high probability.
We complement our upper bound result by providing a lower bound on the number of queries in the SSAC framework for any efficient -approximation algorithm for for any . Our lower bound result is conditioned on the Exponential Time Hypothesis (ETH) [IP01, IPZ01]. Our lower bound result implies that the number of queries depends on the number of optimal clusters k. Our main result with respect to the query lower bound is given as follows.
Theorem 1.3 (Main result: Lower bound)
Given that Exponential Time Hypothesis (ETH) holds, there exists a constant such that any -approximation algorithm for in the SSAC framework that runs in polynomial time makes same-cluster queries.
Exponential Time Hypothesis is the following statement regarding the hardness of the - problem.
Exponential Time Hypothesis (ETH)[IP01, IPZ01]: There does not exist an algorithm that can decide whether any - formula with clauses is satisfiable with running time .
Note that our query lower bound result is a simple corollary of the following theorem that we prove.
If the Exponential Time Hypothesis (ETH) holds, then there exists a constant such that any -approximation algorithm for requires time.
The above lower bound statement may be of independent interest. It was already known that is -hard. Our result is a non-trivial addition to the understanding of the hardness of the correlation clustering problem. Given that our query upper bound result follows from simple observations about the algorithms of Giotis and Guruswami, our lower bound results may be regarded as the primary contribution of this work. So, we first give our lower bound results in the next section and the upper bound results in Section 3. However, before we start discussing our results, here is a brief discussion of related work.
There have been numerous works on clustering problems in semi-supervised settings. Balcan and Blum [BB2008] proposed an interactive framework for clustering which uses ‘split/merge’ queries. In this framework, given any arbitrary clustering as a query, the oracle specifies that some cluster should be split or that two clusters should be merged. Awasthi et al. [ABV2014] developed a local clustering algorithm which uses these split/merge queries. One versus all queries for clustering were studied by Voevodski et al. [VBRTX2014]. The oracle, on a query , returns distances from to all points in . The authors provided a clustering, close to optimal -median clustering, with only such queries on instances satisfying -approximation stability property [BBG2013]. Fomin et al. [FKPPV2014] gave a conditional lower bound for the cluster editing problem which can also be stated as a decision version of the correlation clustering problem. In the -cluster editing problem, given a graph and a budget , and an integer , the objective is to decide whether can be transformed into a union of clusters (disjoint cliques) using at most edge additions and deletions. Assuming ETH, they showed that there exists for some such that there is no algorithm that decides in time whether can be transformed into a union of cliques using at most adjustments (edge additions and deletions). It is not clear whether their exact reduction can be modified into an approximation preserving reduction to obtain results similar to what we have here. Mazumdar and Saha [MS2017] studied the correlation clustering problem in a similar setting where edge similarity and dissimilarity information are assumed to come from two distributions. Given such an input, they studied the cluster recovery problem in the SSAC framework, and gave upper and lower bounds on the query complexity. Their lower bound results are information theoretic in nature. We are, however, interested in approximate solutions for the correlation clustering problem.
2 Query Lower Bounds
In this section, we obtain a lower bound on the number of same-cluster queries that any FPTAS within the SSAC framework needs to make for the problem . We derive a conditional lower bound for the minimum number of queries under the Exponential Time Hypothesis (ETH) assumption. Some such conditional lower bound results based on ETH can be found in [M16]. We prove the following main theorem in this section.
Theorem 2.1
If the Exponential Time Hypothesis (ETH) holds, then there exists a constant such that any -approximation algorithm for requires time.
The above theorem gives a proof of Theorem 1.3.
Proof (Proof of Theorem 1.3)
Let us assume that there exists a query-FPTAS that makes only same-cluster queries. Then, by considering all possible answers for these queries and picking the best solution, one can solve the problem in time which contradicts Theorem 2.1.
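The simulation argument in this proof can be illustrated with a short sketch. Everything here is a toy assumption: `simulate_without_oracle`, `toy_query_algorithm` and `cost` are hypothetical names, and the query algorithm is a three-vertex example rather than the algorithm of this paper. The point is only that an algorithm making at most q queries can be run without an oracle by trying all 2^q answer sequences and keeping the cheapest output.

```python
from itertools import product

def simulate_without_oracle(query_algorithm, cost, max_queries):
    """Run `query_algorithm` once per possible answer sequence and keep
    the solution of minimum cost. Assumes the algorithm makes at most
    `max_queries` oracle calls and always returns some clustering."""
    best = None
    for answers in product([False, True], repeat=max_queries):
        replies = iter(answers)

        def fake_oracle(u, v, replies=replies):
            # Replay the guessed answers in order, ignoring the actual pair.
            return next(replies, False)

        solution = query_algorithm(fake_oracle)
        if best is None or cost(solution) < cost(best):
            best = solution
    return best

# Toy instance: the inconsistent triangle with edge labels +1 / -1.
labels = {("u", "v"): +1, ("u", "w"): +1, ("v", "w"): -1}

def cost(clustering):
    # Number of disagreements of the clustering with the labels.
    return sum(1 for (a, b), s in labels.items()
               if (s > 0) != (clustering[a] == clustering[b]))

def toy_query_algorithm(oracle):
    # Places v and w relative to u using two same-cluster queries.
    clustering = {"u": 0}
    clustering["v"] = 0 if oracle("u", "v") else 1
    clustering["w"] = 0 if oracle("u", "w") else 2
    return clustering

best = simulate_without_oracle(toy_query_algorithm, cost, max_queries=2)
print(best, cost(best))
```

One of the enumerated answer sequences coincides with the true oracle's answers, so the best solution found is at least as good as the one the query algorithm would return with the real oracle. The running time grows by a factor of 2^q, which is what the contradiction with Theorem 2.1 relies on.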
In the remainder of this section, we give the proof of Theorem 2.1. First, we state the Exponential Time Hypothesis. Our lower bound results are derived assuming this hypothesis.
Hypothesis 1 (Exponential Time Hypothesis (ETH)[IP01, IPZ01]): There does not exist an algorithm that decides whether any - formula with clauses is satisfiable with running time .
Since we would like to obtain lower bounds in the approximation domain, we will need a gap version of the above hypothesis. The following version of the PCP theorem is useful in obtaining a gap version of ETH.
Theorem 2.2 (Dinur’s PCP Theorem [D07])
For some constants , there exists a polynomial-time reduction that takes a - formula with clauses as input and produces one - formula with clauses [Footnote 1: Every clause in an - formula has exactly literals.] such that
if is satisfiable, then is satisfiable, and
if is unsatisfiable, then , and
each variable in appears in at most clauses.
where is the maximum fraction of clauses of which are satisfiable by any assignment.
The hypothesis below follows from ETH and the above Theorem 2.2, and will be useful for our analysis.
Hypothesis 2: There exist constants such that the following holds: There does not exist an algorithm that, given a - formula with clauses and each variable appearing in at most clauses, distinguishes whether is satisfiable or , and runs in time better than .
The lemma given below trivially follows from Dinur’s PCP Theorem 2.2.
Lemma 1
If Hypothesis 1 holds, then so does Hypothesis 2.
We now give a reduction from the gap version of the - problem to the gap version of the - problem. A problem instance of - consists of a set of clauses (each containing exactly literals) and a clause is said to be satisfied by an assignment iff at least one and at most two literals in the clause are true (NAE stands for “Not All Equal”). For any instance , we define to be the maximum fraction of clauses that can be satisfied in the “not all equal” sense by an assignment. Note that this is different from which is equal to the maximum fraction of clauses that can be satisfied (in the usual sense). First, we reduce - to - and then - to -.
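The difference between the two notions of satisfaction can be checked with a small sketch. The encoding of literals as signed integers and the function names are assumptions made for illustration. A clause is NAE-satisfied when it contains at least one true and at least one false literal, which for clauses of three literals is the same as "at least one and at most two literals are true".

```python
from itertools import product

def literal_values(clause, assignment):
    # A literal i > 0 denotes variable i; a literal -i denotes its negation.
    return [assignment[abs(lit)] if lit > 0 else not assignment[abs(lit)]
            for lit in clause]

def nae_satisfied(clause, assignment):
    values = literal_values(clause, assignment)
    return any(values) and not all(values)

def nae_fraction(clauses, assignment):
    # Fraction of clauses satisfied in the "not all equal" sense.
    return sum(nae_satisfied(c, assignment) for c in clauses) / len(clauses)

def max_nae_fraction(clauses, num_vars):
    # Exhaustive search over all assignments; only meant for tiny instances.
    return max(
        nae_fraction(clauses, dict(zip(range(1, num_vars + 1), bits)))
        for bits in product([False, True], repeat=num_vars)
    )

all_true = {1: True, 2: True, 3: True}
clauses = [(1, 2, 3), (-1, -2, -3)]
print(nae_fraction(clauses, all_true), max_nae_fraction(clauses, 3))
```

Under the all-true assignment the clause (1, 2, 3) is satisfied in the usual sense but not in the NAE sense, which is why the two maximum fractions are tracked separately in the reductions.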
Lemma 2
Let and . There is a polynomial time reduction that given an instance of - with clauses with each variable appearing in at most clauses, produces an instance of - with clauses such that
If , then , and
If , then , and
Each variable in appears in at most clauses.
We construct in the following manner: for every variable in , we introduce two variables and . We will use iff for every in our reduction. For every clause (with being literals), we introduce the following four NAE clauses in :
For any index (say ), if (that is, the variable is in the positive form), then and . On the other hand, if , then and . So for example, for the clause in , we have the following four clauses:
Note that property (3) of the lemma holds due to our construction. For property (1), we argue that for any satisfying assignment for , the assignment of variables in as per the rule iff is a satisfying assignment of (in the NAE sense). This is because for every literal that makes a clause in true, the two corresponding copies and satisfy all the four clauses (in the NAE sense). For property (2), we prove the contrapositive. Suppose there is an assignment to the variables in that satisfies at least fraction of the clauses. We will argue that the assignment to the variables of as per the rule iff satisfies at least fraction of clauses of . First, note that for every set of four clauses in created from a single clause of , either three of them are satisfied or all four are satisfied (whatever the assignment of the variables). Let be the number of these 4-sets where all 4 clauses are satisfied and let be the number of these 4-sets where 3 clauses are satisfied, where . Then we have which implies that . Note that for any of the 4-sets where all 4 clauses are satisfied, the corresponding clause in is satisfied with respect to the assignment as per rule iff (since at least one of the pairs will have opposite values). So, the fraction of the clauses satisfied in is at least .
Lemma 3
Let and . There is a polynomial time reduction that given an instance of - with clauses and with each variable appearing in at most clauses, produces an instance of - with clauses such that:
If , then , and
If , then .
Each variable in appears in at most clauses.
For every clause in , we construct the following four clauses in (let us call it a 4-set): , introducing new variables . Property (3) trivially holds for this construction. For every satisfying assignment for , there is a way to set the clause variables for every such that all four clauses in the 4-set corresponding to clause are satisfied. So, property (1) holds. We show property (2) using contraposition. Consider any assignment of that satisfies at least fraction of the clauses. Let denote the number of 4-sets such that as per this assignment out of clauses are satisfied. Then, we have . This implies that: which implies that . Now, note that for any 4-set such that all four clauses are satisfied, the corresponding clause in is satisfied by the same assignment to the variables. This implies that there is an assignment that makes at least fraction of clauses true in .
We come up with the following hypothesis which holds given that Hypothesis 2 holds, and is crucial for our analysis.
Hypothesis 3: There exist constants such that the following holds: There does not exist an algorithm that, given a - formula with clauses with each variable appearing in at most clauses, distinguishes whether or , and runs in time better than .
Lemma 4
If Hypothesis 2 holds, then so does Hypothesis 3.
We now give a reduction from the gap version of - to the gap version of monotone - that has no negative variables. Note that because of the NAE (not all equal) property, setting all variables to does not necessarily satisfy the formula.
Lemma 5
Let and . There is a polynomial time reduction that given an instance of - with clauses and with each variable appearing in at most clauses, produces an instance of monotone - with clauses such that:
If , then , and
If , then .
Each variable in appears in at most clauses.
We construct in the following manner: Substitute all positive literals of the variable with and all negative literals with for new variables . Also, for every variable , add the following clauses:
where for are new variables. Note that the only way to satisfy all the above clauses is to have . Let denote the total number of clauses in . So, . Also, from the construction, each variable in appears in at most clauses. This proves property (3). Property (1) follows from the fact that for any satisfying assignment for , there is a way to extend this assignment to variables in such that all clauses are satisfied. For all , and . All the new variables can be set so as to make all the new clauses satisfied.
We argue property (2) using contraposition. Suppose there is an assignment to variables in that makes at least fraction of clauses satisfied. First, note that there is also an assignment that makes at least fraction of the clauses satisfied and in which for all , . This is because out of of the following clauses can be satisfied when :
However, if we flip one of , then the number of above clauses satisfied can be and we might lose out on at most clauses since a variable appears in at most clauses in . Let be the number of clauses corresponding to the original clauses that are satisfied with this assignment. So, we have which gives:
This completes the proof of the lemma.
We come up with the following hypothesis which holds given that Hypothesis 3 holds.
Hypothesis 4: There exist constants such that the following holds: There does not exist an algorithm that, given a monotone - formula with clauses with each variable appearing in at most clauses, distinguishes whether or , and runs in time better than .
The lemma below follows easily from Lemma 5 above.
Lemma 6
If Hypothesis 3 holds, then so does Hypothesis 4.
We provide a reduction from the gap version of monotone - to a gap version of -colorability of -uniform bounded degree hypergraph.
Lemma 7
Let and . There exists a polynomial time reduction that given a monotone - instance with clauses and with every variable appearing in at most clauses, outputs an instance of -uniform hypergraph with vertices and hyperedges and with bounded degree such that if is satisfiable, then is -colorable, and if at most -fraction of clauses of are satisfiable, then any -coloring of would have at most -fraction of edges that are bichromatic.
The reduction constructs a hypergraph as follows. The set of vertices corresponds to the set of variables (all of them positive literals) of the monotone - instance . The set of edges corresponds to the set of clauses, all of which have literals, and therefore every hyperedge is of size . The resulting hypergraph is -uniform, and since every variable appears in at most clauses, the hypergraph is of bounded degree , and and . If there exists a satisfying assignment for , then every edge in is bichromatic and the hypergraph would be -colorable, and if at most -fraction of clauses are satisfiable by any assignment, then at most -fraction of edges of are bichromatic.
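The correspondence used in this reduction can be sketched in a few lines. The names `clauses_to_hypergraph` and `bichromatic_fraction` are illustrative, and the sketch assumes that every clause of the monotone formula consists of distinct variables, so that each clause becomes a hyperedge of the same size. A truth assignment is read as a 2-coloring, and a clause is NAE-satisfied exactly when its hyperedge receives both colors.

```python
def clauses_to_hypergraph(clauses):
    """Map a monotone NAE formula (clauses are tuples of variable ids,
    no negations) to a hypergraph: variables become vertices and
    clauses become hyperedges."""
    vertices = sorted({x for clause in clauses for x in clause})
    hyperedges = [frozenset(clause) for clause in clauses]
    return vertices, hyperedges

def bichromatic_fraction(hyperedges, coloring):
    # A hyperedge is bichromatic when its vertices use both colors.
    bichromatic = sum(1 for e in hyperedges
                      if len({coloring[v] for v in e}) == 2)
    return bichromatic / len(hyperedges)

def max_degree(vertices, hyperedges):
    # Largest number of hyperedges that any single vertex belongs to.
    return max(sum(1 for e in hyperedges if v in e) for v in vertices)

clauses = [(1, 2, 3), (2, 3, 4), (1, 3, 4)]
vertices, hyperedges = clauses_to_hypergraph(clauses)

proper = {1: 0, 2: 1, 3: 0, 4: 1}     # corresponds to a NAE-satisfying assignment
constant = {v: 0 for v in vertices}   # corresponds to the all-equal assignment
print(bichromatic_fraction(hyperedges, proper),
      bichromatic_fraction(hyperedges, constant),
      max_degree(vertices, hyperedges))
```

The degree of a vertex equals the number of clauses in which the corresponding variable occurs, which is how the bounded-occurrence property of the formula carries over to the hypergraph.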
We come up with the following hypothesis which holds given that Hypothesis 4 holds.
Hypothesis 5: There exist constants such that the following holds: There does not exist an algorithm that, given a -uniform hypergraph with vertices and where every vertex has degree at most , distinguishes whether is bichromatic or at most -fraction of edges are bichromatic, and runs in time better than .
The lemma below follows easily from Lemma 7 above.
Lemma 8
If Hypothesis 4 holds, then so does Hypothesis 5.
We now give a reduction from -colorability in -uniform hypergraph with constant bounded degree to a correlation clustering instance on a complete graph . We use the reduction as given in [CGW2005] for our purposes.
Lemma 9 ([CGW2005])
Let . There is a polynomial-time reduction that, given a -uniform hypergraph with vertices and where each vertex appears in at most hyperedges, outputs an instance of the correlation clustering problem where the graph has vertices and edges, with edges in labeled as ‘positive’ and all the other edges in the complete graph on vertices labeled as ‘negative’, such that the following holds:
If is 2-colorable, then the cost of the optimal correlation clustering is , and
If at most -fraction of hyperedges of are bi-chromatic, then the optimal cost of correlation clustering is at least , where is some constant.
We come up with the following hypothesis which holds given that Hypothesis 5 holds.
Hypothesis 6: There exist constants such that the following holds: There does not exist a -factor approximation algorithm for the problem that runs in time better than .
The lemma below follows easily from Lemma 9 given above.
Lemma 10
If Hypothesis 5 holds, then so does Hypothesis 6.
3 Algorithms for MaxAgree[k] and MinDisAgree[k] in SSAC Framework
In this section, we give -approximation algorithms for the and problems within the SSAC framework for any .
In this section, we will discuss a query algorithm that gives -approximation to the problem. The algorithm that we will discuss is closely related to the non-query algorithm for by Giotis and Guruswami. See Algorithm MaxAg() in [GG2006]. In fact, except for a few changes, this section will look extremely similar to Section 3 in [GG2006]. Given this, it will help if we mention the high-level idea of the Giotis-Guruswami algorithm and point out the changes that can be made within the SSAC framework to obtain the desired result. The algorithm of Giotis and Guruswami proceeds in iterations, where . The given dataset is partitioned into equal parts , and in the iteration, points in are assigned to one of the clusters. In order to cluster in the iteration, the algorithm samples a set of data points , and for all possible -partitions of , it checks the agreement of a point with the clusters of . Suppose for a particular clustering of , the agreement of vertices in is maximised. Then the vertices in are clustered by placing them into the cluster that maximises their agreement with respect to . Trying out all possible -partitions of is an expensive operation in the Giotis-Guruswami algorithm (since the running time becomes ). This is where the same-cluster queries help. Instead of trying out all possible -partitions of , we can make use of the same-cluster queries to find a single appropriate -partition of in the iteration. This is the clustering that matches the “hybrid” clustering of Giotis and Guruswami. So, the running time of the iteration improves from to . Moreover, the number of same-cluster queries made in the iteration is , which makes the total number of same-cluster queries . The theorem is given below. The proof of this theorem is omitted since it follows directly from Giotis and Guruswami (see Theorem 3.2 in [GG2006]).
There is a query algorithm that behaves as follows: On input and a labelling of the edges of a complete graph with vertices, with probability at least , algorithm outputs a clustering of the graph such that the number of agreements induced by this -clustering is at least , where is the optimal number of agreements induced by any -clustering of . The running time of the algorithm is . Moreover, the number of same-cluster queries made by is .
Using the simple observation that (see proof of Theorem 3.1 in [GG2006]), we get that the above query algorithm gives -approximation guarantee in the SSAC framework.
In this section, we provide a (1+ε)-approximation algorithm for MinDisAgree[k] for any small ε > 0. Giotis and Guruswami [GG2006] provided a (1+ε)-approximation algorithm for MinDisAgree[k]. In this work, we extend their algorithm to make it work in the SSAC framework with the aid of same-cluster queries, and thereby improve the running time of the algorithm considerably. Our query algorithm will be closely based on the non-query algorithm of Giotis and Guruswami. In fact, except for a small (but crucial) change, the algorithms are the same. So, we begin by discussing the main ideas and the result by Giotis and Guruswami.
Lemma 11 (Theorem 4.7 in [GG2006])
For every and , there is a -approximation algorithm for with running time .
The algorithm by Giotis and Guruswami builds on the following ideas. First, from the discussion in the previous section, we know that there is an FPTAS within the SSAC framework for . Therefore, unless , the optimal value for is small (OPT, for some small ), the complement solution for would give a valid -approximate solution for . Since is small, this implies that the optimal value for is large, which means that for any random vertex in graph , a lot of edges incident on agree with the optimal clustering. Suppose we are given a random subset of vertices that are optimally clustered , and let us assume that is sufficiently large. Since most of the edges in are in agreement with the optimal clustering, we would be able to assign vertices in to their respective clusters greedily. For any arbitrary , assign to for which the number of edges that agree is maximized. Giotis and Guruswami observed that clustering vertices in in this manner would work with high probability when these vertices belong to large clusters. For vertices in small clusters, we may not be able to decide assignments with high probability. They carry out this greedy assignment of vertices in into clusters , and filter out clusters that are sufficiently large and recursively run the same procedure on the union of small clusters.
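The greedy assignment step can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact procedure of [GG2006]: the function name `greedy_assign`, the `label` callable and the scoring rule (positive edges into the chosen sample cluster plus negative edges into the other sample clusters) are choices made here to show the idea of maximizing agreement with an already-clustered sample.

```python
def greedy_assign(vertex, sample_clusters, label):
    """Return the index of the sample cluster that maximizes the number
    of edges between `vertex` and the sample that agree with placing
    `vertex` in that cluster. label(u, v) is assumed to return +1 or -1."""
    total_negative = sum(1 for cluster in sample_clusters
                         for s in cluster if label(vertex, s) < 0)
    best_index, best_score = 0, -1
    for i, cluster in enumerate(sample_clusters):
        positive_inside = sum(1 for s in cluster if label(vertex, s) > 0)
        negative_inside = len(cluster) - positive_inside
        # Agreements: positive edges kept inside cluster i, plus negative
        # edges that go across to the remaining sample clusters.
        score = positive_inside + (total_negative - negative_inside)
        if score > best_score:
            best_index, best_score = i, score
    return best_index

# Toy instance: the sample {0, 1, 2, 3} is already partitioned into two clusters.
sample_clusters = [[0, 1], [2, 3]]
signs = {
    frozenset({9, 0}): +1, frozenset({9, 1}): +1,
    frozenset({9, 2}): -1, frozenset({9, 3}): -1,
    frozenset({8, 0}): -1, frozenset({8, 1}): -1,
    frozenset({8, 2}): +1, frozenset({8, 3}): -1,
}
label = lambda u, v: signs[frozenset({u, v})]

print(greedy_assign(9, sample_clusters, label),
      greedy_assign(8, sample_clusters, label))
```

Vertex 8 has one noisy label towards its cluster and is still placed correctly, which mirrors the intuition that the rule is reliable when the sample contains many vertices from the vertex's own cluster.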
For any randomly sampled subset of vertices, Giotis and Guruswami try out all possible ways of partitioning into clusters in order to partition optimally into clusters . This ensures that at least one of the partitions matches the optimal partition. However, this exhaustive way of partitioning imposes a huge burden on the running time of the algorithm. In fact, their algorithm runs in time. Using access to the same-cluster query oracle, we can obtain a significant reduction in the running time of the algorithm. We query the oracle with pairs of vertices in and, since query answers are assumed to be consistent with some unique optimal solution, optimal clustering of vertices in is accomplished using at most same-cluster queries. Once we have a -partitioning of sample that is consistent with the optimal -clusters, we follow the remaining steps of [GG2006]. The same approximation analysis as in [GG2006] follows for the query algorithm. For completeness we give the modified algorithm in Figure 3.1. Let the oracle take any two vertices and as input and return ‘Yes’ if they belong to the same cluster in the optimal clustering, and ‘No’ otherwise.
Here is the main theorem giving approximation guarantee of the above algorithm. As stated earlier, the proof follows from the proof of a similar theorem (Theorem 4.7 in [GG2006]) by Giotis and Guruswami.
Let . For any input labelling, returns a -clustering with the number of disagreements within a factor of of the optimal. [Footnote 2: Readers familiar with [GG2006] will realise that the statement of the theorem is slightly different from the statement of the similar theorem (Theorem 13) in [GG2006]. More specifically, the claim is about the function call with as a parameter rather than . This is done to allow the recursive call in step (9) to be made with the same value of the precision parameter as the initial call. This does not change the approximation analysis but is crucial for our running time analysis.]
Even though the approximation analysis of the query algorithm remains the same as that of the non-query algorithm of Giotis and Guruswami, the running time analysis changes significantly. Let us write a recurrence relation for the running time of our recursive algorithm. Let denote the running time of the algorithm when node graph is supposed to be clustered into clusters with a given precision parameter . Using the results of the previous subsection, the running time of step (2) is . The running time for partitioning the set is given by which is . Steps (6-8) would cost time which is . So, the recurrence relation for the running time may be written as . This simplifies to . As far as the same-cluster queries are concerned, we can write a similar recurrence relation which simplifies to . This completes the proof of Theorem 1.2.
4 Conclusion and Open Problems
In this work, we give upper and lower bounds on the number of same-cluster queries to obtain an efficient -approximation for correlation clustering on complete graphs in the SSAC framework of Ashtiani et al. Our lower bound results are based on the Exponential Time Hypothesis (ETH). It is an interesting open problem to give unconditional lower bounds for these problems. Another interesting open problem is to design query based algorithms with faulty oracles. This setting is more practical since in many contexts it may not be known with high confidence whether any two vertices belong to the same optimal cluster. Mitzenmacher and Tsourakakis [MT2016] designed a query based algorithm for clustering where the query oracle, similar to our model, answers whether or not any two vertices belong to the same optimal cluster, but these answers may be wrong with some probability less than . For , we can use their algorithm to obtain -approximation for with a faulty oracle. However, for they needed a stronger query model to obtain good clusterings. Designing an efficient -approximation algorithm for with a faulty oracle is an interesting open problem.