# Clustering with Noisy Queries

In this paper, we initiate a rigorous theoretical study of clustering with noisy queries (or a faulty oracle). Given a set of n elements, our goal is to recover the true clustering by asking the minimum number of pairwise queries to an oracle. The oracle can answer queries of the form "do elements u and v belong to the same cluster?"; the queries can be asked interactively (adaptive queries) or non-adaptively up front, but each answer can be erroneous with probability p. We provide the first information-theoretic lower bound on the number of queries for clustering with a noisy oracle in both settings. We design novel algorithms that closely match this query complexity lower bound, even when the number of clusters is unknown. Moreover, we design computationally efficient algorithms for both the adaptive and non-adaptive settings. The problem captures and generalizes multiple application scenarios. It is directly motivated by the growing body of work that uses crowdsourcing for entity resolution, a fundamental and challenging data mining task that aims to identify all records in a database referring to the same entity. Here the crowd represents the noisy oracle, and the number of queries directly relates to the cost of crowdsourcing. Another application comes from the problem of edge sign prediction in social networks, where social interactions can be both positive and negative, and one must identify the sign of all pairwise interactions by querying a few pairs. Furthermore, clustering with a noisy oracle is intimately connected to correlation clustering, leading to improvements therein. Finally, it introduces a new direction of study in the popular stochastic block model, where one has an incomplete stochastic block model matrix from which to recover the clusters.


## 1. Introduction

Clustering is one of the most fundamental and popular methods for data classification. In this paper we initiate a rigorous theoretical study of clustering with the help of a ‘faulty oracle’, a model that captures many application scenarios and has drawn significant attention in recent years. (Footnote: a prior version of some of the results of this work appeared previously on arXiv [42, Sec. 6]; see https://arxiv.org/abs/1604.01839. In this version we rewrote several proofs for clarity and included many new results.)

Suppose we are given a set of $n$ points that need to be clustered into $k$ clusters, where $k$ is unknown to us. Suppose there is an oracle that can answer pair-wise queries of the form, “do $u$ and $v$ belong to the same cluster?”. Repeating the same question to the oracle always returns the same answer, but the answer could be wrong with probability $p < 1/2$ (that is, slightly better than a random answer). We are interested in finding the minimum number of queries needed to recover the true clusters with high probability. Understanding the query complexity of the faulty oracle model is a fundamental theoretical question [23], with many existing works on sorting and selection [5, 6] where queries are erroneous with probability $p$ and repeating the same question does not change the answer. Here we study the basic clustering problem under this setting, which also captures several fundamental applications. Throughout the paper, ‘noisy oracle’ and ‘faulty oracle’ have the same meaning.
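The oracle model described above can be made concrete with a small simulation. The following is a minimal Python sketch and not part of the paper: the class name `FaultyOracle`, the `labels` list that encodes the hidden clustering, and the `seed` parameter are illustrative assumptions. It flips the true same-cluster answer with probability `p` and caches the first answer for each unordered pair, so repeating a query never changes the result.

```python
import random


class FaultyOracle:
    """Same-cluster oracle: each answer is wrong with probability p and never changes on repetition."""

    def __init__(self, labels, p, seed=0):
        self.labels = labels              # labels[u] = hidden cluster id of element u (assumed encoding)
        self.p = p                        # error probability, assumed to be below 1/2
        self.rng = random.Random(seed)
        self.cache = {}                   # first answer given for each unordered pair

    def query(self, u, v):
        key = (min(u, v), max(u, v))
        if key not in self.cache:
            truth = self.labels[u] == self.labels[v]
            flip = self.rng.random() < self.p
            self.cache[key] = truth != flip   # XOR: flip the true answer with probability p
        return self.cache[key]

    @property
    def num_queries(self):
        # Number of distinct pairs asked so far; repeated questions are free.
        return len(self.cache)
```

Because repeated questions return the cached answer, `num_queries` counts distinct pairs, which is the quantity a query-complexity bound would measure in this sketch.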

Crowdsourced Entity Resolution. Entity resolution (ER) is an important data mining task that tries to identify all records in a database that refer to the same underlying entity. Starting with the seminal work of Fellegi and Sunter [24], numerous algorithms with a variety of techniques have been developed for ER [22, 26, 37, 17]. Still, due to ambiguity in representation and poor data quality, the accuracy of automated ER techniques has been unsatisfactory. To remedy this, a recent trend in ER has been to use humans in the loop. In this setting, humans are asked simple pair-wise queries adaptively, “do $u$ and $v$ represent the same entity?”, and these answers are used to improve the final accuracy [28, 50, 52, 25, 48, 19, 27, 35, 43]. The proliferation of crowdsourcing platforms like Amazon Mechanical Turk (AMT), CrowdFlower, etc. allows for easy implementation. However, data collected from non-expert workers on crowdsourcing platforms are inevitably noisy. A simple scheme to reduce errors could be to take a majority vote after asking the same question to multiple independent crowd workers. However, often that is not sufficient. Our experiments on several real datasets with answers collected from AMT [29, 48] show that majority voting can sometimes even increase the errors. Interestingly, such an observation has been made by a recent paper as well [47]. There are more complex querying models [47, 51, 49] and involved heuristics [29, 48] to handle errors in this scenario. Let $p$ be the probability of error of a query answer (footnote: an approximation of $p$ can often be estimated manually from a small sample of crowd answers), which might also be the aggregated answer after repeating the query several times. Therefore, once the answer has been aggregated, it cannot change. In all crowdsourcing works, the goal is to minimize the number of queries to reduce the cost and time of crowdsourcing, and to recover the entities (clusters). This is exactly clustering with a noisy oracle. While several heuristics have been developed [48, 28, 49], here we provide a rigorous theory with near-optimal algorithms and hardness bounds.

Signed Edge Prediction. The edge sign prediction problem can be defined as follows. Suppose we are given a social network with signs on all its edges, but the sign of the edge from one node to another is hidden. The goal is to recover these signs as well as possible using a minimal amount of information. Social interactions or sentiments can be both positive (“like”, “trust”) and negative (“dislike”, “distrust”). [38] provides several such examples; e.g., Wikipedia, where one can vote for or against the nomination of others to adminship [8], or Epinions and Slashdot, where users can express trust or distrust, or can declare others to be friends or foes [7, 36]. Initiated by [9, 32], many techniques and related models using convex optimization, low-rank approximation and learning theoretic approaches have been used for this problem [15, 10, 12]. Recently [14, 12, 44] proposed the following model for edge sign prediction. We can query a pair of nodes to test whether they belong to the same cluster (a positive sign) or not (a negative sign). However, the query fails to return the correct answer with some probability, and we want to query as few pairs as possible. This is exactly the case of clustering with a noisy oracle. Our result significantly improves and generalizes over [14, 12, 44].

Correlation Clustering. In fact, when all pair-wise queries are given, and the goal is to recover the maximum likelihood (ML) clustering, then our problem is equivalent to noisy correlation clustering [4, 41]. Introduced by [4], correlation clustering is an extremely well-studied model of clustering. We are given a graph with each edge labelled either $+1$ or $-1$, and the goal of correlation clustering is to either (a) minimize the number of disagreements, that is, the number of intra-cluster $-1$ edges and inter-cluster $+1$ edges, or (b) maximize the number of agreements, that is, the number of intra-cluster $+1$ edges and inter-cluster $-1$ edges. Correlation clustering is NP-hard, but can be approximated well with provable guarantees [4]. In a random noise model, also introduced by [4] and studied further by [41], we start with a ground truth clustering, and then each edge label is flipped with probability $p$. This is exactly the graph we observe if we make all possible pair-wise queries, and the ML decoding coincides with correlation clustering. The proposed algorithm of [4] can recover in this case all clusters of size , and if “all” the clusters have size , then they can be recovered by [41]. Using our proposed algorithms for clustering with a noisy oracle, we can also recover significantly smaller clusters, provided the number of clusters is not too large. Such a result is possible to obtain using the repeated-peeling technique of [3]. However, our running time is significantly better. E.g. for , we have a running time of , whereas for [3], it is dominated by the time to solve a convex optimization over -vertex graph which is at least .

Stochastic Block Model (SBM). Clustering with a faulty oracle is intimately connected with the planted partition model, also known as the stochastic block model [34, 21, 20, 2, 1, 30, 16, 45]. The stochastic block model is an extremely well-studied model of random graphs where two vertices within the same community share an edge with probability , and two vertices in different communities share an edge with probability . It is often assumed that $k$, the number of communities, is a constant (e.g., $k = 2$ is known as the planted bisection model and is studied extensively [1, 45, 21]) or a slowly growing function of $n$ (e.g. ). There is extensive literature on characterizing the threshold phenomenon in SBM in terms of the gap between and (e.g. see [2] and therein for many references) for exact and approximate recovery of clusters of nearly equal size. (Footnote: most recent works consider the region of interest as and for some .) If we allow for different probabilities of error for pairs of elements based on whether they belong to the same cluster or not, then the resultant faulty oracle model is an intriguing generalization of SBM. Consider the probability of error for a query on is if and belong to the same cluster and otherwise; but now, we can only learn a subset of the entries of an SBM matrix by querying adaptively. Understanding how the threshold of recovery changes for such an “incomplete” or “space-efficient” SBM will be a fascinating direction to pursue. In fact, our lower bound results extend to asymmetric probability values, while designing efficient algorithms and sharp thresholds is ongoing work. In [13], a locality model where measurements can only be obtained for nearby nodes is studied for two clusters with non-adaptive querying and allowing repetitions. It would also be interesting to extend our work with such locality constraints.

Contributions. Formally the clustering with a faulty oracle is defined as follows.

###### Problem (Query-Cluster).

Consider a set of points containing latent clusters , , , where and the subsets are unknown. There is an oracle with two error parameters . The oracle takes as input a pair of vertices , and if belong to the same cluster then with probability and with probability . On the other hand, if do not belong to the same cluster then with probability and with probability . Such an oracle is called a binary asymmetric channel. A special case would be when , the binary symmetric channel, where the error rate is the same for all pairs. Except for the lower bound, we focus on the symmetric case in this paper. Note that the oracle returns the same answer on repetition. Now, given , find such that is minimum, and from the oracle answers it is possible to recover , with high probability. (Footnote: high probability implies with probability , where as .)

Our contributions are as follows.

Lower Bound (Section 2). We show that is the information theoretic lower bound on the number of adaptive queries required to obtain the correct clustering with high probability even when the clusters are of similar size (see, Theorem 1). Here is the Jensen-Shannon divergence between Bernoulli and distributions. For the symmetric case, that is when , . In particular, if , our lower bound on query complexity is . Developing lower bounds in the interactive setting especially with noisy answers appears to be significantly challenging as popular techniques based on Fano-type inequalities for multiple hypothesis testing [11, 39] do not apply, and we believe our technique will be useful in other noisy interactive learning settings.
Information-Theoretic Optimal Algorithm (Section 3). For the symmetric error case, we design an algorithm which asks at most queries (Theorem 2) matching the lower bound within an factor, whenever .
Computationally Efficient Algorithm (Section 3.2). We next design an algorithm that is computationally efficient and runs in time and asks at most queries. Note that most prior works in SBM, or works on edge sign detection, only consider the case when is a constant [2, 30, 16], even just [45, 1, 14, 12, 44]. As long as, , we get a running time of . We can use this algorithm to recover all clusters of size at least for correlation clustering on noisy graph, improving upon the results of [4, 41]. The algorithm runs in time whenever , as opposed to in [3].
Nonadaptive Algorithm (Section 4). When the queries must be done up-front, for , we give a simple time algorithm that asks queries improving upon [44] where a polynomial time algorithm (at least with a running time of ) is shown with number of queries and over [14, 12] where queries are required under certain conditions on the clusters. Our result generalizes to , and we show interesting lower bounds in this setting. Further, we derive new lower bound showing trade-off between queries and threshold of recovery for incomplete SBM in Sec.  4.1.

## 2. Lower bound for the faulty-oracle model

Note that we are not allowed to ask the same question multiple times to get the correct answer. In this case, even for probabilistic recovery, a minimum size bound on cluster size is required. For example, consider the following two different clusterings. and . Now if one of these two clusterings is given to us uniformly at random, then no matter how many queries we make, we will fail to recover the correct clustering with positive probability. Therefore, the challenge in proving lower bounds is when clusters all have size more than a minimum threshold, or when they are all nearly balanced. This removes the constraint on the algorithm designer on how many times a cluster can be queried with a vertex, and the algorithms can have greater flexibility. We define a clustering to be balanced if either of the following two conditions holds: 1) the maximum size of a cluster is , 2) the minimum size of a cluster is . It is much harder to prove lower bounds if the clustering is balanced.

Our main lower bound in this section uses the Jensen-Shannon (JS) divergence. The well-known KL divergence is defined between two probability mass functions and : Further define the JS divergence as:

. In particular, the KL and JS divergences between two Bernoulli random variable with parameters

and are denoted with and respectively.

###### Theorem 1 (Query-Cluster Lower Bound).

Any (randomized) algorithm must make expected number of queries to recover the correct clustering with probability at least , even when the clustering is known to be balanced.

Note that the lower bound is more effective when and are close. Moreover our actual lower bound is slightly tighter with the expected number of queries required given by

We have to be the -element set to be clustered: . To prove Theorem 1 we first show that, if the number of queries is small, then there exist number of clusters, that are not being sufficiently queried with. Then we show that, since the size of the clusters cannot be too large or too small, there exists a decent number of vertices in these clusters.

The main piece of the proof of Theorem 1 is Lemma 1.

###### Lemma 1.

Suppose, there are clusters. There exist at least clusters such that an element from any one of these clusters will be assigned to a wrong cluster by any randomized algorithm with probability unless the total number of queries involving is more than

###### Proof.

Our first task is to cast the problem as a hypothesis testing problem.

Step 1: Setting up the hypotheses. Let us assume that the $k$ clusters are already formed, and we can moreover assume that all elements except for one element $v$ have already been assigned to a cluster. Note that queries that do not involve the said element play no role in this stage.

Now the problem reduces to a hypothesis testing problem where the $i$th hypothesis $H_i$, for $i = 1, \dots, k$, denotes that the true cluster for $v$ is $V_i$. We can also add a null hypothesis $H_0$ that stands for the vertex belonging to none of the clusters (since $k$ is unknown, this is a hypothetical possibility for any algorithm; this lower bound easily extends to the case where $k$ is known). Let $P_i$ denote the joint probability distribution of our observations (the answers to the queries involving vertex $v$) when $H_i$ is true, $i = 0, 1, \dots, k$. That is, for any event $A$ we have

$$P_i(A) = \Pr(A \mid H_i).$$

Suppose $T$ denotes the total number of queries made by a (possibly randomized) algorithm at this stage before assigning a cluster. Also let $x$ be the $T$-dimensional binary vector that is the result of the queries. The assignment is based on $x$. Let the random variable $T_i$ denote the number of queries involving cluster $V_i$. In the second step, we need to identify a set of clusters that are not queried enough by the algorithm.

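Step 2: A set of “weak” clusters. We must have $\sum_{i=1}^{k} \mathbb{E}_0 T_i \le T$. Let

$$J_1 \equiv \Big\{ i \in \{1,\dots,k\} : \mathbb{E}_0 T_i \le \frac{10T}{k} \Big\}.$$

Since fewer than $k/10$ indices can have $\mathbb{E}_0 T_i > \frac{10T}{k}$, we have $|J_1| \ge \frac{9k}{10}$. That is, there exist at least $\frac{9k}{10}$ clusters in each of which fewer than $\frac{10T}{k}$ queries (on average under $H_0$) were made before assignment.

Let $E_i$ denote the event that the algorithm assigns $v$ to cluster $V_i$. Let

$$J_2 = \Big\{ i \in \{1,\dots,k\} : P_0(E_i) \le \frac{10}{k} \Big\}.$$

Moreover, since $\sum_{i=1}^{k} P_0(E_i) \le 1$, we must have $(k - |J_2|)\frac{10}{k} \le 1$, or $|J_2| \ge \frac{9k}{10}$. Therefore, $J = J_1 \cap J_2$ has size

$$|J| \ge 2\cdot\frac{9k}{10} - k = \frac{4k}{5}.$$

Now let us assume that we are given an element $v \in V_j$ for some $j \in J$ to cluster ($H_j$ is the true hypothesis). The probability of correct clustering is $P_j(E_j)$. In the last step, we give an upper bound on the probability of correct assignment for this element.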

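Step 3: Bounding the probability of correct assignment for weak cluster elements. We must have

$$\begin{aligned}
P_j(E_j) &= P_0(E_j) + P_j(E_j) - P_0(E_j) \\
&\le \frac{10}{k} + |P_0(E_j) - P_j(E_j)| \\
&\le \frac{10}{k} + \|P_0 - P_j\|_{TV} \le \frac{10}{k} + \sqrt{\frac{1}{2} D(P_0 \| P_j)},
\end{aligned}$$

where we again used the definition of the total variation distance, and in the last step we have used Pinsker’s inequality [18]. The task is now to bound the divergence $D(P_0 \| P_j)$. Recall that $P_0$ and $P_j$ are the joint distributions of the independent random variables (answers to queries) that are identical to one of two Bernoulli random variables, with parameters $p$ and $q$. Let $x_1, \dots, x_T$ denote the outputs of the queries, all independent random variables. We must have, from the chain rule [18],

$$\begin{aligned}
D(P_0 \| P_j) &= \sum_{i=1}^{T} D\big(P_0(x_i \mid x_1,\dots,x_{i-1}) \,\|\, P_j(x_i \mid x_1,\dots,x_{i-1})\big) \\
&= \sum_{i=1}^{T} \sum_{(x_1,\dots,x_{i-1}) \in \{0,1\}^{i-1}} P_0(x_1,\dots,x_{i-1})\, D\big(P_0(x_i \mid x_1,\dots,x_{i-1}) \,\|\, P_j(x_i \mid x_1,\dots,x_{i-1})\big).
\end{aligned}$$

Note that, for the random variable $x_i$, the term $D\big(P_0(x_i \mid x_1,\dots,x_{i-1}) \,\|\, P_j(x_i \mid x_1,\dots,x_{i-1})\big)$ will contribute $D(q \| p)$ only when the $i$th query involves the cluster $V_j$. Otherwise the term will contribute $0$. Hence,

$$\begin{aligned}
D(P_0 \| P_j) &= \sum_{i=1}^{T} \sum_{\substack{(x_1,\dots,x_{i-1}) \in \{0,1\}^{i-1}: \\ i\text{th query involves } V_j}} P_0(x_1,\dots,x_{i-1})\, D(q \| p) \\
&= D(q \| p) \sum_{i=1}^{T} \sum_{\substack{(x_1,\dots,x_{i-1}) \in \{0,1\}^{i-1}: \\ i\text{th query involves } V_j}} P_0(x_1,\dots,x_{i-1}) \\
&= D(q \| p) \sum_{i=1}^{T} P_0(i\text{th query involves } V_j) = D(q \| p)\, \mathbb{E}_0 T_j \le \frac{10T}{k} D(q \| p).
\end{aligned}$$

Now plugging this in,

$$P_j(E_j) \le \frac{10}{k} + \sqrt{\frac{1}{2} \cdot \frac{10T}{k} D(q \| p)} \le \frac{10}{k} + \sqrt{\frac{1}{2}},$$

if $T \le \frac{k}{10 D(q \| p)}$ and $k$ is large enough. Had we bounded the total variation distance with $D(P_j \| P_0)$ in Pinsker’s inequality, then we would have $D(p \| q)$ in the denominator. Obviously the smaller of $D(p \| q)$ and $D(q \| p)$ would give the stronger lower bound. ∎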
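The bound obtained in Step 3 is easy to evaluate numerically. The sketch below is an illustration under stated assumptions rather than part of the proof: the function names are hypothetical, the divergence is passed in as a plain number so that no particular parameterization of the two Bernoulli distributions is assumed, and the helper `bernoulli_kl` uses natural logarithms. For two Bernoulli distributions with parameters `a` and `b`, the total variation distance is `|a - b|`, which makes Pinsker's inequality simple to spot-check.

```python
import math


def bernoulli_kl(a, b):
    """KL divergence D(Bernoulli(a) || Bernoulli(b)) in nats, for 0 < a, b < 1."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))


def correct_assignment_bound(k, T, divergence):
    """Evaluate 10/k + sqrt(0.5 * (10*T/k) * divergence), capped at 1.

    k is the number of clusters, T the number of queries involving the
    element, and divergence the per-query KL term from Step 3.
    """
    return min(1.0, 10 / k + math.sqrt(0.5 * (10 * T / k) * divergence))


def pinsker_holds(a, b):
    """Check |a - b| <= sqrt(D(a || b) / 2) for two Bernoulli parameters."""
    return abs(a - b) <= math.sqrt(0.5 * bernoulli_kl(a, b)) + 1e-12
```

With `T` equal to `k / (10 * divergence)`, the square-root term evaluates to `sqrt(1/2)`, matching the last inequality of Step 3.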

Now we are ready to prove Theorem 1.

###### Proof of Theorem 1.

We will show the claim by considering a balanced input. Recall that for a balanced input either the maximum size of a cluster is or the minimum size of a cluster is . We will consider the two cases separately for the proof.

Case 1: the maximum size of a cluster is .

Suppose, the total number of queries is . That means number of vertices involved in the queries is . Note that, there are clusters and elements. Let be the set of vertices that are involved in less than queries. Clearly,

Now we know from Lemma 1 that there exists clusters such that a vertex from any one of these clusters will be assigned to a wrong cluster by any randomized algorithm with probability unless the expected number of queries involving this vertex is more than .

We claim that must have an intersection with at least one of these clusters. If not, then more than vertices must belong to less than clusters. Or the maximum size of a cluster will be which is prohibited according to our assumption.

Now each vertex in the intersection of and the clusters are going to be assigned to an incorrect cluster with positive probability if, Therefore we must have

Case 2: the minimum size of a cluster is .

Let be the set of clusters that are involved in at most queries. That means, This implies, . Now we know from Lemma 1 that there exist clusters (say ) such that a vertex from any one of these clusters will be assigned to a wrong cluster by any randomized algorithm with probability unless the expected number of queries involving this vertex is more than . Quite clearly .

Consider a cluster such that , which is always possible because the intersection is nonempty. is involved in at most queries. Let the minimum size of any cluster be . Now, at least half of the vertices of must each be involved in at most queries. Now each of these vertices must be involved in at least queries (see Lemma 1) to avoid being assigned to a wrong cluster with positive probability. This means, or since . ∎

## 3. Algorithms

In this section, we first develop an information theoretically optimal algorithm for clustering with faulty oracle within an factor of the optimal query complexity. Next, we show how the ideas can be extended to make it computationally efficient. We consider both the adaptive and non-adaptive versions.

### 3.1. Information-Theoretic Optimal Algorithm

Let be the true clustering and be the maximum likelihood (ML) estimate of the clustering that can be found when all queries have been made to the faulty oracle. Our first result obtains a query complexity upper bound within an factor of the information theoretic lower bound. The algorithm runs in quasi-polynomial time, and we show this is the optimal possible assuming the famous planted clique hardness. Next, we show how the ideas can be extended to make it computationally efficient in Section 3.2. We consider both the adaptive and non-adaptive versions.

In particular, we prove the following theorem.

###### Theorem 2.

There exists an algorithm with query complexity for Query-Cluster that returns the ML estimate with high probability when query answers are incorrect with probability . Moreover, the algorithm returns all true clusters of of size at least for a suitable constant with probability .

###### Remark 1.

Assuming , as , , matching the query complexity lower bound within an factor. Thus our upper bound is within a factor of the information theoretic optimum in this range.

#### Finding the Maximum Likelihood Clustering of V with faulty oracle

We can view the clustering problem as following. We have an undirected graph , such that is a union of disjoint cliques , . The subsets are unknown to us; they are called the clusters of . The adjacency matrix of is a block-diagonal matrix. Let us denote this matrix by .

Now suppose, each edge of is erased independently with probability , and at the same time each non-edge is replaced with an edge with probability . Let the resultant adjacency matrix of the modified graph be . The aim is to recover from .

###### Lemma 2.

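The maximum likelihood recovery is given by the following:

$$\max_{S_\ell,\ \ell=1,\dots:\ V=\sqcup_{\ell} S_\ell}\ \prod_{\ell}\ \prod_{i,j \in S_\ell,\, i \ne j} P_+(z_{i,j}) \prod_{r,t,\, r \ne t}\ \prod_{i \in S_r,\, j \in S_t} P_-(z_{i,j}) = \max_{S_\ell,\ \ell=1,\dots:\ V=\sqcup_{\ell} S_\ell}\ \prod_{\ell}\ \prod_{i,j \in S_\ell,\, i \ne j} \frac{P_+(z_{i,j})}{P_-(z_{i,j})} \prod_{i,j \in V,\, i \ne j} P_-(z_{i,j}),$$

where $P_+(1) = 1-p$, $P_+(0) = p$, $P_-(1) = p$, and $P_-(0) = 1-p$.

Therefore, the ML recovery asks for

$$\max_{S_\ell,\ \ell=1,\dots:\ V=\sqcup_{\ell} S_\ell}\ \sum_{\ell}\ \sum_{i,j \in S_\ell,\, i \ne j} \ln\frac{P_+(z_{i,j})}{P_-(z_{i,j})}.$$

Note that

$$\ln\frac{P_+(0)}{P_-(0)} = -\ln\frac{P_+(1)}{P_-(1)} = \ln\frac{p}{1-p}.$$

Hence the ML estimation is

$$\max_{S_\ell,\ \ell=1,\dots:\ V=\sqcup_{\ell} S_\ell}\ \sum_{\ell}\ \sum_{i,j \in S_\ell,\, i \ne j} \omega_{i,j}, \tag{1}$$

where $\omega_{i,j} = 2z_{i,j} - 1$, i.e., $\omega_{i,j} = +1$ when $z_{i,j} = 1$ and $\omega_{i,j} = -1$ when $z_{i,j} = 0$. Further We will use this fact to prove Theorem 2 and Theorem 3 below.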

Note that (1) is equivalent to finding a correlation clustering in with the objective of maximizing the consistency with the edge labels; that is, we want to maximize the total number of positive intra-cluster edges and the total number of negative inter-cluster edges [4, 41, 40]. This can be seen as follows.

$$\max_{S_\ell,\ \ell=1,\dots:\ V=\sqcup_{\ell} S_\ell}\ \sum_{\ell}\ \sum_{i,j \in S_\ell,\, i \ne j} \omega_{i,j} = \max_{S_\ell,\ \ell=1,\dots:\ V=\sqcup_{\ell} S_\ell} \Big[ \sum_{\ell} \big|\{(i,j) : i,j \in S_\ell,\ i \ne j,\ \omega_{i,j} = +1\}\big| + \sum_{r,t:\, r \ne t} \big|\{(i,j) : i \in S_r,\ j \in S_t,\ \omega_{i,j} = -1\}\big| \Big].$$

Therefore (1) is the same as correlation clustering. However, going forward we will view it as obtaining clusters with maximum intra-cluster weight. That will help us obtain the desired running time of our algorithm. Also note that we have a random instance of correlation clustering here, and not a worst-case instance.
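To make objective (1) concrete, the following Python sketch scores a candidate clustering by its intra-cluster weight and finds a maximizer by exhaustive search over all partitions. It is only an illustration under assumptions: the names `intra_cluster_weight`, `partitions`, and `ml_clustering` are hypothetical, `omega` is assumed to be a dictionary keyed by pairs `(i, j)` with `i < j` holding the values +1 or -1, and each unordered pair is counted once rather than twice, which does not change the maximizer. The search is exponential in the number of elements and is not the algorithm developed in this paper.

```python
from itertools import combinations


def intra_cluster_weight(clusters, omega):
    """Sum of omega[(i, j)] over all pairs i < j that share a cluster."""
    return sum(omega[(i, j)] for c in clusters for i, j in combinations(sorted(c), 2))


def partitions(items):
    """Yield every partition of a list into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[first]] + part                      # first forms its own block
        for idx in range(len(part)):                # or first joins an existing block
            yield part[:idx] + [[first] + part[idx]] + part[idx + 1:]


def ml_clustering(n, omega):
    """Brute-force maximizer of the intra-cluster weight; for tiny n only."""
    return max(partitions(list(range(n))), key=lambda part: intra_cluster_weight(part, omega))
```

For example, with four elements whose answers are +1 on the pairs (0, 1) and (2, 3) and -1 elsewhere, the maximizer groups 0 with 1 and 2 with 3.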

#### Algorithm 1.

The algorithm that we propose has several phases. The main idea is as follows. We start by selecting a small subset of vertices, and extract the heaviest weight subgraph in it by suitably defining edge weight. If the subgraph extracted has size, we are confident that it is part of an original cluster. We then grow it completely, where a decision to add a new vertex to it happens by considering the query answers involving these different vertices and the new vertex. Otherwise, if the subgraph extracted has size less than , we select more vertices. We note that we would never have to select more than vertices, because by pigeonhole principle, this will ensure that we have selected at least members from a cluster, and the subgraph detected will have size at least . This helps us to bound the query complexity. We emphasize that our algorithm is completely deterministic.

Phase 1: Selecting a small subgraph. Let .

1. Select vertices arbitrarily from . Let be the set of selected vertices. Create a subgraph by querying for every and assigning a weight of if the query answer is “yes” and otherwise .

2. Extract the heaviest weight subgraph in . If , move to Phase 2.

3. Else we have . Select a new vertex , add it to , and query with every vertex in . Move to step (2).

Phase 2: Creating an Active List of Clusters. Initialize an empty list called when Phase 2 is executed for the first time.

1. Add to the list .

2. Update by removing from and every edge incident on . For every vertex , if , include in and remove from with all edges incident to it.

3. Extract the heaviest weight subgraph in . If , move to step (1). Else move to Phase 3.

Phase 3: Growing the Active Clusters. We now have a set of clusters in .

1. Select an unassigned vertex not in (that is previously unexplored), and for every cluster , pick distinct vertices in the cluster and query with them. If the majority of these answers are “yes”, then include in .

2. Else we have for every the majority answer is “no” for . Include and query with every node in and update accordingly. Extract the heaviest weight subgraph from and if its size is at least move to Phase 2 step (1). Else move to Phase 3 step (1) by selecting another unexplored vertex.

Phase 4: Maximum Likelihood (ML) Estimate.

1. When there is no new vertex to query in Phase 3, extract the maximum likelihood clustering of the remaining graph and return it along with the active clusters, where the ML estimation is defined in Equation (1).
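Step 1 of Phase 3 can be sketched as a majority vote over a few members of each active cluster. The Python below is a minimal illustration, not the paper's implementation: the function names are hypothetical, `query(u, v)` is assumed to return `True` for a "yes" answer, and `num_probes` stands in for the number of distinct cluster members queried, whose value is not reproduced here. Vertices rejected by every active cluster are simply returned, whereas the full algorithm continues to process them in step 2.

```python
def majority_says_yes(v, cluster, query, num_probes):
    """Query v against up to num_probes distinct members of cluster; True if most answers are yes."""
    probes = list(cluster)[:num_probes]
    yes = sum(1 for u in probes if query(u, v))
    return 2 * yes > len(probes)


def grow_active_clusters(unassigned, active, query, num_probes):
    """Place each vertex in the first active cluster whose majority vote is yes.

    Mutates the clusters in `active` (lists of vertices) and returns the
    vertices that every active cluster rejected.
    """
    leftover = []
    for v in unassigned:
        for cluster in active:
            if majority_says_yes(v, cluster, query, num_probes):
                cluster.append(v)
                break
        else:
            leftover.append(v)   # no active cluster accepted v
    return leftover
```

Each vertex costs at most `num_probes` queries per active cluster in this sketch, which is how the growing phase keeps the query count well below that of asking all pairs.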

#### Analysis.

To establish the correctness of the algorithm, we show the following. Suppose all queries on have been made. If the ML estimate of the clustering with these answers is same as the true clustering of that is, then the algorithm for faulty oracle finds the true clustering with high probability.

Let without loss of generality, . We will show that Phases 1-3 recover with probability at least . The remaining clusters are recovered in Phase 4.

A subcluster is a subset of nodes in some cluster. Lemma 5 shows that any set that is included in the active list in Phase 2 of the algorithm is a subcluster of $V$. This establishes that all clusters in the active list at any time are subclusters of some original cluster in $V$. Next, Lemma 7 shows that elements that are added to a cluster in the active list are added correctly, and no two clusters in the active list can be merged. Therefore, clusters obtained from the active list are the true clusters. Finally, the remaining clusters can be retrieved by computing an ML estimate on the remaining graph in Phase 4, leading to Theorem 3.

We will use the following version of Hoeffding’s inequality heavily in our proof; we state it here for the sake of completeness. Hoeffding’s inequality for large deviations of sums of bounded independent random variables is well known [33, Thm. 2].

###### Lemma 3 (Hoeffding).

If $X_1, \dots, X_n$ are independent random variables and $a_i \le X_i \le b_i$ for all $i \in \{1,\dots,n\}$, then

$$\Pr\Big(\Big|\frac{1}{n}\sum_{i=1}^{n} (X_i - \mathbb{E}X_i)\Big| \ge t\Big) \le 2\exp\Big(-\frac{2n^2t^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\Big).$$

This inequality can be used when the random variables are independently sampled with replacement from a finite sample space. However, due to a result in the same paper [33, Thm. 4], this inequality also holds when the random variables are sampled without replacement from a finite population.

###### Lemma 4 (Hoeffding).

If $X_1, \dots, X_n$ are random variables sampled without replacement from a finite set $\mathcal{X} \subset \mathbb{R}$, and $a \le x \le b$ for all $x \in \mathcal{X}$, then

$$\Pr\Big(\Big|\frac{1}{n}\sum_{i=1}^{n} (X_i - \mathbb{E}X_i)\Big| \ge t\Big) \le 2\exp\Big(-\frac{2nt^2}{(b - a)^2}\Big).$$

###### Lemma 5.

Let . Algorithm in Phase and returns a subcluster of of size at least with high probability if contains a subcluster of of size at least . Moreover, it does not return any set of vertices of size at least if does not contain a subcluster of of size at least .

###### Proof.

Let , , for , and . Suppose without loss of generality . The lemma is proved via a series of claims. The proofs of the claims are delegated to Appendix A.

###### Claim 1.

Let . Then a set for some will be returned with high probability when is processed.

###### Claim 2.

Let . Then a set for some with size at least will be returned with high probability when is processed.

###### Claim 3.

If . then no subset of size will be returned by the algorithm for faulty oracle when processing with high probability.

Since, the algorithm attempts to extract a heaviest weight subgraph at most times, and each time the probability of failure is at most . By union bound, all the calls succeed with probability at least . This establishes the lemma. ∎

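We will need the following version of the Chernoff bound as well.

###### Lemma 6 (Chernoff Bound).

Let $X_1, \dots, X_n$ be independent binary random variables, and $X = \sum_{i=1}^{n} X_i$ with $\mathbb{E}[X] = \mu$. Then for any $\epsilon > 0$,

$$\Pr[X \ge (1+\epsilon)\mu] \le \exp\Big(-\frac{\epsilon^2}{2+\epsilon}\mu\Big)$$

and

$$\Pr[X \le (1-\epsilon)\mu] \le \exp\Big(-\frac{\epsilon^2}{2}\mu\Big).$$

###### Lemma 7.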

The list contains all the true clusters of of size at the end of the algorithm with high probability.

###### Proof.

From Lemma 5, any cluster that is added to in Phase is a subset of some original cluster in with high probability, and has size at least . Moreover, whenever contains a subcluster of of size at least , it is retrieved by the algorithm and added to .

When a vertex is added to a cluster in , we have at that time, and there exist distinct members of , say, such that a majority of the queries of with these vertices returned . Suppose, if possible, . Then the expected number of queries among the queries that had an answer “yes” (+1) is . We now use the Chernoff bound (Lemma 6) to obtain

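$$\Pr(v\text{ added to }C\mid v\notin C)\le\exp\left(-lp\,\frac{\left(\frac{1}{2p}-1\right)^2}{2+\left(\frac{1}{2p}-1\right)}\right)\le\frac{1}{n^3}.$$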

On the other hand, if there exists a cluster such that , then while growing , will be added to (either already belongs to , or is a newly considered vertex). This again follows from the Chernoff bound. Here the expected number of queries to be answered “yes” is . Hence the probability that fewer than queries will be answered “yes” is . Therefore, for all , if is included in a cluster in , the assignment is correct with probability at least . Also, the assignment happens as soon as such a cluster is formed in and is explored (whichever happens first).

Furthermore, two clusters in cannot be merged. Suppose, if possible, there are two clusters and that ought to be subsets of the same cluster in . Without loss of generality, suppose is added later in . Consider the first vertex that is considered by our algorithm. If is already there in at that time, then with high probability will be added to in Phase . Therefore, must have been added to after has been considered by our algorithm and added to . Now, at the time is added to in Phase , , and again will be added to with high probability in Phase , which gives a contradiction.

This completes the proof of the lemma. ∎
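The Chernoff step in the proof above can be turned into a small calculation. Reading the displayed bound as Lemma 6 applied with mean $lp$ and $\epsilon = \frac{1}{2p} - 1$ (so that $(1+\epsilon)lp = l/2$), the following Python sketch computes how many votes $l$ make the bound at most $1/n^3$, and simulates a majority vote for a vertex outside the cluster. The function names are hypothetical, and the constant the paper actually uses for $l$ is not visible in this section.

```python
import math
import random


def misassignment_bound(l, p):
    """Chernoff bound on Pr(at least l/2 of l queries say 'yes')
    when each query says 'yes' independently with probability p < 1/2."""
    eps = 1.0 / (2.0 * p) - 1.0
    return math.exp(-l * p * eps**2 / (2.0 + eps))


def votes_needed(n, p):
    """A number of votes l for which the bound is at most 1 / n**3."""
    eps = 1.0 / (2.0 * p) - 1.0
    rate = p * eps**2 / (2.0 + eps)
    l = math.ceil(3.0 * math.log(n) / rate)
    while misassignment_bound(l, p) > n ** -3:  # guard against rounding
        l += 1
    return l


def simulate_false_majority(l, p, trials=2000, seed=0):
    """Fraction of trials in which a strict majority of l noisy answers is
    'yes' although the true answer is 'no' (each answer wrong w.p. p)."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        yes = sum(rng.random() < p for _ in range(l))
        if 2 * yes > l:
            bad += 1
    return bad / trials


print(votes_needed(1000, 0.3))
print(simulate_false_majority(25, 0.3), misassignment_bound(25, 0.3))
```

Since the rate in the exponent shrinks as $p$ approaches $1/2$, the required number of votes grows quickly for oracles that are only slightly better than random.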

###### Theorem 3.

If the ML estimate of the clustering of with all possible queries returns the true clustering, then the algorithm for faulty oracle returns the true clusters with high probability. Moreover, it returns all the true clusters of of size at least with high probability.

###### Proof.

From Lemma 5 and Lemma 7, contains all the true clusters of of size at least with high probability. Any vertex that is not included in the clusters in at the end of the algorithm is in . Also, contains all possible pairwise queries among them. Clearly, then, the ML estimate of will be the true ML estimate of the clustering restricted to these clusters. ∎

Finally, once all the clusters in are grown, we have a fully queried graph in containing the small clusters, which can be retrieved in Phase . This completes the proof of correctness of the algorithm. With the following lemma, we get Theorem 2.

###### Lemma 8.

The query complexity of the algorithm for faulty oracle is .

###### Proof.

Let there be clusters in when is considered by the algorithm. could be in which case is considered in Phase , else is considered in Phase . Therefore, is queried with at most members, each from the clusters. If is not included in one of these clusters, then is added to and queried with all vertices in . We have seen in the correctness proof (Lemma 3) that if contains at least vertices from any original cluster, then the ML estimate on retrieves those vertices as a cluster with high probability. Hence, when is queried with the vertices in , . Thus the total number of queries made when the algorithm considers is at most , where when the error probability is . This gives the query complexity of the algorithm considering all the vertices, which matches the lower bound computed in Section 2 within an factor. ∎

Combining all of these, we get the statement of Theorem 2.

#### Running Time & Connection to Planted Clique

While the algorithm described above is very close to the information-theoretic optimum, its running time is not polynomial. Moreover, it is unlikely that the algorithm can be made efficient.

A crucial step of our algorithm is to find a large cluster of size at least , which can of course be computed in time. However, since the size of is bounded by , the running time to compute such a heaviest weight subgraph is . This running time is unlikely to be improved to a polynomial. This follows from the planted clique conjecture.

###### Conjecture 1 (Planted Clique Hardness).

Given an Erdős-Rényi random graph , with , the planted clique conjecture states that if we plant in a clique of size where , then there exists no polynomial time algorithm to recover the largest clique in this planted model.

Reduction. Given such a graph with a planted clique of size , we can construct a new graph by randomly deleting each edge with probability . Then in , there is one cluster of size where the edge error probability is , and the remaining clusters are singletons with the inter-cluster edge error probability being . So, if we can detect the heaviest weight subgraph in polynomial time in the faulty oracle algorithm, then there will be a polynomial time algorithm for the planted clique problem.
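The construction in the reduction can be simulated directly. The sketch below plants a clique in a random graph with edge probability 1/2, deletes each edge independently with probability `d`, and measures the resulting oracle error rates, reading a surviving edge as a "same cluster" answer. The deletion probability is not visible in this section, so `d` is left as a free parameter; all function names are hypothetical. A short calculation shows that an intra-clique pair is wrong with probability `d` and any other pair with probability `(1 - d) / 2`, and these coincide at `d = 1/3`, which is the value used in the illustration.

```python
import itertools
import random


def planted_clique_graph(n, k, rng):
    """Random graph on n vertices with edge probability 1/2 and a planted
    clique on k randomly chosen vertices. Edges are pairs (u, v), u < v."""
    clique = set(rng.sample(range(n), k))
    edges = set()
    for u, v in itertools.combinations(range(n), 2):
        if (u in clique and v in clique) or rng.random() < 0.5:
            edges.add((u, v))
    return clique, edges


def delete_edges(edges, d, rng):
    """Keep each edge independently with probability 1 - d."""
    return {e for e in edges if rng.random() >= d}


def oracle_error_rates(n, clique, kept):
    """Treat a kept edge as a 'same cluster' answer and return the
    (intra-cluster, inter-cluster) fractions of wrong answers."""
    intra_total = intra_err = inter_total = inter_err = 0
    for u, v in itertools.combinations(range(n), 2):
        present = (u, v) in kept
        if u in clique and v in clique:
            intra_total += 1
            intra_err += not present  # missing edge inside the cluster
        else:
            inter_total += 1
            inter_err += present      # spurious edge across clusters
    return intra_err / intra_total, inter_err / inter_total


rng = random.Random(0)
clique, edges = planted_clique_graph(200, 40, rng)
kept = delete_edges(edges, 1 / 3, rng)
print(oracle_error_rates(200, clique, kept))
```

Both printed rates should be close to 1/3, so the resulting instance looks like a faulty-oracle instance with one large cluster and singleton clusters elsewhere.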

In fact, the reduction shows that if it is computationally hard to detect a planted clique of size for some value of , then it is also computationally hard to detect a cluster of size in the faulty oracle model. Note that . In the next section, we propose a computationally efficient algorithm which recovers all clusters of size at least with high probability, which is the best possible assuming the conjecture, and can potentially recover much smaller sized clusters if .

### 3.2. Computationally Efficient Algorithm

#### Known k

We first design an algorithm for the case when , the number of clusters, is known. Then we extend it to the case of unknown . The algorithm is completely deterministic.

###### Theorem 4.

There exists a polynomial time algorithm with query complexity for Query-Cluster with error probability and known , that recovers all clusters of size at least .

The algorithm is given below.

Algorithm 2. Let . We define two thresholds and . The algorithm is as follows.

Phase 1-2C: Select a Small Subgraph. Initially we have an empty graph , and all vertices in are unassigned to any cluster.

1. Select new vertices arbitrarily from the unassigned vertices in and add them to such that the size of is . If there are not enough vertices left in , select all of them. Update by querying for every such that and and assigning a weight of