# Semisupervised Clustering by Queries and Locally Encodable Source Coding

Source coding is the canonical problem of data compression in information theory. In locally encodable source coding, each compressed bit depends on only a few bits of the input. In this paper, we show that a recently popular model of semisupervised clustering is equivalent to locally encodable source coding. In this model, the task is to perform multiclass labeling of unlabeled elements. At the beginning, we can ask in parallel a set of simple queries to an oracle who provides (possibly erroneous) binary answers to the queries. The queries cannot involve more than two (or a fixed constant number Δ of) elements. The labeling of all the elements (or clustering) must then be performed based on the (noisy) query answers. The goal is to recover all the correct labelings while minimizing the number of such queries. The equivalence to locally encodable source codes leads us to lower bounds on the number of queries required in a variety of scenarios. We are also able to show fundamental limitations of pairwise ‘same cluster’ queries, and we propose pairwise AND queries, which provably perform better in many situations.

## 1 Introduction

Suppose we have $n$ elements, and the $i$-th element has a label $X_i$. We consider the task of learning the labels of the elements (or learning the label vector). This can also be thought of as a problem of clustering the $n$ elements into clusters, where there is a ground-truth clustering. (Footnote 1: The difference between clustering and learning labels is that in the case of clustering it is not necessary to know the value of the label for a cluster. Therefore any unsupervised labeling algorithm is a clustering algorithm, but the reverse is not true. In this paper we are mostly concerned with the labeling problem, hence our algorithms (upper bounds) are valid for clustering as well.) There exist various approaches to this problem in general. In many cases some similarity values between pairs of elements are known (a high similarity value indicates that they are in the same cluster). Given these similarity values (or a weighted complete graph), the task is equivalent to graph clustering; when perfect similarity values are known this is equivalent to finding the connected components of a graph.

A recent approach to clustering has been via crowdsourcing. Suppose there is an oracle (expert labelers, crowd workers) to whom we can make pairwise queries of the form “do these two elements belong to the same cluster?”. We will call this the ‘same cluster’ query (as per [3]). Based on the answers from the oracle, we then try to reconstruct the labeling or clustering. This idea has seen a recent surge of interest, especially in entity resolution research (see, e.g., [34, 31, 7, 20]). Since each query to crowd workers costs time and money, a natural objective is to minimize the number of queries to the oracle and still recover the clusters exactly. Carefully designed adaptive and interactive querying algorithms for clustering have also recently been developed [34, 31, 7, 22, 21]. In particular, the query complexity for clustering with a $k$-means objective has recently been studied in [3], and there is significant work on designing optimal crowdsourcing schemes in general (see [12, 13, 29, 35, 15]). Note that a crowd worker may potentially handle more than two elements at a time; however, it is of interest to keep the number of elements involved in a query as small as possible. As an example, recent work in [32] considers triangle queries (involving three elements in a query). Crowd workers can also compute some simple functions on this small set of inputs instead of answering a ‘same cluster’ query. But again it is desirable that the answer the workers provide be simple, such as a binary answer.

The queries to the oracle can be asked adaptively or nonadaptively. For the clustering problem, both the adaptive and the nonadaptive versions have been studied. While both versions have obvious advantages and disadvantages, for crowdsourcing applications it is helpful in most scenarios to have a parallelizable querying scheme, for a faster response rate and real-time analysis. In this paper, we concentrate on the nonadaptive version of the problem, i.e., we run the labeling algorithm after all the query answers have been obtained.

Budgeted crowdsourcing problems can be viewed quite straightforwardly as a canonical source coding or source-channel coding problem of information theory (e.g., see the recent paper [14]). A main contribution of our paper is to view this as a locally encodable source coding problem: a data compression problem where each compressed bit depends only on a constant number of input bits. The notion of locally encodable source coding is not well studied even within the information theory community, and to the best of our knowledge the only place where it is mentioned is [24], although the focus of that paper is a related notion of smooth encoding. Another related notion, local decoding, seems to be much better studied [19, 18, 16, 27, 5, 26, 4, 33].

By posing the querying problem as such, we can get a number of information theoretic lower bounds on the number of queries required to recover the correct labeling. We also provide nonadaptive schemes that are near optimal. Another of our main contributions is to show that even among queries with binary answers, ‘same cluster’ queries (or XOR queries) may not be the best possible choice. A smaller number of queries can be achieved for approximate recovery by using what we call an AND query. Among our settings, we also consider the case when the oracle gives incorrect answers with some probability. A simple scheme to reduce errors in this case could be to take a majority vote after asking the same question to multiple different crowd workers. However, often that is not sufficient. Experiments on several real datasets (see [21]) with answers collected from Amazon Mechanical Turk [9, 30] show that majority voting could even increase the errors. Interestingly, such an observation has been made by a recent paper as well [28, Figure 1]. The probability of error of a query answer may also be thought of as that of the aggregated answer after repeating the query several times. Once the answer has been aggregated, it cannot change, and thus it reduces to the model where repeating the same question multiple times is not allowed. On the other hand, it is usually assumed that the answers to different queries are independently erroneous (see [10]). Therefore we consider the case where repeating the same query multiple times is not allowed; however, different queries can result in erroneous answers independently. (Footnote 2: Independent repetition of queries is also theoretically not interesting, as by repeating any query a small number of times one can reduce the probability of error to near zero.)

In this case, the best known algorithms need queries to perform the clustering with two clusters [21]. We show that by employing our AND querying method -proportion of all labels in the label vector will be recovered with only queries.

Along the way, we also provide new information theoretic results on fundamental limits of locally encodable source coding. While the related notions of locally decodable source codes [19, 16, 27, 5] and smooth compression [24, 27] have been studied, there was no nontrivial result known for locally encodable codes in general. Although the focus of this paper is primarily theoretical, we also perform a real crowdsourcing experiment to validate our algorithm.

## 2 Problem Setup and Information Theoretic View

For elements, consider a label vector , where , the th entry of , is the label of the th element and can take one of possible values. Suppose and ’s are independent. In other words, the prior distribution of the labels is given by the vector . For the special case of , we denote and . While we emphasize the case of , our results extend to the case of larger , as will be mentioned.

A query is a deterministic function that takes as argument labels, , and outputs a binary answer. While the query answer need not be binary, we restrict ourselves mostly to this case because it is the most practical choice.

Suppose a total of queries are made and the query answers are given by . The objective is to reconstruct the label vector from , such that the number of queries is minimized.

We assume that our recovery algorithms know the prior distribution of the labels. This prior distribution, or the relative sizes of the clusters, is usually easy to estimate by subsampling a small subset of elements and performing a complete clustering within that set (by, say, all pairwise queries). In many prior works, especially in the recovery algorithms of popular statistical models such as the stochastic block model, it is assumed that the relative sizes of the clusters are known (see [1]).

We also consider the setting where query answers may be erroneous with some probability of error. For crowdsourcing applications, this is a valid assumption, since even expert labelers often make errors. To model this we assume each entry of is flipped independently with some probability . Such an independence assumption has been used many times previously to model errors in crowdsourcing systems (see, e.g., [10]). While this may not be the perfect model, we do not allow a single query to be repeated multiple times in our algorithms (see the Introduction for a justification). For the analysis of our algorithm we just need to assume that the answers to different queries are independent. While we analyze our algorithms under these assumptions for theoretical guarantees, it turns out that even in real crowdsourcing situations our algorithmic results mostly follow the theoretical results, giving further validation to the model.

For the case, and when (perfect oracle), it is easy to see that queries are sufficient for the task. One simply compares every element with the first element. This does not extend to the case when : for perfect recovery, and without any prior, one must make queries in this case. When (erroneous oracle), it has been shown that a total number of queries are sufficient [21], where is the ratio of the sizes of the largest and smallest clusters.

##### Information theoretic view.

The problem of learning a label vector from queries is very similar to the canonical source coding (data compression) problem from information theory. In the source coding problem, a (possibly random) vector is ‘encoded’ into a usually shorter binary vector called the compressed vector . (Footnote 3: The compressed vector is not necessarily binary, nor is it necessarily shorter.) The decoding task is to again obtain from the compressed vector . It is known that if each entry of is independently distributed according to , then is both necessary and sufficient to recover with high probability, where is the entropy of .

We can cast our problem in this setting naturally, where entries of are just answers to queries made on . The main difference is that in source coding each may potentially depend on all the entries of ; while in the case of label learning, each may depend on only of the s.

We call this locally encodable source coding. This terminology is analogous to the recently developed literature on locally decodable source coding [19, 16]. It is called locally encodable because each compressed bit depends only on of the source (input) bits. For locally decodable source coding, each bit of the reconstructed sequence depends on at most a prescribed constant number of bits from the compressed sequence. Another closely related notion is that of ‘smooth compression’, where each source bit contributes to at most compressed bits [24]. Indeed, the notion of locally encodable source coding is also present in [24], where it is called robust compression. We provide new information theoretic lower bounds on the number of queries required to reconstruct exactly and approximately for our problem.

For the case when there are only two labels, the ‘same cluster’ query is equivalent to a Boolean XOR operation between the labels. There are some inherent limitations to these functions that prevent ‘same cluster’ queries from achieving the best possible number of queries for the ‘approximate’ recovery of labeling problem. We use an old result by Massey [17] to establish this limitation. We show that by instead using an operation like Boolean AND, a much smaller number of queries suffices to recover most of the labels correctly.

We also consider the case when the oracle gives faulty answers, or is corrupted by some noise (the binary symmetric channel). This setting is analogous to the problem of joint source-channel coding. However, just like before, each encoded bit must depend on at most bits. We show that for the approximate recovery problem, AND queries again perform substantially well. In a real crowdsourcing experiment, we have seen that if crowd workers are given the same set of pairs and asked both ‘same cluster’ queries and AND queries, the error rate of AND queries is lower. The reason is that for a correct ‘no’ answer in an AND query, a worker just needs to know one of the labels in the pair. For a ‘same cluster’ query, both labels must be known to the worker for any correct answer.

There are multiple reasons why one would ask for a ‘combination’ or function of multiple labels from a worker instead of just asking for a label itself (a ‘label-query’). Note that, asking for labels will never let us recover the clusters in less than queries, whereas, as we will see, the queries that combine labels will. Also in case of erroneous answer with AND queries or ‘same cluster’ queries, we have the option of not repeating a query, and still reduce errors. No such option is available with direct label-queries.

###### Remark 1.

Note that, using only ‘same cluster’ queries, at best the complete clustering can be recovered, and it is not possible to recover the labeling unless some other information is also available. Indeed, if the prior is known, and , then by looking at the complete clustering it is possible to figure out the labeling (or to correctly assign the clusters their respective labels), since the label vector must belong to the ‘typical set’ of . This implies, for the case of two clusters, if then the labeling can be inferred with high probability from the clustering.

##### Contributions.

In summary our contributions can be listed as follows.

1. Noiseless queries and exact recovery (Sec. 3.1): For two labels/clusters, we provide a querying scheme that asks nonadaptive pairwise ‘same cluster’ queries, and recovers the labels with high probability, for a range of prior probabilities. We also show that this result is order-wise optimal. If instead we involve $\Delta$ elements in each of the queries, then with number of nonadaptive XOR queries we recover all labels with high probability, for a range of prior probabilities. We also provide a new lower bound that is strictly better than for some .

2. Noiseless queries and approximate recovery (Sec. 3.2): We provide a new lower bound on the number of queries required to recover fraction of the labels . We also show that ‘same cluster’ queries are insufficient, and propose a new querying strategy based on the AND operation that performs substantially better.

3. Noisy queries and approximate recovery (Sec. 3.3): For this part we assume the query answer to be -ary () where is the number of clusters. This section contains the main algorithmic result, which uses AND queries as the main primitive. We show that, even in the presence of noise in the query answers, it is possible to recover proportion of all labels correctly with only nonadaptive queries. We validate this theoretical result in a real crowdsourcing experiment in Sec. 4.

## 3 Main results and techniques

### 3.1 Noiseless queries and exact recovery

In this scenario we assume the query answer from the oracle to be perfect and we wish to get back all of the original labels exactly without any error. Each query is allowed to take only $\Delta$ labels as input. When $\Delta = 2$, we are allowed to ask only pairwise queries. Let us consider the case when there are only two labels. That means the labels are i.i.d. Bernoulli($p$) random variables. Therefore the number of queries that are necessary and sufficient to recover all the labels with high probability is approximately $nH(p)$, where $H(\cdot)$ is the binary entropy function. However, the sufficiency part here does not take into account that each query can involve only $\Delta$ labels.
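As a rough numerical companion to the entropy baseline above, the following Python sketch evaluates the binary entropy and the resulting query count for unrestricted queries. The function names `binary_entropy` and `entropy_query_baseline` are illustrative and not taken from the paper, and the baseline ignores the locality constraint on queries.

```python
import math


def binary_entropy(p: float) -> float:
    """Binary entropy in bits; defined as 0 at p = 0 and p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def entropy_query_baseline(n: int, p: float) -> float:
    """Approximate number of binary answers needed when a query may
    depend on all n labels (hypothetical helper, no locality constraint)."""
    return n * binary_entropy(p)
```

For a balanced prior the baseline equals the number of elements, and it shrinks as the prior becomes more skewed, which is the slack the later querying schemes try to exploit.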

#### 3.1.1 Same cluster queries (Δ=2)

We warm up with the case of only two labels (clusters) and same cluster queries. For two labels, a same cluster query simply amounts to the modulo 2 sum of the two label values. It is easy to see that querying the first label with every other label allows us to infer the clustering of the labels, since we can simply group the labels that are the same as the first label separately from the labels that are different from it. As mentioned in Remark 1, if $p \neq 1/2$, then just by counting the sizes of the groups, it is possible to get the labels correctly as well (among the two possible labelings, given the clustering). The query complexity in this case is $n-1$, but the following scheme uses fewer queries by building on the aforementioned idea.

##### Querying scheme.

Our scheme works in the following manner. First, we partition the elements into disjoint and equal sized groups each containing elements (assume ). For each group, we query the first element of the group with every other element of the group. Now, for each group we can cluster the labels and we will identify

• the smaller cluster with the label 1 and the larger cluster with the label 0, when

• the smaller cluster with the label 0 and the larger cluster with the label 1, when .

Going forward, let us just assume without any loss of generality. Finally we can aggregate all the elements identified with the label 1 and with label 0 and return them as our final output. The total number of queries required in this scheme is .
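The grouping scheme above can be sketched in Python as follows. This is only an illustration under stated assumptions: the oracle is noiseless, `group_size` is a free parameter (the value used in the analysis is not reproduced here), and the names `same_cluster` and `label_by_groups` are hypothetical.

```python
def same_cluster(x, i, j):
    """Noiseless 'same cluster' oracle: 1 if the two labels agree, else 0."""
    return 1 if x[i] == x[j] else 0


def label_by_groups(x, group_size, minority_label=1):
    """Label binary elements using only in-group same-cluster queries.

    Each group's first element is queried against every other element of
    the group; the smaller side is assigned minority_label.
    Returns (estimated labels, number of queries asked).
    """
    n = len(x)
    est = [0] * n
    queries = 0
    for start in range(0, n, group_size):
        group = list(range(start, min(start + group_size, n)))
        first = group[0]
        same, diff = [first], []
        for j in group[1:]:
            queries += 1
            (same if same_cluster(x, first, j) else diff).append(j)
        # Ties are broken arbitrarily in favour of 'same' being the larger side.
        smaller, larger = (same, diff) if len(same) < len(diff) else (diff, same)
        for j in smaller:
            est[j] = minority_label
        for j in larger:
            est[j] = 1 - minority_label
    return est, queries
```

With 8 elements in two groups of 4, the sketch asks 6 queries, one fewer per group than comparing everything against a single reference element. A group whose true majority carries the minority label is labeled incorrectly, which is the failure event the analysis bounds.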

###### Theorem 1.

For the querying scheme described above, same cluster queries are sufficient to recover the label vector with probability at least where is the Kullback-Leibler Divergence between two Bernoulli distributed random variables with parameters and respectively.

###### Proof.

Let us set We omit the use of ceilings and floor to maintain clarity. Within a group of elements, we are able to obtain the complete clustering. We will fail to obtain the labeling only if the larger cluster has the true label 1. However this happens with probability at most [6]. Since there are groups, the probability that we fail to recover the labels in any one of the groups is at most . The total number of queries is . ∎

Now we show a matching lower bound, which proves that the reduction in query complexity achieved by the scheme described above is tight up to constant factors. In particular, we prove the following theorem.

###### Theorem 2.

Consider the binary labeling problem with pairwise queries. If the number of ‘same cluster’ queries is less than for any positive constant , then any querying scheme will fail to recover the labels of all elements with positive probability.

To prove this theorem we will need the following lemma.

###### Lemma 1 (Theorem 11.1.4 in [6]).

Consider a vector whose elements are i.i.d random variables sampled according to , . The probability that more than are 1 is at least .

###### Proof of Theorem 2.

Suppose a querying scheme uses pairwise ‘same cluster’ queries. Consider a graph with vertices corresponding to the querying scheme. The vertices are labeled , and the edge exists if and only if is a query. Since the number of edges in the graph is , the graph has at least components. Therefore the average size of a component in the graph is . This also implies that there exist at least components with size at most each (using Markov's inequality).

Consider the elements corresponding to vertices of one such component. Even if all the possible ‘same cluster’ queries were made within this group, we will still have two possible labelings that would be consistent with all possible query answers (for every assignment of labels, a new assignment can be created by flipping all the labels). Since this set of elements are not queried with any element outside of it, this will give rise to different possibilities. This situation can only be mitigated if we can turn the clustering into labeling. Since there are two possible labelings within a group, we will make a mistake in labeling only when the number of elements with label 1 is larger than the number of elements with label 0 (the maximum likelihood decoding will fail).

Now, using Lemma 1, within a component of size at most , the probability that the number of elements with label 1 is less than the number of elements with label 0 is at most

$$1-\frac{2^{-2nD(\frac{1}{2}\|p)/a}}{(2n/a+1)^{2}}.$$

And the probability that this happens for all such components is at most

$$\left(1-\frac{2^{-2nD(\frac{1}{2}\|p)/a}}{(2n/a+1)^{2}}\right)^{a/2}\le\exp\left(-\frac{a\,2^{-2nD(\frac{1}{2}\|p)/a}}{2(2n/a+1)^{2}}\right)=o(1),$$

if we substitute for any positive constant .

The main takeaway from this part is that, by exploiting the prior probabilities (or relative cluster sizes), it is possible to infer the labels with strictly less than ‘same cluster’ queries. However, to make the reduction we need to look at either a different type of querying, or involve more than two elements in a query.

#### 3.1.2 XOR queries for larger Δ

Querying scheme: We use the following type of queries. For each query, the labels of $\Delta$ elements are given to the oracle, and the oracle returns the XOR of the labels. Note that for $\Delta = 2$, our queries are just ‘same cluster’ queries. Let us define $Q$ to be the binary query matrix of dimension $m \times n$ where each row has at most $\Delta$ ones, the other entries being zero. Now for a label vector $X$ we can represent the set of query outputs by $QX$. To fulfill our objective, we will define a random ensemble of query matrices from which $Q$ is sampled, and subsequently we will show that the average probability of error goes to zero asymptotically with $n$, which implies the existence of a good matrix in the ensemble.

Random ensemble: The random ensemble will be defined in terms of a bipartite graph. This is done by constructing a biregular bipartite graph , with . Here called the left vertices, represents the labels; and the right vertices, represents the query. The degree of each left vertex is and the degree of each right vertex is . We have . Now, a permutation is randomly sampled from the set of all permutation on , the symmetric group . If we fix an ordering of the edges emanating from vertices in the left and right, then th edge will be joined with the th edge on the right.
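A minimal Python sketch of sampling from such a socket-permutation ensemble is given below. The argument names are assumptions made for illustration: `n` labels, `m` queries, left degree `c` and right degree `delta`, with `n * c == m * delta`. Parallel edges, which the random matching can produce, are reduced modulo 2 here; that is one possible convention rather than something the text above specifies.

```python
import random


def sample_query_matrix(n, m, c, delta, rng):
    """Sample an m x n binary query matrix by randomly matching
    n * c left sockets to m * delta right sockets."""
    if n * c != m * delta:
        raise ValueError("need n * c == m * delta")
    # Left socket e belongs to label e // c.
    left_sockets = [v for v in range(n) for _ in range(c)]
    perm = list(range(n * c))
    rng.shuffle(perm)  # uniformly random permutation of the sockets
    Q = [[0] * n for _ in range(m)]
    for e, label in enumerate(left_sockets):
        row = perm[e] // delta  # right socket perm[e] belongs to this query
        Q[row][label] ^= 1      # assumed convention: parallel edges cancel mod 2
    return Q


def xor_answers(Q, x):
    """Noiseless XOR query answers, i.e. the product Q x over GF(2)."""
    return [sum(q * xi for q, xi in zip(row, x)) % 2 for row in Q]
```

For example, `sample_query_matrix(6, 4, 2, 3, random.Random(0))` returns a 4 by 6 matrix whose rows each involve at most 3 labels, and `xor_answers` applied to it gives the answer vector the decoder would see.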

Decoder: In this setting we will be concerned with the exact recovery of the labels. The decoder will look at the vector and return a vector such that the Hamming weight (number of nonzero entries) of the vector is given by and . Hence, the probability of error can be defined as

$$P_e\equiv\sum_{X\in\{0,1\}^{n}}\Pr(X)\,\Pr_{Q\sim\mathcal{Q}}\big(\Psi(QX)\neq X\big).$$

We have the following theorem

###### Theorem 3.

Consider an query matrix sampled from the ensemble of matrices described above, and a label vector with each entry being i.i.d. . Let be the left and the right degrees of the ensemble with and let . If the number of queries

$$m>n\max_{\beta\le x\le 2p}\frac{pH\left(\frac{x}{2p}\right)+(1-p)H\left(\frac{x}{2(1-p)}\right)}{1-\log\left(1+(1-2x)^{\Delta}\right)},$$

then the average probability of error goes to zero as

$$P_e\le n^{2-c}(1-p+p^{2})(\Delta c)^{c}(1+o(1)).$$

The proof of this theorem is deferred to Section 5. The same ensemble was used by [23], where the authors showed the existence of linear codes achieving zero probability of error over the binary symmetric channel such that the parity check matrix of the code belonged to the ensemble . Because of the duality of source-channel coding, their guarantees on the average probability of error for directly translate to our setting as well. However, our analysis is slightly different from [23] and it is tighter in most cases. We have deferred the detailed comparison of the two analyses to Section 5. The achievability result is depicted in Figure 1.

#### 3.1.3 Lower bounds (converse)

Now we provide some necessary conditions on the number of queries, involving elements at most, required for a full recovery of labels. First of all notice that, if a query involves at most elements, then queries are necessary for exact recovery. If a particular label is not present in any query, then the decoder has no other choice but to guess the label. This will lead to a constant probability of error . Therefore at least queries are necessary so that every label is present in at least one query.

Adapting Gallager’s result for low density parity-check matrix codes for our setting (using source-channel duality for linear codes) we can have the following lower bound on the number of queries.

###### Theorem 4 ([8]).

Assume a label vector with each entry being i.i.d. . If the number of XOR queries, each involving at most labels, is less than , then the probability of error in recovery of labels is bounded from below by a constant independent of the number of elements .

However Gallager’s result is valid for only XOR queries. We can provide a lower bound that is close to Gallager’s bound, and holds for any type of query function.

###### Theorem 5.

Assume a label vector with each entry being i.i.d. . The minimum number of queries, each involving at most elements, necessary to recover all labels with high probability is at least by where

###### Proof.

Every query involves at most elements. Therefore the average number of queries an element is part of is . Therefore fraction of all the elements (say the set ) are part of less than queries. Now consider the set . Consider all typical label vectors such that their projection on is a fixed typical sequence. We know that there are such sequences. Let be one of these sequences. Now, almost all sequences of must have a distance of from . Let be the corresponding query outputs when is the input. Now any query output for input belonging to must reside in a Hamming ball of radius from . Therefore, comparing the volume of the balls, we must have

Finally, in Figure 1, we have compared the achievability scheme (Theorem 3) and the lower bounds presented above for . For larger the bounds are even closer.

### 3.2 Noiseless queries and approximate recovery

Again let us consider the case when , i.e., only two possible labels. Let be the label vector. Suppose we have a querying algorithm that, by using queries, recovers a label vector .

###### Definition.

We call a querying algorithm to be -good if for any label vector, at least labels are correctly recovered. This means for any label-recovered label pair , the Hamming distance is at most . For an almost equivalent definition, we can define a distortion function , for any two labels . We can see that , which we want to be bounded by .

Using standard rate-distortion theory [6], it can be seen that, if the queries could involve an arbitrary number of elements then with queries it is possible to have a -good scheme where . When each query is allowed to take only at most inputs, we have the following lower bound for -good querying algorithms.

###### Theorem 6.

In any -good querying scheme with queries where each query can take as input elements, the following must be satisfied (below ):

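$$\delta\ge\tilde{\delta}\left(\frac{m}{n}\right)+\frac{H(p)-H\left(\tilde{\delta}\left(\frac{m}{n}\right)\right)}{h'\left(\tilde{\delta}\left(\frac{m}{n}\right)\right)\left(1+e^{\Delta h'\left(\tilde{\delta}\left(\frac{m}{n}\right)\right)}\right)}.$$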

The proof of this theorem is somewhat involved, and we have provided it in Section 6.

One of the main observations that we make is that ‘same cluster’ queries are highly inefficient for approximate recovery. This follows from a classical result of Ancheta and Massey [17] on the limitation of linear codes as rate-distortion codes. Recall that ‘same cluster’ queries are equivalent to the XOR operation in the binary field, which is a linear operation on . We rephrase a conjecture by Massey in our terminology.

###### Conjecture 1 (‘same cluster’ query lower bound).

For any -good scheme with ‘same cluster’ queries (), the following must be satisfied:

This conjecture is known to be true at the point (equal sized clusters). We have plotted these two lower bounds in Figure 3.

Now let us provide a querying scheme with that will provably be better than ‘same cluster’ schemes.

Querying scheme: AND queries: We define the AND query as where so that only when both the elements have labels equal to . For each pairwise query the oracle will return this AND operation of the labels.

###### Theorem 7.

There exists a -good querying scheme with pairwise AND queries such that

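$$\delta=pe^{-\frac{2m}{n}}+\sum_{d=1}^{n}\frac{e^{-\frac{2m}{n}}\left(\frac{2m}{n}\right)^{d}}{d!}\sum_{k=1}^{d}\binom{n}{k}\frac{f(k,d)}{n^{d}}(1-p)^{k}p$$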

where

###### Proof.

Assume without loss of generality. Consider a random bipartite graph where each ‘left’ node represents an element labeled according to the label vector and each ‘right’ node represents a query. All the query answers are collected in . The graph has right-degree exactly equal to 2. For each query the two inputs are selected uniformly at random without replacement.

Recovery algorithm: For each element we look at the queries that involve it and estimate its label as 1 if any of the query answers is 1, and predict 0 otherwise. If there are no queries that involve the element, we simply output 0 as the label.

Since the average left-degree is $2m/n$ and since all the edges from the right nodes are randomly and independently thrown, we can model the degree of each left-vertex by a Poisson distribution with mean $2m/n$. We define element to be a two-hop-neighbor of if there is at least one query which involved both the elements and . Under our decoding scheme we only have an error when the label of , and the labels of all its two-hop-neighbors are . Hence the probability of error for estimating can be written as, Now let us estimate . We further condition the error on the event that there are distinct two-hop-neighbors (let us call the number of distinct neighbors of as ) and hence we have that . Now using the Poisson assumption we get the statement of the theorem. ∎
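The random pairwise AND scheme and its decoder can be simulated with a short Python sketch. It assumes binary labels, a noiseless oracle, and queries drawn as uniformly random pairs of distinct elements; the names `and_query_recovery` and `error_fraction` are illustrative only.

```python
import random


def and_query_recovery(x, m, rng):
    """Ask m random pairwise AND queries on binary labels x and decode:
    an element is labeled 1 iff some query involving it answered 1."""
    n = len(x)
    est = [0] * n  # elements with no positive answer default to label 0
    for _ in range(m):
        i, j = rng.sample(range(n), 2)  # two distinct elements, uniformly at random
        if x[i] & x[j]:                 # noiseless AND answer
            est[i] = 1
            est[j] = 1
    return est


def error_fraction(x, est):
    """Fraction of positions where the estimate disagrees with the truth."""
    return sum(a != b for a, b in zip(x, est)) / len(x)
```

Because a positive AND answer certifies both labels, this decoder never labels a true 0 as 1; all of its errors are elements with label 1 whose queried partners all had label 0, or that were never queried.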

The performance of this querying scheme is plotted against the number of queries for prior probabilities in Figure 3.

Comparison with ‘same cluster’ queries: We see in Figure 3 that the AND query scheme beats the ‘same cluster’ query lower bound for a range of query-performance trade-off in approximate recovery for . For smaller , this range of values of increases further. If we randomly choose ‘same cluster’ queries and then resort to maximum likelihood decoding (note that, for AND queries, we present a simple decoding) then queries are still required even if we allow for proportion of incorrect labels (follows from [11]). The best performance for ‘same cluster’ query in approximate recovery that we know of for small is given by: (neglect elements and just query the remaining elements with the first element). However, such a scheme can be achieved by AND queries as well in a similar manner. Therefore, there is no point in the query vs plot that we know of where ‘same cluster’ query achievability outperforms AND query achievability.

### 3.3 Noisy queries and approximate recovery

This section contains our main algorithmic contribution. In contrast to the previous sections, here we consider the general case of clusters. Recall that the label vector , and the prior probability of each label is given by the probabilities . Instead of binary output queries, in this part we consider an oracle that can provide one of different answers. We consider a model of noise in the query answer where the oracle provides the correct answer with probability , and any one of the remaining incorrect answers with probability . Note that we do not allow the same query to be asked to the oracle multiple times (see Sec. 2 for justification). We also define a -good approximation scheme exactly as before.

Querying Scheme: We only perform pairwise queries. For a pair of labels and we define a query . For our algorithm we define the as

$$
Q(X, X') = \begin{cases} i & \text{if } X = X' = i, \\ 0 & \text{otherwise.} \end{cases}
$$

We can observe that for , this query is exactly the same as the binary AND query that we defined in the previous section. In our querying scheme, we make a total of $nd/2$ queries, for an integer $d$. We design a $d$-regular graph whose vertex set is the set of elements that we need to label, and we query all the pairs of elements that are joined by an edge.
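As a rough illustration of the pairwise query and the noise model described above, the following Python sketch may help. The names `pair_query` and `noisy_oracle`, the error-probability parameter `q`, the explicit answer alphabet `answers`, and the choice of a uniformly random wrong answer are our assumptions for the sketch, not definitions from the paper.

```python
import random
from typing import Sequence


def pair_query(x: int, y: int) -> int:
    """Q(X, X') = i if X = X' = i, and 0 otherwise."""
    return x if x == y else 0


def noisy_oracle(x: int, y: int, answers: Sequence[int],
                 q: float, rng: random.Random) -> int:
    """Return the true answer with probability 1 - q.

    With probability q, return one of the remaining answers in `answers`,
    chosen uniformly at random (an assumption of this sketch).
    """
    truth = pair_query(x, y)
    if rng.random() >= q:
        return truth
    wrong = [a for a in answers if a != truth]
    return rng.choice(wrong)


# With binary labels {0, 1} the query is the AND of the two labels.
rng = random.Random(0)
binary_and_table = [pair_query(a, b) for a in (0, 1) for b in (0, 1)]
sample = noisy_oracle(1, 1, answers=[0, 1], q=0.1, rng=rng)
```

Passing the answer alphabet explicitly keeps the sketch independent of how the labels are numbered.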

Under this querying scheme, we propose to use Algorithm 1 for reconstructions of labels.

###### Theorem 8.

The querying scheme with queries and Algorithm 1 is -good for approximate recovery of labels from noisy queries.

We can come up with a more exact relation between the number of queries , and . This is deferred to Section 7.

###### Proof of Theorem 8.

The total number of queries is . Now for a particular element , we look at the values of noisy oracle answers . We have, when the true label of is . When the true label is something else, . There is an obvious gap between these expectations. Clearly when the true label is , the probability of error in assignment of the label of is given by, for some constants and depending on the gap, from Chernoff bound. And when the true label is , the probability of error is , for some constants . Let , we can easily observe that scales as . Hence the total number of queries is .

The only thing that remains to be proved is that the number of incorrect labels is with high probability. Let be the event that element has been incorrectly labeled. Then . The total number of incorrectly labeled elements is . We have . Now define if and are dependent. Now because the maximum number of nodes dependent with are its -hop and -hop neighbors. Now using Corollary 4.3.5 in [2], it is evident that almost always. ∎

The theoretical performance guarantee of Algorithm 1 (a detailed version of Theorem 8 is in the supplementary material) for is shown in Figures 3 and 5. We can observe from Figure 3 that for a particular , incorrect labeling decreases as becomes higher. We can also observe from Figure 5 that if then the incorrect labeling is 50% because the complete information from the oracle is lost. For other values of , we can see that the incorrect labeling decreases with increasing .

We point out that ‘same cluster’ queries are not a good choice here, because of the symmetric nature of XOR due to which there is no gap between the expected numbers (contrary to the proof of Theorem 8) which we had exploited in the algorithm to a large extent.

Lastly, we show that Algorithm 1 can work without knowing the prior distribution and only with the knowledge of relative sizes of the clusters. The ground truth clusters can be adversarial as long as they maintain the relative sizes.

###### Theorem 9.

Suppose we have , the number of elements with label , , as input instead of the priors. By taking a random permutation over the nodes while constructing the d-regular graph, Algorithm 1 will be -good approximation with queries as when we set .

###### Proof.

Let us consider . The proof can be extended for general . We generate the $d$-regular graph on $n$ nodes in the following way:

1. If $d$ is even, put all the vertices around a circle, and join each to its $d/2$ nearest neighbors on either side.

2. If $d$ is odd and $n$ is even, put the vertices on a circle, join each to its $(d-1)/2$ nearest neighbors on each side, and also to the vertex directly opposite.
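The two construction rules above can be sketched as follows. This is a minimal illustration, assuming vertices are numbered 0 to n-1 and that the random permutation is applied by shuffling the order in which vertices are placed on the circle; the function name and signature are ours.

```python
import random
from typing import List, Optional, Tuple


def circulant_regular_graph(n: int, d: int,
                            rng: Optional[random.Random] = None
                            ) -> List[Tuple[int, int]]:
    """Edge list of a d-regular graph on n vertices placed on a circle.

    If `rng` is given, the vertices are placed on the circle in a random
    order; otherwise they are placed in the order 0, 1, ..., n-1.
    """
    if not 0 < d < n:
        raise ValueError("need 0 < d < n")
    if d % 2 == 1 and n % 2 == 1:
        raise ValueError("no d-regular graph exists when d and n are both odd")
    order = list(range(n))
    if rng is not None:
        rng.shuffle(order)
    edges = set()
    for pos in range(n):
        u = order[pos]
        # Join to the floor(d/2) nearest neighbors on one side; the other
        # side is covered when those neighbors are processed.
        for step in range(1, d // 2 + 1):
            v = order[(pos + step) % n]
            edges.add((min(u, v), max(u, v)))
        # For odd d (and even n), also join to the opposite vertex.
        if d % 2 == 1:
            v = order[(pos + n // 2) % n]
            edges.add((min(u, v), max(u, v)))
    return sorted(edges)


edges = circulant_regular_graph(200, 10, random.Random(0))
```

Each edge of the returned list corresponds to one pairwise query, so a graph with n vertices and degree d yields nd/2 queries.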

Now suppose we are given $n_0$ and $n_1$. Let us consider all random permutations of these sets of points on a circle. Fixing a node of label 1 (say) as a reference point turns this into a random permutation on a line. The probability that exactly $k$ of the $d$ neighbors of this node have label 1 is

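$$
\frac{\binom{d}{k}\binom{n-d-1}{n_1-k-1}}{\binom{n-1}{n_1-1}} \approx \binom{d}{k}\,\frac{\prod_{i=0}^{k-1}(n_1-i)\,\prod_{i=0}^{d-k-1}(n_0-i)}{\prod_{i=0}^{d-1}(n-i)} \le \binom{d}{k}\left(\frac{n_1}{n}\right)^{k}\left(\frac{n_0}{n-k}\right)^{d-k}.
$$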

All the rest of the conditional probabilities used in the analysis in Appendix 7 stay the same. Now, is just a constant and hence

and hence asymptotically this distribution is equivalent to the binomial distribution. Hence we can simply set

and then use Algorithm 1. Our final probability of incorrect labeling is going to be . Thus for large its behavior is exactly the same as when using priors. ∎
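The claim that the neighbor-label distribution approaches a binomial distribution can be checked numerically. The sketch below compares the exact expression from the proof with a binomial probability mass function whose success probability is the relative cluster size $n_1/n$; the function names and the chosen values of $n$, $n_1$ and $d$ are ours and serve only as an illustration.

```python
from math import comb


def neighbor_label_pmf(n: int, n1: int, d: int, k: int) -> float:
    """Exact probability that k of the d neighbors of a label-1 node
    also have label 1, under a uniformly random placement on the circle."""
    return comb(d, k) * comb(n - d - 1, n1 - k - 1) / comb(n - 1, n1 - 1)


def binomial_pmf(d: int, p: float, k: int) -> float:
    """Binomial(d, p) probability of exactly k successes."""
    return comb(d, k) * p ** k * (1 - p) ** (d - k)


def max_gap(n: int, n1: int, d: int) -> float:
    """Largest pointwise gap between the exact and the binomial pmf."""
    p = n1 / n
    return max(abs(neighbor_label_pmf(n, n1, d, k) - binomial_pmf(d, p, k))
               for k in range(d + 1))


gap_small = max_gap(200, 100, 10)      # size of the movie experiment
gap_large = max_gap(20000, 10000, 10)  # same proportions, larger n
```

With the degree held fixed, the gap shrinks as the number of elements grows, which is the asymptotic equivalence used in the proof.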

## 4 Experiments

Though our main contribution is theoretical, we have verified our work by running our algorithm on a real dataset created by local crowdsourcing. We first picked a list of 100 ‘action’ movies and 100 ‘romantic’ movies from IMDB (http://www.imdb.com/list/ls076503982/ and http://www.imdb.com/list/ls058479560/). We then created the queries as given in the querying scheme of Sec. 3.3 by creating a $d$-regular graph (where $d$ is even). To create the graph, we placed all the movies on a circle in a randomly permuted order. Then we connected each node to its $d/2$ closest neighbors on either side in the permuted circular list. This random permutation allows us to use the relative sizes of the clusters as priors, as explained in Sec. 3.3. Using $d = 10$, we have $nd/2 = 1000$ queries, each query being the following question: Are both the movies ‘action’ movies? We divided these 1000 queries into 10 surveys (using the SurveyMonkey platform), with each survey carrying 100 queries for the user to answer. We used 10 volunteers to fill out the surveys. We instructed them not to check any resources and to answer the questions spontaneously, and also gave them a time limit of 10 minutes. The average finish time of the surveys was 6 minutes. The answers represented the noisy query model since some of the answers were wrong. In total, we found 105 erroneous answers in those 1000 queries. For each movie, we evaluate the query answers it is part of and use different thresholds for prediction. That is, if the number of ‘yes’ answers among those $d$ answers exceeds the threshold, we classify the movie as an ‘action’ movie, and as a ‘romantic’ movie otherwise. The theoretical threshold for predicting an ‘action’ movie is

for oracle error probability and . But we compared other thresholds as well. We then used Algorithm 1 to predict the true label vector from a subset of queries by taking edges for each node where and is even, i.e. . Obviously, for , the threshold is meaningless as we always estimate the movie as ‘romantic’, and hence the distortion starts from in that case. We plotted the error for each case against the number of queries () and also plotted the theoretical distortion obtained from our results for labels and . We compare these results along with the theoretical distortion that we should have for . All these results have been compiled in Figure 5, and we can observe that the distortion decreases with the number of queries and that the gap between the theoretical and experimental results is small for . These results validate our theoretical results and our algorithm to a large extent.
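A synthetic analogue of this experiment can be sketched as below. It is not the survey data and not the exact Algorithm 1: the decoder is a plain count-threshold rule, the `threshold` value is a free parameter rather than the theoretical threshold, and the oracle flips each AND answer independently with probability `q` (here set to the empirical rate 105/1000 from the survey). All function and parameter names are ours.

```python
import random
from typing import Tuple


def simulate_and_queries(n1: int, n0: int, d: int, q: float,
                         threshold: int, seed: int = 0) -> Tuple[int, float]:
    """Simulate noisy AND queries on a randomly permuted circulant graph.

    Returns (number of queries, fraction of incorrectly labeled elements).
    Assumes d is even, as in the experiment.
    """
    rng = random.Random(seed)
    n = n1 + n0
    labels = [1] * n1 + [0] * n0      # 1 = 'action', 0 = 'romantic'
    order = list(range(n))
    rng.shuffle(order)                # random placement on the circle
    yes_count = [0] * n
    num_queries = 0
    for pos in range(n):
        u = order[pos]
        for step in range(1, d // 2 + 1):
            v = order[(pos + step) % n]
            answer = labels[u] & labels[v]   # true AND answer
            if rng.random() < q:             # oracle error
                answer ^= 1
            yes_count[u] += answer
            yes_count[v] += answer
            num_queries += 1
    # Predict 'action' when the number of 'yes' answers exceeds the threshold.
    estimate = [int(c > threshold) for c in yes_count]
    errors = sum(e != t for e, t in zip(estimate, labels))
    return num_queries, errors / n


queries, distortion = simulate_and_queries(100, 100, d=10, q=0.105, threshold=2)
```

Sweeping `threshold` and `d` in such a simulation gives a quick way to reproduce the qualitative shape of the distortion curves before collecting real answers.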

We also asked the participants ‘same cluster’ queries on the same set of 1000 pairs and found the number of erroneous responses to be (whereas with AND queries it was 105). This substantiates the claim that AND queries are easier for workers to answer. Since this number of errors is too high, we have compared the performance of ‘same cluster’ queries with AND queries and our algorithm on a synthetically generated dataset with two hundred elements (Figure 7). For recovery with ‘same cluster’ queries, we have used the popular spectral clustering algorithm with normalized cuts [25]. The detailed results obtained can be found in Figure 7 below.

## 5 Proof of Theorem 3 and comparison with [23]

We have assumed that the label vector consists of i.i.d. Bernoulli random variables with . We will define the typical set where denotes the Hamming weight of the argument. We know that (see [6]) , with as . For a fixed vector in the typical set, let us denote as the number of vectors in the typical set that are at a distance from . Let us denote the weight of the vector by where . In that case, it is easy to see that . (This is for the case when is even. For odd , we will have .) Let be defined as which is independent of the vector and suppose that the weight of the vector which maximizes is for some . For two binary vectors we will denote to be the Hamming distance between the two vectors. Now, we can rewrite the probability of error as below:

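$$
\begin{aligned}
P_e &= \sum_{X \in \mathrm{Typ}(p)} \Pr(X) \Pr_{Q \sim \mathcal{Q}}\big(\Psi(QX) \ne X\big) + \sum_{X \notin \mathrm{Typ}(p)} \Pr(X) \Pr_{Q \sim \mathcal{Q}}\big(\Psi(QX) \ne X\big) \\
&\le \sum_{X \in \mathrm{Typ}(p)} \Pr(X) \Pr_{Q \sim \mathcal{Q}}\big(\Psi(QX) \ne X\big) + \sum_{X \notin \mathrm{Typ}(p)} \Pr(X) \\
&= \sum_{X \in \mathrm{Typ}(p)} \Pr(X) \Pr_{Q \sim \mathcal{Q}}\big(\exists X' \in \mathrm{Typ}(p) \text{ such that } QX = QX'\big) + \epsilon_n \\
&\le \sum_{X \in \mathrm{Typ}(p)} \Pr(X) \sum_{t=1}^{2np+2n^{2/3}} \; \sum_{X' \in \mathrm{Typ}(p) \,\mid\, d_H(X,X') = t} \Pr_{Q \sim \mathcal{Q}}\big(QX = QX'\big) + \epsilon_n \\
&= \sum_{X \in \mathrm{Typ}(p)} \Pr(X) \sum_{t=1}^{2np+2n^{2/3}} A_{t,X} \Pr_{Q \sim \mathcal{Q}}\big(QZ = 0 \mid \mathrm{wt}(Z) = t\big) + \epsilon_n \\
&\le \sum_{t=1}^{2np+2n^{2/3}} A_t \Pr_{Q \sim \mathcal{Q}}\big(QZ = 0 \mid \mathrm{wt}(Z) = t\big) + \epsilon_n \\
&\le \sum_{\substack{t=1 \\ t \text{ odd}}}^{2np+2n^{2/3}} A_t \Pr_{Q \sim \mathcal{Q}}\big(QZ = 0 \mid \mathrm{wt}(Z) = t\big) + \sum_{\substack{t=2 \\ t \text{ even}}}^{2np+2n^{2/3}} A_t \Pr_{Q \sim \mathcal{Q}}\big(QZ = 0 \mid \mathrm{wt}(Z) = t\big) + \epsilon_n \\
&\le P_e^{\mathrm{odd}} + P_e^{\mathrm{even}} + \epsilon_n,
\end{aligned}
$$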

where the first two terms in the sum are termed $P_e^{\mathrm{odd}}$ and $P_e^{\mathrm{even}}$ respectively. Now, we will use the following lemma from [23]:

###### Lemma 2.

If is odd, then

otherwise, we will have the following two upper bounds:

Therefore, after substituting the values of , we can write:

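$$
\begin{aligned}
P_e^{\mathrm{even}} \le{} & \sum_{\substack{t=2 \\ t \text{ even}}}^{\beta n} \binom{m}{\frac{tc}{2}} \left(\frac{tc}{2m}\right)^{tc} \sum_{j} \binom{np+s^*}{\frac{t}{2}-j} \binom{n-np-s^*}{\frac{t}{2}+j} \\
& + \sum_{\substack{t=\beta n \\ t \text{ even}}}^{2np+2n^{2/3}} \frac{m\Delta+1}{2^{m}} \left(1+\left(1-\frac{2t}{n}\right)^{\Delta}\right)^{m} \sum_{j} \binom{np+s^*}{\frac{t}{2}+j} \binom{n-np-s^*}{\frac{t}{2}-j}
\end{aligned}
$$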

for some to be determined later. Let us define the function which takes a positive even number as input:

$$
f(x) = \sum_{j} \binom{np+s^*}{\frac{x}{2}-j} \binom{n-np-s^*}{\frac{x}{2}+j} \binom{m}{\frac{cx}{2}} \left(\frac{cx}{2m}\right)^{cx}.
$$

Notice that

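$$
\begin{aligned}
\frac{f(x+2)}{f(x)} &= \frac{\sum_{j} \binom{np+s^*}{\frac{x}{2}-j+1} \binom{n-np-s^*}{\frac{x}{2}+j+1} \binom{m}{\frac{cx}{2}+c} \left(\frac{cx+2c}{2m}\right)^{cx+2c}}{\sum_{j} \binom{np+s^*}{\frac{x}{2}-j} \binom{n-np-s^*}{\frac{x}{2}+j} \binom{m}{\frac{cx}{2}} \left(\frac{cx}{2m}\right)^{cx}} \\
&\le \frac{\binom{m}{\frac{cx}{2}+c} \left(\frac{cx+2c}{2m}\right)^{cx+2c}}{\binom{m}{\frac{cx}{2}} \left(\frac{cx}{2m}\right)^{cx}} \cdot \max_{j} \frac{\binom{np+s^*}{\frac{x}{2}-j+1} \binom{n-np-s^*}{\frac{x}{2}+j+1}}{\binom{np+s^*}{\frac{x}{2}-j} \binom{n-np-s^*}{\frac{x}{2}+j}} \\
&= \max_{j} \left( \frac{np+o(n)}{x/2-j+1} \cdot \frac{n-np-o(n)}{x/2+j+1} \right) \cdot \frac{\prod_{i=1}^{c} (m - cx/2 - i)}{\prod_{i=1}^{c} (cx/2 + i)} \cdot \frac{\left(\frac{cx+2c}{2m}\right)^{cx+2c}}{\left(\frac{cx}{2m}\right)^{cx+2c-2c}}
\end{aligned}
$$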