Optimal Inference in Crowdsourced Classification via Belief Propagation

02/11/2016 ∙ by Jungseul Ok, et al. ∙ KAIST Department of Mathematical Sciences ∙ University of Illinois at Urbana-Champaign

Crowdsourcing systems are popular for solving large-scale labelling tasks with low-paid workers. We study the problem of recovering the true labels from the possibly erroneous crowdsourced labels under the popular Dawid-Skene model. To address this inference problem, several algorithms have recently been proposed, but the best known guarantee is still significantly larger than the fundamental limit. We close this gap by introducing a tighter lower bound on the fundamental limit and proving that Belief Propagation (BP) exactly matches this lower bound. The guaranteed optimality of BP is the strongest in the sense that it is information-theoretically impossible for any other algorithm to correctly label a larger fraction of the tasks. Experimental results suggest that BP is close to optimal for all regimes considered and improves upon competing state-of-the-art algorithms.


1 Introduction

Crowdsourcing platforms provide scalable human-powered solutions to labelling large-scale datasets at minimal cost. They are particularly popular in domains where the task is easy for humans but hard for machines, e.g., computer vision and natural language processing. For example, the CAPTCHA system [1] uses a pair of scanned images of English words, one for authenticating the user and the other for obtaining high-quality character recognitions to be used in digitizing books. However, because the tasks are tedious and the pay is low, one of the major issues is the quality of the labels. Errors are common even among those who make an effort. In real-world systems, spammers, who submit random answers rather than good-faith attempts to label, are abundant, and there are also adversaries who deliberately give wrong answers.

A common and powerful strategy to improve reliability is to add redundancy: assigning each task to multiple workers and aggregating their answers by some algorithm such as majority voting. Although majority voting is widely used in practice, several novel approaches that outperform majority voting have recently been proposed, e.g., [2, 3, 4, 5, 6]. The key idea is to identify the good workers and give more weight to the answers from those workers. Although the ground truths may never be exactly known, one can compare one worker's answers to those from other workers on the same tasks, and infer how reliable or trustworthy each worker is.
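As a point of reference, the majority-voting baseline described above can be sketched in a few lines. The data layout (a dict mapping each task to its list of ±1 answers), the function name, and the tie-breaking rule toward +1 are illustrative assumptions, not part of the original text.

```python
from typing import Dict, Hashable, List


def majority_vote(answers: Dict[Hashable, List[int]]) -> Dict[Hashable, int]:
    """Aggregate +/-1 answers per task by simple (unweighted) majority.

    Ties are broken toward +1, an arbitrary choice made for this sketch.
    """
    return {task: (1 if sum(votes) >= 0 else -1) for task, votes in answers.items()}


# Three tasks with redundant, possibly conflicting answers.
labels = majority_vote({"t1": [1, 1, -1], "t2": [-1, -1, 1], "t3": [1, -1]})
```

Every worker counts equally here; the approaches cited above differ precisely in replacing this uniform weighting with weights inferred from worker agreement.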

The standard probabilistic model for representing the noisy answers in labelling tasks is the model introduced by Dawid and Skene in [7]. Under this model, the core problem of interest is how to aggregate the answers to maximize the accuracy of the estimated labels. This is naturally posed as a statistical inference problem that we call the crowdsourced classification problem. Due to the combinatorial nature of the problem, the Maximum A Posteriori (MAP) estimate is optimal but computationally intractable. Several algorithms have recently been proposed as approximations, and their performances are demonstrated only by numerical experiments. These include algorithms based on spectral methods [8, 9, 10, 11, 12], Belief Propagation (BP) [13], Expectation Maximization (EM) [13, 14], maximum entropy [15, 16], weighted majority voting [17, 18, 19], and combinatorial approaches [20].

Despite the algorithmic advances, theoretical advances have been relatively slow. Some upper bounds on the performances are known [10, 14, 20], but fall short of answering which algorithm should be used in practice. In this paper, we ask the fundamental question of whether it is possible to achieve the performance of the optimal MAP estimator with a computationally efficient inference algorithm. In other words, we investigate the computational gap between what is information-theoretically possible and what is achievable with a polynomial time algorithm.

Our main result is that there is no computational gap in the crowdsourced classification problem for a broad range of problem parameters. Under some mild assumptions on the parameters of the problem, we show the following:

Belief propagation is exactly optimal.

To the best of our knowledge, our algorithm is the only computationally efficient approach that provably maximizes the fraction of correctly labeled tasks, achieving exact optimality.

Contribution. We consider binary classification tasks and identify regimes where the standard BP achieves the performance of the optimal MAP estimator. When each task is assigned to a sufficient number of workers, we prove that it is impossible for any other algorithm to correctly label a larger fraction of tasks than BP. This is the only known algorithm to achieve such a strong notion of optimality, and it settles the question of whether there is a computational gap in the crowdsourced classification problem for a broad range of parameters. We provide experimental results confirming the optimality of BP for both synthetic and real datasets.

The provable optimality of BP-based algorithms in graphical models with loops (such as those in our model) is known only in a few instances, including community detection [21], error correcting codes [22] and combinatorial optimization [23]. Technically, our proof strategy for the optimality of BP is similar to that in [21], where another variant of the BP algorithm is proved to be optimal for recovering the latent community structure among users. However, our proof technique overcomes several unique challenges, arising from the complicated correlation among tasks that can only be represented by weighted and directed hyper-edges, as opposed to simpler unweighted undirected edges in the case of stochastic block models. This might be of independent interest in analyzing censored block models [24] with some directed observations.

Related work. The crowdsourced classification problem was first studied in the dense regime, where all tasks are assigned to all the workers [8, 14]. In such dense regimes, as the problem size increases, each task receives an increasing number of answers. Thus, previous work has studied the probability of labelling all tasks correctly [8, 14].

In this paper, we focus on the sparse regime, where each task is assigned to a few workers. Suppose $\ell$ workers are assigned each task. In practical crowdsourcing systems, a typical choice of $\ell$ is three or five. For a fixed $\ell$, the probability of error no longer decays with increasing dimension of the problem. The theoretical interest is focused on identifying how the error scales with $\ell$, which represents how much redundancy should be introduced in the system. An upper bound that scales as $e^{-\Omega(\ell)}$ (when $\ell$ exceeds some threshold that depends on the problem parameters) was proved by [10], analyzing a spectral algorithm that is modified to use the spectral properties of the non-backtracking operators instead of the usual adjacency matrices. This scaling order is also shown to be optimal by comparing it to the error rate of an oracle estimator. A similar bound was also proved for another spectral approach, but under more restricted conditions in [9]. Our main results provide an algorithm that (when $\ell$ exceeds some constant depending on the problem parameters, where we denote the number of tasks per worker by $r$) correctly labels the optimal fraction of tasks, in the sense that it is information-theoretically impossible for any other algorithm to correctly label a larger fraction.

These spectral approaches are popular due to simplicity, but empirically do not perform as well as BP. In fact, the authors in [13] showed that the state-of-the-art spectral approach proposed in [10] is a special case of BP with a specific choice of the prior on the worker qualities. Since the algorithmic prior might be in mismatch with the true prior, the spectral approach is suboptimal.

Organization. In Section 2, we provide the necessary background, including the Dawid-Skene model for crowdsourced classification and the BP algorithm. Section 3 provides the main results of this paper, and their proofs are presented in Section 4. Our experimental results on the performance of BP are reported in Section 6 and we conclude in Section 7.

2 Preliminaries

We describe the mathematical model and present the standard MAP and the BP approaches.

2.1 Crowdsourced Classification Problem

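We consider a set of $n$ binary tasks, denoted by $V$. Each task $i \in V$ is associated with a ground truth $s_i \in \{-1, +1\}$. Without loss of generality, we assume the $s_i$'s are independently chosen with equal probability. We let $W$ denote the set of workers who are assigned tasks to answer. Hence, this task assignment is represented by a bipartite graph $G = (V, W, E)$, where edge $(i, u) \in E$ indicates that task $i$ is assigned to worker $u$. For notational simplicity, let $N_u := \{i \in V : (i, u) \in E\}$ denote the set of tasks assigned to worker $u$ and conversely let $M_i := \{u \in W : (i, u) \in E\}$ denote the set of workers to whom task $i$ is assigned.

When task $i$ is assigned to worker $u$, worker $u$ provides a binary answer $A_{iu} \in \{-1, +1\}$, which is a noisy assessment of the true label $s_i$. Each worker $u$ is parameterized by a reliability $p_u \in [0, 1]$, such that each of her answers is correct with probability $p_u$. Namely, for given $p := \{p_u : u \in W\}$, the answers $A := \{A_{iu} : (i, u) \in E\}$ are independent random variables such that

$$A_{iu} = \begin{cases} s_i & \text{with probability } p_u, \\ -s_i & \text{with probability } 1 - p_u. \end{cases}$$

We assume that the average reliability is greater than $1/2$, i.e., $\mu := \mathbb{E}[2 p_u - 1] > 0$.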

This Dawid-Skene model is the most popular one in crowdsourcing, dating back to [7]. The underlying assumption is that all the tasks share a homogeneous difficulty; the error probability of a worker is consistent across all tasks. We assume that the reliabilities $p_u$ are i.i.d. according to a reliability distribution on $[0, 1]$, described by a probability density function $\pi$.

For the theoretical analysis, we assume that the bipartite graph $G$ is drawn uniformly over all $(\ell, r)$-regular graphs for some constants $\ell, r$ using, for example, the configuration model [25]. (Footnote 1: We assume constant $\ell, r$ for simplicity, but our results hold as long as $\ell r = O(\log n)$.) Each task is assigned to $\ell$ random workers and each worker is assigned $r$ random tasks. In real-world crowdsourcing systems, the designer gets to choose which graph to use for task assignments. Random regular graphs have been proven to achieve minimax optimal performance in [10], and have been shown empirically to perform well. This is due to the fact that random graphs have large spectral gaps.
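The generative process of this subsection can be sketched as follows: a task-worker assignment is drawn with a configuration-model style pairing of half-edges, and answers are then simulated from the Dawid-Skene model. The function names, the two-point reliability prior, and the decision not to resample repeated (task, worker) pairs are assumptions made for illustration only.

```python
import random
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # (task index, worker index)


def sample_assignment(n: int, ell: int, r: int, rng: random.Random) -> List[Edge]:
    """Pair task half-edges with worker half-edges uniformly at random.

    Each task gets ell half-edges and each worker gets r. Repeated
    (task, worker) pairs can occur; a fuller version would resample them.
    """
    if (n * ell) % r != 0:
        raise ValueError("n * ell must be divisible by r")
    num_workers = n * ell // r
    task_stubs = [i for i in range(n) for _ in range(ell)]
    worker_stubs = [u for u in range(num_workers) for _ in range(r)]
    rng.shuffle(worker_stubs)
    return list(zip(task_stubs, worker_stubs))


def simulate_answers(
    edges: List[Edge], n: int, reliabilities: List[float], rng: random.Random
) -> Tuple[List[int], Dict[Edge, int]]:
    """Draw uniform +/-1 ground truths, then one noisy answer per edge."""
    truth = [rng.choice((-1, 1)) for _ in range(n)]
    answers = {}
    for i, u in edges:
        is_correct = rng.random() < reliabilities[u]
        answers[(i, u)] = truth[i] if is_correct else -truth[i]
    return truth, answers


rng = random.Random(0)
n, ell, r = 12, 3, 2
edges = sample_assignment(n, ell, r, rng)
num_workers = n * ell // r
# Assumed prior: each worker is fairly reliable (0.9) or barely informative (0.6).
reliabilities = [rng.choice((0.6, 0.9)) for _ in range(num_workers)]
truth, answers = simulate_answers(edges, n, reliabilities, rng)
```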

2.2 MAP Estimator

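Under this crowdsourcing model with given assignment graph $G$ and reliability distribution $\pi$, our goal is to design an efficient estimator $\hat{s}$ of the unobserved true answers $s$ from the noisy answers $A$ reported by workers. In particular, we are interested in the optimal estimator minimizing the (expected) average bit-wise error rate, i.e.,

$$\underset{\hat{s}}{\text{minimize}} \quad P_{\mathrm{av}}(\hat{s}) \tag{1}$$

where we define

$$P_{\mathrm{av}}(\hat{s}) := \frac{1}{n} \sum_{i \in V} \mathbb{P}[s_i \neq \hat{s}_i(A)].$$

The probability is taken with respect to $s$ and $A$ for given $G$ and $\pi$. From standard Bayesian arguments, the maximum a posteriori (MAP) estimator is an optimal solution of (1):

$$\hat{s}^*_i(A) := \arg\max_{s_i \in \{-1, +1\}} \mathbb{P}[s_i \mid A]. \tag{2}$$

However, this MAP estimate is challenging to compute, as we show below. Writing $A_u := \{A_{iu} : i \in N_u\}$ for the answers of worker $u$, where $N_u$ is the set of tasks assigned to worker $u$, note that

\begin{align}
\mathbb{P}[s \mid A] &\propto \mathbb{P}[A \mid s] \tag{3} \\
&= \prod_{u \in W} \mathbb{E}_{p_u}\big[\mathbb{P}[A_u \mid s, p_u]\big] \tag{4} \\
&= \prod_{u \in W} \mathbb{E}_{p_u}\big[p_u^{c_u} (1 - p_u)^{r_u - c_u}\big] \tag{5}
\end{align}

where $r_u := |N_u|$ is the number of the tasks assigned to worker $u$ and $c_u := |\{i \in N_u : A_{iu} = s_i\}|$ is the number of the correct answers from worker $u$. Then,

\begin{align}
\mathbb{P}[s \mid A] &\propto \prod_{u \in W} \mathbb{E}_{p_u}\big[p_u^{c_u} (1 - p_u)^{r_u - c_u}\big] \tag{6} \\
&= \prod_{u \in W} f_u(s_{N_u}) \tag{7}
\end{align}

where we let $f_u(s_{N_u}) := \mathbb{E}_{p_u}\big[p_u^{c_u} (1 - p_u)^{r_u - c_u}\big]$ denote the local factor associated with worker $u$. We note that the factorized form of the joint probability of $s$ in (7) corresponds to a standard graphical model with a factor graph that represents the joint probability of $s$ given $A$, where each task $i$ and each worker $u$ correspond to the random variable $s_i$ and the local factor $f_u$, respectively, and the edges in $E$ indicate couplings among the variables and the factors.

The marginal probability in the optimal estimator $\hat{s}^*$ is calculated by marginalizing out $s_{-i} := \{s_j : j \neq i\}$ from (7), i.e.,

\begin{align}
\mathbb{P}[s_i \mid A] &= \sum_{s_{-i} \in \{-1, +1\}^{n-1}} \mathbb{P}[s \mid A] \tag{8} \\
&\propto \sum_{s_{-i} \in \{-1, +1\}^{n-1}} \prod_{u \in W} f_u(s_{N_u}). \tag{9}
\end{align}

We note that the summation in (9) is taken over exponentially many $s_{-i}$ with respect to $n$. Thus, in general, the optimal estimator $\hat{s}^*$, which requires the marginal probability of $s_i$ given $A$ in (2), is computationally intractable due to the exponential complexity in (9).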
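To make the exponential cost concrete, the following sketch computes the exact task marginals by enumerating all labelings, which is only feasible for a handful of tasks. It assumes a discrete reliability prior given as (value, probability) pairs, so that each worker's local factor is a finite sum; the names `local_factor` and `brute_force_marginals` and the toy answers are hypothetical.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

Prior = Sequence[Tuple[float, float]]  # (reliability value, probability) pairs


def local_factor(num_correct: int, num_tasks: int, prior: Prior) -> float:
    """E[p^c (1 - p)^(r - c)] under the assumed discrete reliability prior."""
    return sum(w * p**num_correct * (1 - p) ** (num_tasks - num_correct) for p, w in prior)


def brute_force_marginals(n: int, answers: Dict[Tuple[int, int], int], prior: Prior) -> List[float]:
    """Exact P[s_i = +1 | A], summing the product of factors over all 2^n labelings."""
    by_worker: Dict[int, List[Tuple[int, int]]] = {}
    for (i, u), a in answers.items():
        by_worker.setdefault(u, []).append((i, a))
    total = 0.0
    plus = [0.0] * n
    for s in product((-1, 1), repeat=n):  # exponential in n
        weight = 1.0
        for items in by_worker.values():
            correct = sum(1 for i, a in items if a == s[i])
            weight *= local_factor(correct, len(items), prior)
        total += weight
        for i in range(n):
            if s[i] == 1:
                plus[i] += weight
    return [x / total for x in plus]


prior = [(0.9, 0.5), (0.6, 0.5)]  # assumed two-point prior
answers = {(0, 0): 1, (1, 0): 1, (0, 1): 1, (1, 1): -1, (1, 2): -1}
marginals = brute_force_marginals(2, answers, prior)
map_labels = [1 if m >= 0.5 else -1 for m in marginals]
```

Each additional task doubles the number of labelings visited by the outer loop, which is the intractability the text refers to.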

2.3 Belief Propagation

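Recalling the factor graph described by (7), the computational intractability in (9) motivates us to use a standard sum-product belief propagation (BP) algorithm on the factor graph as a heuristic method for approximating the marginalization. The BP algorithm is described by the following iterative update of messages $m_{i \to u}$ and $m_{u \to i}$ between task $i$ and worker $u$ and belief $b_i$ on each task $i$:

\begin{align}
m^{t+1}_{i \to u}(s_i) &\propto \prod_{v \in M_i \setminus \{u\}} m^{t}_{v \to i}(s_i) \tag{10} \\
m^{t+1}_{u \to i}(s_i) &\propto \sum_{s_{N_u \setminus \{i\}}} f_u(s_{N_u}) \prod_{j \in N_u \setminus \{i\}} m^{t+1}_{j \to u}(s_j) \tag{11} \\
b^{t+1}_i(s_i) &\propto \prod_{u \in M_i} m^{t+1}_{u \to i}(s_i) \tag{12}
\end{align}

where $M_i$ is the set of workers assigned to task $i$, $N_u$ is the set of tasks assigned to worker $u$, $f_u$ is the local factor of worker $u$ in (7), and the belief $b^t_i(s_i)$ is the estimated marginal probability of $s_i$ given $A$. We here initialize messages with a trivial constant $1/2$ and normalize messages and beliefs, i.e., $\sum_{s_i} m^t_{i \to u}(s_i) = \sum_{s_i} m^t_{u \to i}(s_i) = \sum_{s_i} b^t_i(s_i) = 1$. Then at the end of $k$ iterations, we estimate the label of task $i$ as follows:

$$\hat{s}^{\mathrm{BP}(k)}_i := \arg\max_{s_i \in \{-1, +1\}} b^k_i(s_i). \tag{13}$$

We note that if the factor graph is a tree, then it is known that the belief converges and computes the exact marginal probability [26].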

Property 1.

If assignment graph is a tree so that the corresponding factor graph is a tree as well, then

where the belief $b^t_i$ is iteratively updated by BP in (10), (11) and (12).

However, for general graphs which may have loops, e.g., random $(\ell, r)$-regular graphs, BP has no performance guarantee, i.e., BP may output $\hat{s}^{\mathrm{BP}} \neq \hat{s}^*$. Further, the convergence of BP is not guaranteed, i.e., the value of $\lim_{t \to \infty} b^t_i(s_i)$ may not exist.
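A minimal sketch of sum-product BP on the worker-task factor graph is given below. It assumes a discrete reliability prior given as (value, probability) pairs, a flooding schedule in which all task-to-worker messages are refreshed before all worker-to-task messages, and at most one answer per (task, worker) pair; `local_factor` and `run_bp` are hypothetical names, and this is not claimed to be the authors' implementation.

```python
from itertools import product
from math import prod
from typing import Dict, List, Sequence, Tuple

Prior = Sequence[Tuple[float, float]]  # (reliability value, probability) pairs


def local_factor(num_correct: int, num_tasks: int, prior: Prior) -> float:
    """E[p^c (1 - p)^(r - c)] under the assumed discrete reliability prior."""
    return sum(w * p**num_correct * (1 - p) ** (num_tasks - num_correct) for p, w in prior)


def run_bp(n: int, answers: Dict[Tuple[int, int], int], prior: Prior, iterations: int) -> List[float]:
    """Return the beliefs b_i(+1) after the given number of BP iterations."""
    tasks_of: Dict[int, List[Tuple[int, int]]] = {}  # worker -> [(task, answer)]
    workers_of: Dict[int, List[int]] = {i: [] for i in range(n)}
    for (i, u), a in answers.items():
        tasks_of.setdefault(u, []).append((i, a))
        workers_of[i].append(u)
    # Each message is [value at s = -1, value at s = +1], initialised to 1/2.
    to_task = {(u, i): [0.5, 0.5] for (i, u) in answers}
    for _ in range(iterations):
        # Task-to-worker: product of the other workers' incoming messages.
        to_worker = {}
        for (i, u) in answers:
            msg = [prod(to_task[(v, i)][k] for v in workers_of[i] if v != u) for k in (0, 1)]
            z = msg[0] + msg[1]
            to_worker[(i, u)] = [msg[0] / z, msg[1] / z]
        # Worker-to-task: sum over the labels of the worker's other tasks.
        for (i, u) in answers:
            others = [(j, a) for j, a in tasks_of[u] if j != i]
            msg = [0.0, 0.0]
            for k, s_i in enumerate((-1, 1)):
                for config in product((0, 1), repeat=len(others)):
                    correct = int(answers[(i, u)] == s_i)
                    weight = 1.0
                    for (j, a_j), kj in zip(others, config):
                        correct += int(a_j == (-1, 1)[kj])
                        weight *= to_worker[(j, u)][kj]
                    msg[k] += weight * local_factor(correct, len(tasks_of[u]), prior)
            z = msg[0] + msg[1]
            to_task[(u, i)] = [msg[0] / z, msg[1] / z]
    beliefs = []
    for i in range(n):
        b = [prod(to_task[(u, i)][k] for u in workers_of[i]) for k in (0, 1)]
        beliefs.append(b[1] / (b[0] + b[1]))
    return beliefs


prior = [(0.9, 0.5), (0.6, 0.5)]  # assumed two-point prior
# Tree-shaped toy instance: worker 2 - task 0 - worker 0 - task 1 - worker 1 - task 2.
answers = {(0, 0): 1, (1, 0): 1, (1, 1): -1, (2, 1): -1, (0, 2): -1}
beliefs = run_bp(3, answers, prior, iterations=6)
estimate = [1 if b >= 0.5 else -1 for b in beliefs]
```

The worker-to-task update enumerates the labels of a worker's other tasks, so its cost grows exponentially in the number of tasks per worker; this is acceptable only for the small worker degrees used in this sketch.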

3 Performance Guarantees of BP

In this section, we provide the theoretical guarantees on the performance of BP. To this end, we consider the output of BP in (13) with a choice of . It follows that the overall complexity of BP is bounded by as each iteration of BP requires operations [13].

3.1 Exact Optimality of BP for large $\ell$

We show in the following that BP is asymptotically optimal under a mild assumption that each task is assigned to a sufficiently large (but constant with respect to the number of tasks) number of workers, i.e., . This follows from a tighter bound in the non-asymptotic regime, where we upper bound the optimality gap, which vanishes exponentially in the number of iterations . We present both results in the following theorem.

Theorem 1.

Consider the Dawid-Skene model under the task assignment generated by a random bipartite -regular graph consisting of tasks and workers. Let denote the output of BP in (13) after iterations. For , , and , there exists a constant that only depends on and such that if , then for sufficiently large :

(14)

where the expectation is taken with respect to the graph .

As a corollary, it follows that when we set increasing with , for example , we have asymptotic optimality:

(15)

A proof is provided in Section 4.1. Our analysis compares BP to an oracle estimator. This oracle estimator not only has access to the observed crowdsourced labels, but also the ground truths of a subset of tasks. Given this extra information, it performs the optimal estimation, outperforming any algorithm that operates only on the observations. Using the fact that the random -regular bipartite graph has a locally tree-like structure [25] and BP is exact on the local tree [26], we prove that the performance gap between BP and the oracle estimator vanishes due to decaying correlation from the information on the outside of the local tree to the root. This establishes that the gap between BP and the best estimator vanishes, in the large system limit.
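The locally tree-like property used in this argument can be checked empirically on a sampled assignment graph. The sketch below tests whether the subgraph induced by all nodes within a given distance of a root task is a tree, using the fact that a connected graph is a tree exactly when it has one fewer edge than nodes. The edge-list representation and the function name are assumptions made for illustration.

```python
from collections import deque
from typing import Dict, List, Tuple

Node = Tuple[str, int]  # ("t", task index) or ("w", worker index)


def neighborhood_is_tree(edges: List[Tuple[int, int]], root_task: int, depth: int) -> bool:
    """True if the nodes within `depth` hops of the root task induce a tree."""
    adj: Dict[Node, List[Node]] = {}
    for i, u in edges:
        adj.setdefault(("t", i), []).append(("w", u))
        adj.setdefault(("w", u), []).append(("t", i))
    root: Node = ("t", root_task)
    dist = {root: 0}
    queue = deque([root])
    while queue:  # breadth-first search truncated at `depth`
        x = queue.popleft()
        if dist[x] == depth:
            continue
        for y in adj.get(x, []):
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    # Count induced edges with multiplicity, so repeated pairs count as cycles.
    num_edges = sum(1 for i, u in edges if ("t", i) in dist and ("w", u) in dist)
    return num_edges == len(dist) - 1


path_edges = [(0, 0), (1, 0), (1, 1), (2, 1)]   # task0 - worker0 - task1 - worker1 - task2
cycle_edges = [(0, 0), (1, 0), (0, 1), (1, 1)]  # a 4-cycle through two tasks and two workers
path_ok = neighborhood_is_tree(path_edges, root_task=0, depth=4)
cycle_ok = neighborhood_is_tree(cycle_edges, root_task=0, depth=2)
```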

The assumption on $\mu$ is mild, since it only requires that the crowd as a whole can distinguish what the true label is. In the case $\mu < 0$, one can flip the sign of the final estimate to achieve the same guarantee. It is more intuitive to understand this assumption as formally defining the ground truth as what the majority of the crowd would agree on (on average) if we asked the same question to all the workers in the crowd. Hence, this assumption is without loss of generality.

The assumption on is mild, as the only case when is if the reliability $p_u$ is a binary random variable taking values only in $\{0, 1\}$. In such cases, every worker is either telling the exact truth consistently or exactly the opposite of the truth. It follows from the Perron-Frobenius theorem [27] that a naive spectral method would work (and so do several other simple techniques). However, BP messages are not smooth in this case, and smoothness is required for our analysis. We believe optimality of BP still holds but requires a different analysis technique.

Although BP works well in practice in all regimes of parameters, as suggested in Section 6, theoretically we require to ensure that the graph is locally tree-like within the neighborhood of depth . Analysis of BP beyond is an open problem, also in other applications such as community detection [21].

When $r = 1$, there is nothing to learn about the workers and simple majority voting is also the optimal estimator. BP also reduces to majority voting in this case, achieving the same optimality, and in fact . The interesting non-trivial case is when $r = 2$. The sufficient condition is for $\ell$ to be larger than some constant. Although experimental results in Section 6 suggest that BP is optimal in all regimes considered, proving optimality for smaller $\ell$ requires new analysis techniques, beyond those we develop in this paper. The problem of analyzing BP for small $\ell$ (the sample sparse regime) is challenging. Similar challenges have not been resolved even in the simpler stochastic block models, where BP and other efficient inference algorithms have been analyzed extensively [21, 28]. (Footnote 2: The stochastic block model is simpler than our model in the sense that it has only pairwise factors, which is the special case of our model with $r = 2$.)

3.2 Relative Dominance of BP for small $\ell$

For general and , we establish the dominance of BP over two existing algorithms with known guarantees: the majority voting (MV) and the state-of-the-art iterative algorithm (KOS) in [10]. In the sparse regime, where , these are the only existing algorithms with tight provable guarantees.

Theorem 2.

Consider the Dawid-Skene model under the task assignment generated by a random bipartite -regular graph consisting of tasks and workers. Let and denote the outputs of MV and KOS algorithms, respectively. Then, for any such that ,

where is the output of BP in (13) with and the expectations are taken with respect to the graph .

A proof of the above theorem is presented in Section 4.2. Using Theorem 2 and the known error rates of MV and KOS algorithms in [10], one can derive the following upper bound on the error rate of BP:

(16)

where and all the parameters , and can depend on .

This is particularly interesting, since it has been observed empirically and conjectured with some non-rigorous analysis in [12] that there exists a threshold, above which KOS dominates over MV, and below which MV dominates over KOS (see Figure 2). This is due to the fact that KOS is inherently a spectral algorithm relying on the singular vectors of a particular matrix derived from the answers $A$. Below the threshold, the sample noise overwhelms the signal in the spectrum of the matrix, which is known as the spectral barrier, and spectral methods fail. However, in practice, it is not clear which of the two algorithms should be used, since the threshold depends on latent parameters of the problem. Our dominance result shows that one can safely use BP, since it outperforms both algorithms in both regimes governed by the threshold. This is further confirmed by numerical experiments in Figure 2.

4 Proofs of Theorems

In this section, we provide the proofs of Theorems 1 and 2.

4.1 Proof of Theorem 1

We first consider the case . Then, is the set of disjoint one-level trees, i.e., star graphs, where the root of each tree corresponds to task and the leaves are the set of workers assigned to the task . Since the graphs are disjoint, we have , where and . From Property 1, it follows that

Therefore, for any , the optimal MAP estimator in (2) is identical to the output with any .

From now on, we focus on the case , and we condition on a fixed task assignment graph . Define as a random node chosen uniformly at random and let denote the gain of estimator compared to random guessing, i.e.,

where the expectation is taken with respect to the distribution of . Then it is enough to show that and converge to the same value, i.e., the limit value of exists and as ,

(17)

where the expectation is taken with respect to the distribution of .

To this end, we introduce two estimators, and , which have access to different amounts and types of information. Let denote the subgraph of induced by all the nodes within (graph) distance from root and denote the set of (task) nodes whose distance from is exactly . (Footnote 3: Since is a bipartite graph, the distance from task to every task is even and the distance from task to every worker is odd.) We now define the following oracle estimator:

where we denote

(18)

We note that uses the exact label information of separating the inside and the outside of . Hence one can show that outperforms the optimal estimator . We formally provide the following lemma whose proof is given in Section 5.1.

Lemma 1.

Consider the Dawid-Skene model with the task assignment corresponding to and let denote the set of workers’ labels. For and ,

Conversely, if an estimator uses less information than another, it performs worse. Formally, we provide the following lemma whose proof is given in Section 5.2.

Lemma 2.

Consider the Dawid-Skene model with the task assignment corresponding to and let denote the set of workers’ labels. For any and subset ,

On estimating task , BP at -th iteration on is identical to BP on . If is a tree, then from Property 1, BP calculates the exact marginal probability of given , i.e.,

Thus, if is a tree, then using Lemmas 1 and 2 with , we have that

(19)
(20)

where we define

Consider now a random -regular bipartite graph , which is locally tree-like. More formally, from Lemma 5 in [12], it follows that

(21)

Hence, by taking the expectation with respect to and applying (21) to (20), we get

(22)
(23)

where the last term in the RHS is less than for sufficiently large since we set and . In addition, from the following lemma, the first term in the RHS is also less than . Hence, this implies (17) and the existence of the limit of due to the bounded and non-increasing sequence of in Lemma 1. We complete the proof of Theorem 1.

Lemma 3.

Suppose is a tree of which root is task and depth is , where every task except the leaves is assigned to workers and every worker labels two tasks. For a given , there exists a constant such that if , then for sufficiently large ,

(24)

A rigorous proof of Lemma 3 is given in Section 5.3. Here, we briefly provide the underlying intuition on the proof. As long as is strictly greater than and is sufficiently large, the majority voting of the one-hop information can achieve high accuracy. On the other hand, intuitively the information in two or more hops is less useful. In the proof of Lemma 3, we also provide a quantification of the decaying rate of the correlation from the information on to as the distance increases.

4.2 Proof of Theorem 2

We note that KOS is an iterative algorithm where for each and , depends only on defined in (18). In addition, it is clear that MV uses only one-hop information . Hence for given , the MAP estimator outperforms MV and KOS, i.e.,

(25)

Recall that if is a tree, we have . Similarly to (23), by taking the expectation with respect to , it follows that

where the last term goes as if and . This completes the proof of Theorem 2.

5 Proofs of Lemmas

5.1 Proof of Lemma 1

We start with the conditional probability of error given in the following:

This directly implies that

(26)

Then, by simple algebra, it follows that

where for the last equality we use

Let denote the distribution of given , and let be the distribution of given , i.e.,

Then we have a simple expression of as follows:

(27)

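where $d_{\mathrm{TV}}$ denotes the total variation distance, i.e., for distributions $\phi$ and $\psi$ on the same space $\Omega$, we define

$$d_{\mathrm{TV}}(\phi, \psi) := \frac{1}{2} \sum_{\sigma \in \Omega} |\phi(\sigma) - \psi(\sigma)|.$$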

Next we note that since blocks every path from the outside of to , the information on the outside of , , is independent of given , i.e.,

(28)

Hence if we set to be the distribution of and given and similarly for , we have

Noting that and can be obtained by marginalizing out in and , it follows that

(29)
(30)
(31)
(32)

which implies .

We now study with different . Observe that blocks every path from to , i.e., is independent of given . Thus from (28) it follows that

Therefore, and can be obtained from and by marginalizing out . Similarly to (32), we have

which completes the proof of Lemma 1.

5.2 Proof of Lemma 2

The proof of Lemma 2 is analogous to that of Lemma 1. Let be the distribution of given and be the distribution of given , i.e.,

Since and can be obtained by marginalizing out from and in (27), using the same logic for (32), we have

which completes the proof of Lemma 2.

5.3 Proof of Lemma 3

We start with some notation which we use in the proof. For , let be the subtree rooted at including all the descendants of in tree . We let denote the leaves in and . Define

Here is often called the magnetization of given . Similarly, given and , we define the biased magnetization :

Using the alternative expression of in (26), one can check that

where the expectation is taken with respect to and .

Next, for , we define to be a random node chosen uniformly at random so that is a leaf node in , i.e., thus , and is the root , i.e., . Therefore it is enough to show that for each

(33)

since this implies

(34)

and hence as . Here quantifies the correlation from the information at the leaves to . We will show that the correlation exponentially decays with respect to in what follows.

Figure 1: A graphical representation of notations: and .

To do so we study certain recursions describing relations among and . Let be the set of all the offspring of and be the set of all the offspring of in tree , i.e., and . (See Figure 1 for a graphical explanation of the notations.) Also, define and such that . Then in (7) can be expressed as follows:

where the expectation is taken for . Also, using the above expression of and the fact that , we first write the marginal probability of given and :