# A Streaming Algorithm for Crowdsourced Data Classification

We propose a streaming algorithm for the binary classification of data based on crowdsourcing. The algorithm learns the competence of each labeller by comparing her labels to those of other labellers on the same tasks and uses this information to minimize the prediction error rate on each task. We provide performance guarantees of our algorithm for a fixed population of independent labellers. In particular, we show that our algorithm is optimal in the sense that the cumulative regret compared to the optimal decision with known labeller error probabilities is finite, independently of the number of tasks to label. The complexity of the algorithm is linear in the number of labellers and the number of tasks, up to some logarithmic factors. Numerical experiments illustrate the performance of our algorithm compared to existing algorithms, including simple majority voting and expectation-maximization algorithms, on both synthetic and real datasets.


## 1 Introduction

The performance of most machine learning techniques, and in particular data classification, strongly depends on the quality of the labeled data used in the initial training phase. A common way to label new datasets is through crowdsourcing: many people are asked to label data, typically texts or images, in exchange for a low payment. Of course, crowdsourcing is prone to errors due to the difficulty of some classification tasks, the low payment per task and the repetitive nature of the job. Some labellers may even introduce errors on purpose. Thus it is essential to assign the same classification task to several labellers and to learn the competence of each labeller through her past activity so as to minimize the overall error rate and to improve the quality of the labeled dataset.

Learning the competence of each labeller is a tough problem because the true label of each task, the so-called “ground-truth”, is unknown (it is precisely the objective of crowdsourcing to guess the true label). Thus the competence of each labeller must be inferred from the comparison of her labels on some set of tasks with those of other labellers on the same set of tasks.

In this paper, we consider binary labels and propose a novel algorithm for learning the error probability of each labeller based on the correlations of the labels. Specifically, we infer the error probabilities of the labellers from their agreement rates, that is, for each labeller, the proportion of other labellers who agree with her. A key feature of this agreement-based algorithm is its streaming nature: it is not necessary to store the labels of all tasks, which may be expensive for large datasets. Tasks can be classified on the fly, which simplifies the implementation of the algorithm. The algorithm can also be easily adapted to non-stationary environments where the labeller error probabilities evolve over time, due for instance to the self-improvement of the labellers or to changes in the type of data to label. The complexity of the algorithm is linear, up to some logarithmic factor.

We provide performance guarantees of our algorithm for a fixed population of labellers, assuming each labeller works on each task with some fixed probability and provides the correct label with some other fixed, unknown probability, independently of the other labellers. In particular, we show that our algorithm is optimal in terms of cumulative regret, namely the number of labels that are different from those given by the optimal decision, assuming the labeller error rates are perfectly known, is finite, independently of the number of tasks. We also propose a modification of the algorithm suitable for non-stationary environments and provide performance guarantees in this case as well. Finally, we compare the performance of our algorithm to those of existing algorithms, including simple majority voting and expectation-maximization algorithms, through numerical experiments using both synthetic and real datasets.

The rest of the paper is organized as follows. We present the related work in the next section. We then describe the model and the proposed algorithm. Section 4 is devoted to the performance analysis and Section 5 to the adaptation of the algorithm to non-stationary environments. The numerical experiments are presented in Section 6. Section 7 concludes the paper.

## 2 Related Work

The first problems of data classification using independent labellers appeared in the medical context, where each label refers to the state of a patient (e.g., sick or healthy) and the labellers are clinicians. In [4], Dawid and Skene proposed an expectation-maximization (EM) algorithm, admitting that the accuracy of the estimate was unknown. Several versions and extensions of this algorithm have since been proposed and tested in various settings [8, 18, 1, 16, 12], without any significant progress on the theoretical side. Performance guarantees have been provided only recently for an improved version of the algorithm relying on spectral methods in the initialization phase [22].

A number of Bayesian techniques have also been proposed and applied to this problem, see [16, 20, 9, 12, 11, 10] and references therein. Of particular interest is the belief-propagation (BP) algorithm of Karger, Oh and Shah [9], which is provably order-optimal in terms of the number of labellers required per task for any given target error rate, in the limit of an infinite number of tasks and an infinite population of labellers.

Another family of algorithms is based on the spectral analysis of some matrix representing the correlations between tasks or labellers. Ghosh, Kale and McAfee [5] work on the task-task matrix whose entries correspond to the number of labellers having labeled two tasks in the same manner, while Dalvi et al. [3] work on the labeller-labeller matrix whose entries correspond to the number of tasks labeled in the same manner by two labellers. Both obtain performance guarantees by the perturbation analysis of the top eigenvector of the corresponding expected matrix. The BP algorithm of Karger, Oh and Shah is in fact closely related to these spectral algorithms: their message-passing scheme is very similar to the power-iteration method applied to the task-labeller matrix, as observed in [9].

A recent paper proposes an algorithm based on the notion of minimax conditional entropy [23], relying on a probabilistic model jointly parameterized by the labeller ability and the task difficulty. The algorithm is evaluated through numerical experiments on real datasets only; no theoretical results are provided on the performance and the complexity of the algorithm.

All these algorithms require the storage of all labels in memory. To our knowledge, the only streaming algorithm that has been proposed for crowdsourced data classification is the recursive EM algorithm of Wang et al. [19], for which no performance guarantees are available.

Some authors consider slightly different versions of our problem. Ho et al. [7, 6] assume that the ground truth is known for some tasks and use the corresponding data to learn the competence of the labellers in the exploration phase and to assign tasks optimally in the exploitation phase. Liu and Liu [13] also look for the optimal task assignment but without the knowledge of any true label: an iterative algorithm similar to EM algorithms is used to infer the competence of each labeller, yielding a cumulative regret in for tasks compared to the optimal decision. Finally, some authors seek to rank the labellers with respect to their error rates, an information which is useful for task assignment but not easy to exploit for data classification itself [2, 15].

## 3 Model and Algorithm

### 3.1 Model

Consider $n$ labellers, for some integer $n \geq 3$. Each task consists in determining the answer to a binary question. The answer to task $t$, the “ground-truth”, is denoted by $G(t) \in \{+1, -1\}$. We assume that the random variables $G(1), G(2), \ldots$ are i.i.d. and centered, so that there is no bias towards one of the answers.

Each labeller provides an answer with probability $\alpha$. When labeller $i$ provides an answer, this answer is incorrect with probability $p_i$, independently of other labellers: $p_i$ is the error rate of labeller $i$, with $p_i = 0$ if labeller $i$ is perfectly accurate, $p_i = 1/2$ if labeller $i$ is non-informative and $p_i = 1$ if labeller $i$ always gives the wrong answer. We denote by $p$ the vector $(p_1, \ldots, p_n)$.

Denote by $X_i(t) \in \{+1, -1, 0\}$ the output of labeller $i$ for task $t$, where the output $0$ corresponds to the absence of an answer. We have:

$$X_i(t) = \begin{cases} G(t) & \text{w.p. } \alpha(1 - p_i), \\ -G(t) & \text{w.p. } \alpha p_i, \\ 0 & \text{w.p. } 1 - \alpha. \end{cases}$$

Since the labellers are independent, the random variables $X_1(t), \ldots, X_n(t)$ are independent given $G(t)$, for each task $t$. We denote by $X(t)$ the corresponding vector. The goal is to estimate the ground-truth $G(t)$ as accurately as possible by designing an estimator $\hat G(t)$ that minimizes the error probability $P(\hat G(t) \neq G(t))$. The estimator is adaptive and may be a function of $X(1), \ldots, X(t)$ and the parameter $\alpha$ (which is assumed known), but cannot depend on $p$, which is a latent parameter in our setting.
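The label model above can be simulated in a few lines. The following is a minimal sketch, not part of the original paper: the function name `draw_task`, the error rates and the answer probability are hypothetical choices made for illustration.

```python
import random

def draw_task(p, alpha, rng):
    """Return (g, x): ground truth g in {+1, -1} and labels x[i] in {+1, -1, 0}."""
    g = rng.choice((1, -1))              # centered ground truth
    x = []
    for p_i in p:
        if rng.random() >= alpha:        # no answer with probability 1 - alpha
            x.append(0)
        elif rng.random() < p_i:         # wrong answer with probability p_i
            x.append(-g)
        else:                            # correct answer otherwise
            x.append(g)
    return g, x

rng = random.Random(0)
p = [0.1, 0.2, 0.3, 0.4, 0.45]           # hypothetical error rates
alpha = 0.8                              # hypothetical answer probability
tasks = [draw_task(p, alpha, rng) for _ in range(20000)]

answered = sum(abs(x[0]) for _, x in tasks)
answer_rate = answered / len(tasks)                          # close to alpha
error_rate = sum(x[0] == -g for g, x in tasks) / answered    # close to p[0]
```

On a long run, the empirical answer rate and error rate of the first labeller should be close to `alpha` and `p[0]`, which is a quick sanity check of the three-way distribution.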

### 3.2 Weighted majority vote

It is well-known that, given $p$ and $X(t)$, an optimal estimator of $G(t)$ is the weighted majority vote [14, 17], namely

$$\hat G(t) = 1\{W(t) > 0\} - 1\{W(t) < 0\} + Z\, 1\{W(t) = 0\}, \tag{1}$$

where $W(t) = \sum_{i=1}^n w_i X_i(t)$, $w_i = \log\left(\frac{1 - p_i}{p_i}\right)$ is the weight of labeller $i$ (possibly infinite), and $Z$ is a Bernoulli random variable of parameter $1/2$ over $\{+1, -1\}$ (for random tie-breaking). We provide a proof that accounts for the fact that labellers may not provide an answer for each task.

###### Proposition 1

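Assuming $p$ is known, the estimator (1) is an optimal estimator of $G(t)$.

Proof. Finding an optimal estimator of $G(t)$ amounts to finding an optimal statistical test between hypotheses $\{G(t) = +1\}$ and $\{G(t) = -1\}$, under a symmetry constraint so that type I and type II error probabilities are equal. Consider a sample $X(t) = x \in \{+1, -1, 0\}^n$ and denote by $L^+(x)$ and $L^-(x)$ its likelihood under hypotheses $\{G(t) = +1\}$ and $\{G(t) = -1\}$, respectively. We have

$$\begin{aligned} L^+(x) &= \prod_{i=1}^n (\alpha p_i)^{1\{x_i = -1\}} (\alpha(1 - p_i))^{1\{x_i = 1\}} (1 - \alpha)^{1\{x_i = 0\}}, \\ L^-(x) &= \prod_{i=1}^n (\alpha p_i)^{1\{x_i = 1\}} (\alpha(1 - p_i))^{1\{x_i = -1\}} (1 - \alpha)^{1\{x_i = 0\}}. \end{aligned}$$

We deduce the log-likelihood ratio,

$$\log\left(\frac{L^+(x)}{L^-(x)}\right) = \sum_{i=1}^n w_i x_i = w^T x.$$

By the Neyman-Pearson theorem, for any level of significance, there exist $a$ and $b$ such that the uniformly most powerful test for that level is:

$$1\{w^T x > a\} - 1\{w^T x < a\} + Z\, 1\{w^T x = a\},$$

where $Z$ is a Bernoulli random variable of parameter $b$ over $\{+1, -1\}$. By symmetry, we must have $a = 0$ and $b = 1/2$, which is the announced result.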

This result shows that estimating the true answer $G(t)$ reduces to estimating the latent parameter $p$, which is the focus of the paper.
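The weighted majority vote of estimator (1) can be sketched as follows, assuming log-odds weights $\log((1 - p_i)/p_i)$ as given by the log-likelihood ratio in the proof. The function names are hypothetical; the handling of infinite weights (skipping absent answers so that `inf * 0` never occurs) is an implementation choice made here, not something specified in the paper.

```python
import math
import random

def weights(p):
    """Log-odds weights; infinite when a labeller is always right or always wrong."""
    w = []
    for p_i in p:
        if p_i == 0.0:
            w.append(math.inf)
        elif p_i == 1.0:
            w.append(-math.inf)
        else:
            w.append(math.log((1.0 - p_i) / p_i))
    return w

def weighted_majority_vote(w, x, rng=random):
    """Return +1 or -1 from labels x[i] in {+1, -1, 0}, breaking ties at random."""
    # Absent answers (x_i == 0) are skipped so that inf * 0 never produces NaN.
    total = sum(w_i * x_i for w_i, x_i in zip(w, x) if x_i != 0)
    if total > 0:
        return 1
    if total < 0:
        return -1
    return rng.choice((1, -1))           # tie (or NaN): fair coin

# One accurate labeller outweighs two weak labellers who disagree with her.
w = weights([0.1, 0.4, 0.4])
decision = weighted_majority_vote(w, [1, -1, -1], random.Random(0))
```

With equal weights the same labels would give the opposite decision, which is the difference between plain and weighted majority voting.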

### 3.3 Average error probability

A critical parameter for the estimation of $p$ is the average error probability,

$$q = \frac{1}{n} \sum_{i=1}^n p_i.$$

We assume the following throughout the paper:

###### Assumption 1

We have .

This assumption is essential. First, it is necessary to assume that $q < 1/2$, i.e., labellers say “mostly the truth”. Indeed, the transformation $p \mapsto 1 - p$ does not change the distribution of $X(t)$, meaning that the parameters $p$ and $1 - p$ are statistically indistinguishable: it is the assumption $q < 1/2$ that breaks the symmetry of the problem and allows one to distinguish between true and false answers.

Next, the accurate estimation of requires that there is enough correlation between the labellers’ answers. Taking for instance, the mean error rate is but the estimation of is impossible since any permutation of the indices of lets the distribution of unchanged. For , the average error probability becomes , the maximum value allowed by Assumption 1, and the estimation becomes feasible.

### 3.4 Prediction error rate

Before moving to the estimation of $p$, we give upper bounds on the prediction error rate, that is the probability that $\hat G(t) \neq G(t)$, given some estimator $\hat p$ of $p$.

First consider the case , which is a natural choice when nothing is known about . The corresponding weights are then equal and the estimator boils down to majority voting. We get

$$P(\hat G(t) \neq G(t)) \leq P\left(\sum_{i=1}^n X_i(t) \leq 0 \,\Big|\, G(t) = 1\right) \leq \exp\left(-\frac{n}{2}\big(\alpha(1 - 2q)\big)^2\right),$$

the second inequality following from Hoeffding’s inequality. For any fixed , the prediction error probability decreases exponentially fast with .

Now let . The corresponding weights are finite and the estimate follows from weighted majority voting. Again,

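$$P(\hat G(t) \neq G(t)) \leq P\left(\sum_{i=1}^n \hat w_i X_i(t) \leq 0 \,\Big|\, G(t) = 1\right) \leq \exp\left(-\frac{1}{2} \frac{\left(\alpha \sum_{i=1}^n \hat w_i (1 - 2p_i)\right)^2}{\sum_{i=1}^n \hat w_i^2}\right),$$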

the second inequality following from Hoeffding’s inequality.
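The two Hoeffding bounds can be compared numerically. The sketch below evaluates both expressions for a hypothetical population of ten labellers (five accurate, five weak) and uses log-odds weights computed from the true error rates as one possible choice of $\hat w$; these values are illustrative assumptions, not figures from the paper.

```python
import math

def bound_majority(p, alpha):
    """Hoeffding bound for plain majority voting."""
    n = len(p)
    q = sum(p) / n
    return math.exp(-0.5 * n * (alpha * (1.0 - 2.0 * q)) ** 2)

def bound_weighted(p, w_hat, alpha):
    """Hoeffding bound for weighted majority voting with weights w_hat."""
    drift = alpha * sum(w * (1.0 - 2.0 * p_i) for w, p_i in zip(w_hat, p))
    return math.exp(-0.5 * drift ** 2 / sum(w * w for w in w_hat))

p = [0.05] * 5 + [0.45] * 5                        # hypothetical error rates
oracle_w = [math.log((1.0 - p_i) / p_i) for p_i in p]

b_majority = bound_majority(p, alpha=1.0)
b_equal = bound_weighted(p, [1.0] * len(p), alpha=1.0)
b_oracle = bound_weighted(p, oracle_w, alpha=1.0)
```

With equal weights the second bound reduces to the first, and on this example the weights derived from the true error rates give a smaller bound.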

Consider for instance the “hammer-spammer” model where and , i.e., half of the labellers always tell the truth while the other half always provide random answers. We obtain upper bounds on the prediction error rate equal to for and for . Taking for instance, we obtain respective bounds on the prediction error rate equal to and : assuming these bounds are tight, this means that the accurate estimation of may decrease the prediction error rate by an order of magnitude.

### 3.5 Agreement-based algorithm

#### Maximum likelihood

We are interested in designing an estimator of $p$ which has low complexity and may be implemented in a streaming fashion. The most natural way of estimating $p$ would be to consider the true answers $G(1), \ldots, G(t)$ as latent parameters, and to calculate the maximum likelihood estimate of $p$ given the observations $X(1), \ldots, X(t)$. The likelihood of a sample $X(1) = x(1), \ldots, X(t) = x(t)$ given $G(1) = g(1), \ldots, G(t) = g(t)$ is

$$\prod_{s=1}^t \Big( L^+(x(s))\, 1\{g(s) = +1\} + L^-(x(s))\, 1\{g(s) = -1\} \Big).$$

This approach has two drawbacks. First, there is no simple sufficient statistic, so that one must store the whole sample $x(1), \ldots, x(t)$, which incurs a memory space of $O(nt)$ and prevents any implementation through a streaming algorithm. Second, the likelihood is expressed as a product of sums, so that the maximum likelihood estimator is hard to compute, and one must rely on iterative methods such as EM.

#### Agreement rates

We propose instead to estimate $p$ through the vector $a$ of agreement rates. We define the agreement rate of labeller $i$ as the average proportion of other labellers who agree with $i$, i.e.,

$$\begin{aligned} a_i &= \frac{1}{n-1} \sum_{j \neq i} P\big(X_i(t) X_j(t) = 1 \,\big|\, X_i(t) X_j(t) \neq 0\big) \\ &= \frac{1}{n-1} \sum_{j \neq i} \big(p_i p_j + (1 - p_i)(1 - p_j)\big). \end{aligned} \tag{2}$$

Observe that $a_i \in [0, 1]$, with $a_i = 0$ if labeller $i$ never agrees with the other labellers and $a_i = 1$ if labeller $i$ always agrees with the other labellers.

Using the average error probability $q$, we get

$$a_i = \frac{1}{n-1} \Big( p_i (nq - p_i) + (1 - p_i)(n - 1 - nq + p_i) \Big),$$

so that

$$2 p_i^2 - 2 p_i \left( n \left(q - \tfrac{1}{2}\right) + 1 \right) + nq - (1 - a_i)(n - 1) = 0. \tag{3}$$

For any fixed $a_i$ and $q$, we see that $p_i$ is a solution to a quadratic equation; in view of Assumption 1, this is the unique non-negative solution to this equation.
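As a worked check of equations (2) and (3), the sketch below computes the agreement rates from a hypothetical vector of error rates and then recovers each error rate from its agreement rate and the average $q$ by solving the quadratic. The sum of the two roots equals $n(q - 1/2) + 1$, so when this quantity is negative at most one root is non-negative, and the code returns the larger root. Function names and numerical values are illustrative.

```python
import math

def agreement_rates(p):
    """Agreement rates a_i of equation (2), written with the sum of the p_j."""
    n = len(p)
    total = sum(p)                                   # equals n * q
    return [(p_i * (total - p_i) + (1.0 - p_i) * (n - 1 - total + p_i)) / (n - 1)
            for p_i in p]

def error_rate_from_agreement(a_i, q, n):
    """Larger root of 2x^2 - 2bx + c = 0, with b and c read off equation (3)."""
    b = n * (q - 0.5) + 1.0
    c = n * q - (1.0 - a_i) * (n - 1)
    disc = b * b - 2.0 * c                           # discriminant divided by 4
    return 0.5 * (b + math.sqrt(max(disc, 0.0)))

p = [0.1, 0.2, 0.3, 0.4, 0.25, 0.15]                 # hypothetical error rates
n = len(p)
q = sum(p) / n
a = agreement_rates(p)
recovered = [error_rate_from_agreement(a_i, q, n) for a_i in a]
```

For the first labeller this gives $a_1 = 0.692$ and the quadratic returns $0.1$, the error rate that was used to build the example.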

#### Fixed-point equation

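For any vector $u = (u_1, \ldots, u_n)$ and any scalar $v$, let

$$\delta_i(u, v) = v + 4\frac{n-1}{n^2}(1 - 2u_i).$$

Observe that this is, up to the factor $n^2$, the discriminant of the quadratic equation (3) for $a = u$ and $(1 - 2q)^2 = v$. It is non-negative whenever $v \geq v_0(u)$, with

$$v_0(u) = \max\left( 4\frac{n-1}{n^2} \max_{i=1,\ldots,n} (2u_i - 1),\; 0 \right).$$

Define the function $f$ by

$$\forall u,\ \forall v \geq v_0(u), \quad f(u, v) = \left( \frac{1}{n-2} \sum_{i=1}^n \sqrt{\delta_i(u, v)} \right)^2.$$

###### Proposition 2

The mapping $v \mapsto f(u, v) - v$ is strictly increasing over $[v_0(u), +\infty)$.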

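Proof. For any $u$ and $v > v_0(u)$, we have $\delta_i(u, v) > 0$ for all $i$, so that $f$ is differentiable and its partial derivative is:

$$\frac{\partial f}{\partial v}(u, v) = \frac{1}{(n-2)^2} \left( \sum_{i=1}^n \sqrt{\delta_i(u, v)} \right) \left( \sum_{i=1}^n \frac{1}{\sqrt{\delta_i(u, v)}} \right).$$

Using Fact 1, we obtain

$$\frac{\partial f}{\partial v}(u, v) \geq \frac{n^2}{(n-2)^2} > 1.$$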

###### Fact 1

For any positive real numbers $\chi_1, \ldots, \chi_n$,

$$\left( \sum_{i=1}^n \chi_i \right) \left( \sum_{i=1}^n \frac{1}{\chi_i} \right) \geq n^2.$$

Proof. This is another way to express the fact that the arithmetic mean is greater than or equal to the harmonic mean.

In view of Proposition 2, there is at most one solution to the fixed-point equation $v = f(u, v)$ over $[v_0(u), +\infty)$, and this solution exists if and only if

$$f(u, v_0(u)) \leq v_0(u). \tag{4}$$

Moreover, the solution can be found by a simple binary search algorithm.

Now let $g_i$ be the function defined by

$$\forall u,\ \forall v \geq v_0(u), \quad g_i(u, v) = \frac{1}{2} + \frac{n}{4} \left( \sqrt{\delta_i(u, v)} - \sqrt{v} \right).$$

For any that satisfies (4), we define

###### Proposition 3

The unique solution to the fixed-point equation is . Moreover, we have and .

Proof. Let . It can be readily verified from (3) that . It then follows from Assumption 1 that and thus . Moreover,

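$$v = \left( 1 - \frac{2}{n} \sum_{i=1}^n p_i \right)^2 = \left( 1 - \frac{2}{n} \sum_{i=1}^n g_i(a, v) \right)^2 = \left( \frac{1}{2} \sum_{i=1}^n \sqrt{\delta_i(a, v)} - \frac{n}{2} \sqrt{v} \right)^2,$$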

so that, taking the square root of both terms, satisfies the fixed-point equation . This shows that and .
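The inversion from agreement rates to error rates can be sketched with a binary search on the fixed-point equation, using the displayed formulas for $\delta_i$, $v_0$, $f$ and $g_i$. This is a minimal illustration under stated assumptions: the function names are hypothetical, the upper end of the search bracket is found by doubling (a choice made here), and the input is the exact agreement vector of a hypothetical $p$, so the fixed point should be close to $(1 - 2q)^2$.

```python
import math

def delta(u_i, v, n):
    return v + 4.0 * (n - 1) / n ** 2 * (1.0 - 2.0 * u_i)

def v0(u):
    n = len(u)
    return max(4.0 * (n - 1) / n ** 2 * max(2.0 * u_i - 1.0 for u_i in u), 0.0)

def f(u, v):
    n = len(u)
    s = sum(math.sqrt(max(delta(u_i, v, n), 0.0)) for u_i in u)
    return (s / (n - 2)) ** 2

def solve_fixed_point(u, iters=200):
    """Binary search for v = f(u, v) on [v0(u), +inf); None if condition (4) fails."""
    lo = v0(u)
    if f(u, lo) > lo:
        return None
    hi = max(2.0 * lo, 1.0)
    while f(u, hi) < hi:                 # f(u, v) - v is increasing: bracket the root
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(u, mid) <= mid:             # still at or below the root
            lo = mid
        else:
            hi = mid
    return lo

def g(u, v):
    n = len(u)
    return [0.5 + 0.25 * n * (math.sqrt(max(delta(u_i, v, n), 0.0)) - math.sqrt(v))
            for u_i in u]

p = [0.1, 0.2, 0.3, 0.4, 0.25, 0.15]     # hypothetical error rates
n, total = len(p), sum(p)
a = [(p_i * (total - p_i) + (1.0 - p_i) * (n - 1 - total + p_i)) / (n - 1) for p_i in p]
v_star = solve_fixed_point(a)
p_recovered = g(a, v_star)
```

Unlike the quadratic (3), this route does not need $q$ as an input: the search recovers $(1 - 2q)^2$ from the agreement rates alone, and `g` then returns the individual error rates.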

#### Estimator

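Proposition 3 suggests that it is sufficient to estimate $a$ in order to retrieve $p$. We propose the following estimate of $a_i$,

$$\hat a_i(t) = \frac{t-1}{t} \hat a_i(t-1) + \frac{1}{t(n-1)\alpha^2} \sum_{j \neq i} 1\{X_i(t) X_j(t) = 1\}, \tag{5}$$

with $\hat a_i(0) = 0$ for all $i$. Note that

$$\hat a_i(t) = \frac{1}{t(n-1)\alpha^2} \sum_{s=1}^t \sum_{j \neq i} 1\{X_i(s) X_j(s) = 1\}, \tag{6}$$

so that $\hat a_i(t)$ is the empirical average of the number of labellers who agree with $i$ for tasks $1, \ldots, t$. We use the definition (5) to highlight the fact that $\hat a_i(t)$ can be computed in a streaming fashion.

The time complexity of the update (5) is $O(n^2)$ per task. Using the fact that $1\{xy = 1\} = (xy + |xy|)/2$ over $\{+1, -1, 0\}$, we can in fact update the estimator as follows,

$$\hat a_i(t) = \frac{t-1}{t} \hat a_i(t-1) + \frac{X_i(t) S(t) + |X_i(t)| (|N(t)| - 2)}{2t(n-1)\alpha^2},$$

where $S(t) = \sum_{j=1}^n X_j(t)$ is the sum of the labels of task $t$ and $|N(t)| = \sum_{j=1}^n |X_j(t)|$ is the total number of actual labellers for task $t$. The time complexity of the update is then $O(n)$ per task.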
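The equivalence between the pairwise update (5) and the faster update based on the sum of labels and the number of actual labellers can be checked numerically. The sketch below implements both and compares them on random labels; the function names, the random label stream and the value of the answer probability are assumptions made for the illustration.

```python
import random

def update_pairwise(a_hat, x, t, alpha):
    """Update (5): counts agreements pair by pair, quadratic in the number of labellers."""
    n = len(x)
    out = []
    for i in range(n):
        agree = sum(1 for j in range(n) if j != i and x[i] * x[j] == 1)
        out.append((t - 1) / t * a_hat[i] + agree / (t * (n - 1) * alpha ** 2))
    return out

def update_fast(a_hat, x, t, alpha):
    """Same update using the sum of labels and the number of answers, linear time."""
    n = len(x)
    s = sum(x)                                   # sum of the labels of the task
    m = sum(abs(x_j) for x_j in x)               # number of labellers who answered
    return [(t - 1) / t * a_i + (x_i * s + abs(x_i) * (m - 2)) / (2 * t * (n - 1) * alpha ** 2)
            for a_i, x_i in zip(a_hat, x)]

rng = random.Random(1)
n, alpha = 6, 0.8                                # hypothetical values
slow = [0.0] * n
fast = [0.0] * n
for t in range(1, 201):
    x = [rng.choice((1, -1, 0)) for _ in range(n)]
    slow = update_pairwise(slow, x, t, alpha)
    fast = update_fast(fast, x, t, alpha)
max_gap = max(abs(u - v) for u, v in zip(slow, fast))
```

Both updates only keep one number per labeller between tasks, which is what makes the estimator usable on a stream.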

#### Algorithm

Given this estimation of the vector of agreement rates, our estimation of the vector of error probabilities is

• if the fixed-point equation has a unique solution,

• otherwise.

We denote by the corresponding weight vector, with for all . These weights inferred from tasks are used to label task according to weighted majority vote, as defined by (1). We refer to this algorithm as the agreement-based (AB) algorithm.
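Putting the pieces together, the following is a minimal end-to-end sketch of an agreement-based labelling loop on simulated data. Several points are assumptions made here because the text does not pin them down: task $t$ is labelled with weights inferred from tasks $1, \ldots, t-1$, the vote starts as a plain majority vote, the previous weights are kept whenever the fixed-point equation has no solution, and estimated error rates are clipped away from 0 and 1 to keep the weights finite. All names and numerical values are hypothetical.

```python
import math
import random

def estimate_error_rates(a_hat, iters=60):
    """Map agreement rates to error rates via the fixed point; None if (4) fails."""
    n = len(a_hat)
    k = 4.0 * (n - 1) / n ** 2

    def root_delta(a_i, v):
        return math.sqrt(max(v + k * (1.0 - 2.0 * a_i), 0.0))

    def f(v):
        return (sum(root_delta(a_i, v) for a_i in a_hat) / (n - 2)) ** 2

    lo = max(k * max(2.0 * a_i - 1.0 for a_i in a_hat), 0.0)   # v0
    if f(lo) > lo:
        return None
    hi = max(2.0 * lo, 1.0)
    while f(hi) < hi:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) <= mid:
            lo = mid
        else:
            hi = mid
    return [0.5 + 0.25 * n * (root_delta(a_i, lo) - math.sqrt(lo)) for a_i in a_hat]

def run_ab(p, alpha, num_tasks, seed=0):
    """Simulate the model and label each task on the fly; return (p_hat, error rate)."""
    rng = random.Random(seed)
    n = len(p)
    a_hat = [0.0] * n
    w = [1.0] * n                       # assumption: majority vote until an estimate exists
    p_hat = None
    errors = 0
    for t in range(1, num_tasks + 1):
        g = rng.choice((1, -1))
        x = [0 if rng.random() >= alpha else (-g if rng.random() < p_i else g) for p_i in p]
        # Label the task with the weights learned from the previous tasks.
        score = sum(w_i * x_i for w_i, x_i in zip(w, x))
        guess = 1 if score > 0 else -1 if score < 0 else rng.choice((1, -1))
        errors += guess != g
        # Streaming update of the agreement rates, linear in n.
        s, m = sum(x), sum(abs(x_i) for x_i in x)
        a_hat = [(t - 1) / t * a_i + (x_i * s + abs(x_i) * (m - 2)) / (2 * t * (n - 1) * alpha ** 2)
                 for a_i, x_i in zip(a_hat, x)]
        new_p = estimate_error_rates(a_hat)
        if new_p is not None:           # assumption: otherwise keep the previous weights
            p_hat = [min(max(e, 1e-3), 1.0 - 1e-3) for e in new_p]
            w = [math.log((1.0 - e) / e) for e in p_hat]
    return p_hat, errors / num_tasks

p_true = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.1]    # hypothetical error rates
p_hat, error_rate = run_ab(p_true, alpha=1.0, num_tasks=5000)
max_err = max(abs(e - p_i) for e, p_i in zip(p_hat, p_true))
```

The loop never stores past labels: the only state carried from one task to the next is the vector of estimated agreement rates and the current weights.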

## 4 Performance guarantees

In this section, we provide performance guarantees for the AB algorithm, both in terms of statistical error and computational complexity, and show that its cumulative regret compared to an oracle that knows the latent parameter is finite, for any number of tasks.

### 4.1 Accuracy of the estimation

Let . This is a fixed parameter of the model. Observe that in view of Proposition 3 and the fact that . Theorem 1, proved in the Appendix, gives a concentration inequality on the estimation error at time (that is, after having processed tasks ). We denote by the norm in .

###### Theorem 1

For any ,

$$P\big(\|\hat p(t) - p\|_\infty \geq \varepsilon\big) \leq 2n \exp\left(-\frac{\gamma^3 \alpha^4}{8}\, t \varepsilon^2\right).$$

###### Corollary 1

The estimation error is of order

$$\|\hat p(t) - p\|_\infty = O\left( \frac{1}{\gamma^{3/2} \alpha^2} \sqrt{\frac{\log n}{t}} \right).$$

As shown by Corollary 1, Theorem 1 yields the error rate of our algorithm in the regime where and are fixed and

, but is much stronger than what one may obtain through an asymptotic analysis. Indeed, for any values of

and , Theorem 1 shows that the mean estimation error exhibits sub-Gaussian concentration, and directly yields confidence regions for the vector $p$. This may be useful for instance in a slightly different setting where the number of samples is not fixed, and one must find a stopping criterion ensuring that the estimation error is below some target accuracy. An example of this setting arises when one attempts to identify the best labellers under some constraint on the number of samples.
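A confidence radius and a stopping rule can be read off the concentration inequality by setting its right-hand side equal to a target failure probability. The sketch below assumes the bound reads $2n \exp(-\gamma^3 \alpha^4 t \varepsilon^2 / 8)$ and ignores any restriction the theorem may place on the range of $\varepsilon$; the function names, the failure probability `delta` and all numerical values (including the value of $\gamma$, which is treated as a given input) are hypothetical.

```python
import math

def confidence_radius(n, t, alpha, gamma, delta):
    """Radius eps such that 2n * exp(-gamma^3 alpha^4 t eps^2 / 8) equals delta."""
    return math.sqrt(8.0 * math.log(2.0 * n / delta) / (gamma ** 3 * alpha ** 4 * t))

def tasks_needed(n, eps, alpha, gamma, delta):
    """Smallest number of tasks for which the radius is at most eps."""
    return math.ceil(8.0 * math.log(2.0 * n / delta) / (gamma ** 3 * alpha ** 4 * eps ** 2))

n, alpha, gamma, delta, eps = 10, 0.8, 0.2, 0.05, 0.1    # hypothetical values
t_stop = tasks_needed(n, eps, alpha, gamma, delta)
radius = confidence_radius(n, t_stop, alpha, gamma, delta)
bound_at_radius = 2.0 * n * math.exp(-gamma ** 3 * alpha ** 4 * t_stop * radius ** 2 / 8.0)
```

The radius shrinks like $\sqrt{\log n / t}$, which matches the order given in Corollary 1.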

### 4.2 Complexity

In order to calculate , we only need to store the value of , which requires memory space. Further, we have seen that the update of requires operations. For any computing the fixed point (using a binary search) up to accuracy requires operations. The accuracy of our estimate is (omitting the factors and ), so that one should use . The time complexity of our algorithm is then . It is noted that any estimator of requires at least space and time, since one has to store at least one statistic per labeller, and each component of must be estimated. Therefore the complexity of the AB algorithm is optimal (up to logarithmic factors) in both time and space.

### 4.3 Regret

The regret is a performance metric that allows one to compare any algorithm to the optimal decision knowing the latent parameter $p$, given by some oracle. We define two notions of regret. The simple regret is the difference between the prediction error rate of our algorithm and that of the optimal decision for task $t$. By Proposition 1, the optimal decision follows from weighted majority voting with weights $w$ given by the oracle; we denote by $G^\star(t)$ the corresponding output for task $t$. The simple regret is then

$$r(t) = P(\hat G(t) \neq G(t)) - P(G^\star(t) \neq G(t)).$$

The second performance criterion is the cumulative regret, $R(t) = \sum_{s=1}^t r(s)$, that is the difference between the expected number of errors made by our algorithm and that of the optimal decision, for tasks $1, \ldots, t$.

Let and The following result, proved in the Appendix, shows that the cumulative regret of the AB algorithm is finite.

###### Theorem 2

Assume that . We have

$$r(t) \leq 2n \exp\left(-\frac{\gamma^3 \alpha^4 c^2}{8}\, t\right),$$

with , and

$$R(t) \leq \frac{16n}{\gamma^3 \alpha^4 c^2}.$$
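The passage from the bound on the simple regret to the bound on the cumulative regret is a geometric sum. As a sketch, assume that the cumulative regret is the sum of the simple regrets and that the simple regret bound reads $r(s) \leq 2n e^{-\kappa s}$, where $\kappa = \gamma^3 \alpha^4 c^2 / 8$ is a shorthand introduced here for the illustration:

```latex
R(t) = \sum_{s=1}^{t} r(s)
\;\le\; 2n \sum_{s=1}^{\infty} e^{-\kappa s}
\;=\; \frac{2n}{e^{\kappa} - 1}
\;\le\; \frac{2n}{\kappa}
\;=\; \frac{16n}{\gamma^{3} \alpha^{4} c^{2}}
```

The last inequality uses $e^{\kappa} - 1 \geq \kappa$. The resulting bound does not depend on $t$, which is the sense in which the cumulative regret is finite.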

## 5 Non-Stationary Environment

We have so far assumed a stationary environment so that the latent parameters stay constant over time. We shall see that, due to its streaming nature, our algorithm is also well-suited to non-stationary environments. In practice, the vector of error probabilities may vary over time due to several reasons, including:

• Classification needs: The type of data to label may change over time depending on the customers of crowdsourcing and the market trends.

• Learning: Most tasks (e.g., recognition of patterns in images, moderation tasks, spam detection) have a learning curve, and labellers become more reliable as they label more tasks.

• Aging: Some tasks require knowledge about the current situation (e.g., recognizing trends, analysis of the stock market) so that highly reliable labellers may become less accurate if they do not keep themselves up to date.

• Dynamic population: The population of labellers may change over time. While we assume that the total number of labellers is fixed, some labellers may periodically leave the system and be replaced by new labellers.

### 5.1 Model and algorithm

We assume that the number of labellers does not change over time but that $p$ varies with time at speed $\sigma$, so that for each labeller $i$,

$$|p_i(t) - p_i(s)| \leq \sigma |t - s|, \quad \forall t, s \geq 1.$$

We propose to adapt our algorithm to non-stationary environments by replacing empirical averages with exponentially weighted averages. Specifically, given an averaging parameter $\beta \in (0, 1)$, we define the estimate of the vector of agreement rates at time $t$ by

$$\hat a^\beta_i(t) = (1 - \beta)\, \hat a^\beta_i(t-1) + \beta\, \frac{X_i(t) S(t) + |X_i(t)| (|N(t)| - 2)}{2(n-1)\alpha^2}, \tag{7}$$

with $\hat a^\beta_i(0) = 0$ for all $i$. As in the stationary case, the estimate $\hat a^\beta(t)$ can be calculated as a function of $\hat a^\beta(t-1)$ and the sample $X(t)$ in $O(n)$ time, which fits the streaming setting. One may readily check that:

$$\hat a^\beta_i(t) = \sum_{s=1}^t \frac{\beta (1 - \beta)^{t-s}}{(n-1)\alpha^2} \sum_{j \neq i} 1\{X_i(s) X_j(s) = 1\}. \tag{8}$$
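The exponentially weighted update (7) and its unrolled form (8) can be compared numerically. The sketch below runs the recursion from a zero initial estimate on random labels and checks it against the explicit weighted sum; the function names, the label stream and the values of the averaging parameter and answer probability are illustrative assumptions.

```python
import random

def update_ewma(a_hat, x, beta, alpha):
    """Recursive update (7), linear in the number of labellers."""
    n = len(x)
    s = sum(x)
    m = sum(abs(x_j) for x_j in x)
    return [(1.0 - beta) * a_i + beta * (x_i * s + abs(x_i) * (m - 2)) / (2.0 * (n - 1) * alpha ** 2)
            for a_i, x_i in zip(a_hat, x)]

def closed_form(history, i, beta, alpha):
    """Explicit weighted sum (8) for labeller i, recomputed from the full history."""
    t = len(history)
    n = len(history[0])
    total = 0.0
    for s, x in enumerate(history, start=1):
        agree = sum(1 for j in range(n) if j != i and x[i] * x[j] == 1)
        total += beta * (1.0 - beta) ** (t - s) * agree / ((n - 1) * alpha ** 2)
    return total

rng = random.Random(2)
history = [[rng.choice((1, -1, 0)) for _ in range(6)] for _ in range(50)]
beta, alpha = 0.1, 0.7                               # hypothetical values
a_hat = [0.0] * 6
for x in history:
    a_hat = update_ewma(a_hat, x, beta, alpha)
gap = max(abs(a_hat[i] - closed_form(history, i, beta, alpha)) for i in range(6))
```

Only the recursive form is needed in practice; the explicit sum requires the whole history and is used here purely as a check.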

### 5.2 Performance guarantees

As in the stationary case, we derive concentration inequalities. Observe that the parameter now varies over time. The proof of Theorem 3 is given in the appendix.

###### Theorem 3

Assume that . Then for all ,

###### Corollary 2

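The estimation error is of order:

$$\|\hat p(t) - p(t)\|_\infty = O\left( \frac{1}{\gamma(t)^{3/2}} \left( \frac{\sqrt{\beta \log n}}{\alpha^2} + \frac{\sigma}{\beta} \right) \right).$$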

The expression of the estimation error shows that choosing $\beta$ involves a bias-variance tradeoff, where the variance term is proportional to $\sqrt{\beta}$ and the bias term is proportional to $\sigma / \beta$. We derive the order of the optimal value of $\beta$ minimizing the estimation error of our algorithm. This is of particular interest in the slow-variation regime where $\sigma$ is small, since in most practical situations the environment evolves slowly (e.g., at the timescale of hundreds of tasks).

###### Corollary 3

Letting , the estimation error is of order

 ||^p(t)−p(t)||∞=O⎛⎜⎝σ13(logn)3α43γ(t)32⎞⎟⎠.

## 6 Numerical Experiments

In this section, we investigate the performance of our Agreement-Based (AB) algorithm on both synthetic data, in stationary and non-stationary environments, and real-world datasets.

### 6.1 Stationary environment

We start with synthetic data in a stationary environment. We consider a generalized version of the hammer-spammer model with an even number of labellers $n$, the first half of the labellers being identical and informative and the second half of the labellers being non-informative, so that for and otherwise.

Figure 1 shows the estimation error on with respect to the number of tasks . There are labellers, all working on all tasks (that is ) and various values of the average error probability . The error is decreasing with in ) and increasing with , as expected: the problem becomes harder as approaches , since labellers become both less informative and less correlated.

Figure 2 shows the average estimation error of our algorithm for tasks as a function of the number of labellers . We compare our algorithm with an oracle which knows the values of the truth (note that this is different from the oracle used to define the regret, which knows the parameter and must guess the truth ). This estimator (which is optimal) simply estimates by the empirical probability that labeller disagrees with the truth. Interestingly, when increases, the error of our algorithm approaches that of the oracle, showing that our algorithm is nearly optimal.

In Figure 3 we present the impact of the answer probability $\alpha$ on the estimation error, for labellers. As expected, the estimation error decreases with $\alpha$. The dependency is approximately linear, which suggests that our upper bound on the estimation error given in Corollary 1, which is inversely proportional to $\alpha^2$, can be improved.

In Figure 4 we present the cumulative regret with respect to the number of tasks , for labellers and different values of the average error probability . As for the estimation error, the cumulative regret increases with , so that the problem becomes harder as approaches , as expected. We know from Theorem 2 that this cumulative regret is finite, for any that satisfies Assumption 1 (here, ). We observe that this regret is surprisingly low: for , the cumulative regret is close to 0, meaning that there is practically no difference with the oracle, which knows perfectly the parameter ; for , our algorithm makes less than prediction errors on average compared to the oracle.

### 6.2 Non-stationary environment

We now turn to non-stationary environments. We assume that the error probability of each labeller evolves as a sinusoid between and with some common frequency , namely . The phases are regularly spaced over , i.e., for all .

Figure 5 shows the true parameter of labeller 1 and the estimated value on a sample path for labellers, and various values of the averaging parameter . One clearly sees the bias-variance trade-off underlying the choice of : choosing a small yields small fluctuations but poor tracking performance, while close to leads to large fluctuations centered around the correct value. Furthermore, the natural intuition that is harder to estimate when it is close to is apparent. Finally, for properly chosen (here ), our algorithm effectively tracks the evolving latent parameter .

Figure 6 shows the prediction error rate of our algorithm, for , compared to that of majority vote and to that of an oracle that knows $p(t)$ exactly for all tasks $t$.

### 6.3 Real datasets

Finally, we test the performance of our algorithm on real, publicly available datasets (see [21, 23] and references therein), whose main characteristics are summarized in Table 1. When the data set has more than two possible labels (which is the case of the “Dog” and the “Web” datasets), say in the set , we merge all labels into label and all labels into label .

Each dataset contains the ground-truth of each task, which allows one to assess the prediction error rate of any algorithm. The results are reported in Table 2 for the following algorithms:

• Majority Vote (MV),

• a standard Expectation Maximization (EM) algorithm known as the DS estimator [4],

• our Agreement-Based (AB) algorithm.

Except for the “Temp” dataset, our algorithm yields some improvement compared to MV, like EM, and a significant performance gain for the “Web” dataset, for which more samples are available. The performances of AB and EM are similar for all datasets except for “Bird”, where the number of tasks is limited; this is remarkable given the much lower computational cost of AB, which is linear in the number of samples.

## 7 Conclusion

We have proposed a streaming algorithm for performing crowdsourced data classification. The main feature of this algorithm is to adopt a “direct approach” by inverting the relationship between the agreement rates between various labellers and the latent parameter $p$. This Agreement-Based (AB) algorithm is not a spectral algorithm and does not require storing the task-labeller matrix. Apart from a simple line search, AB does not involve an iterative scheme such as EM or BP.

We have provided performance guarantees for our algorithm in terms of estimation errors. Using this key result, we have shown that our algorithm is optimal in terms of both time complexity (up to logarithmic factors) and regret (compared to the optimal decision). Specifically, we have proved that the cumulative regret is finite, independently of the number of tasks; as a comparison, the cumulative regret of a basic algorithm based on majority vote increases linearly with the number of tasks. We have assessed the performance of AB on both synthetic and real-world data; for the latter, we have seen that AB generally behaves like EM, for a much lower time complexity.

We foresee two directions for future work. On the theoretical side, we want to investigate the extension of AB to more intricate models featuring non-binary labels and where the error probability of labellers depends on the considered task. We would also like to extend our analysis to the sparse regime considered in [9], where the number of answers on a given task does not grow with $n$, so that $\alpha$ is proportional to $1/n$. On the practical side, since AB is designed to work with large datasets provided in real-time as a stream, we hope to be able to evaluate its performance on a real-world system.

## References

• [1] P. S. Albert and L. E. Dodd. A cautionary note on the robustness of latent class models for estimating diagnostic error without a gold standard. Biometrics, 60(2):427–435, 2004.
• [2] X. Chen, P. N. Bennett, K. Collins-Thompson, and E. Horvitz. Pairwise ranking aggregation in a crowdsourced setting. In Proceedings of the sixth ACM international conference on Web search and data mining, pages 193–202. ACM, 2013.
• [3] N. Dalvi, A. Dasgupta, R. Kumar, and V. Rastogi. Aggregating crowdsourced binary ratings. In Proceedings of the 22nd international conference on World Wide Web, pages 285–294. International World Wide Web Conferences Steering Committee, 2013.
• [4] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):20–28, 1979.
• [5] A. Ghosh, S. Kale, and R. P. McAfee. Who moderates the moderators?: crowdsourcing abuse detection in user-generated content. In Proceedings 12th ACM Conference on Electronic Commerce (EC-2011), San Jose, CA, USA, June 5-9, 2011, pages 167–176, 2011.
• [6] C.-J. Ho, S. Jabbari, and J. W. Vaughan. Adaptive task assignment for crowdsourced classification. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 534–542, 2013.
• [7] C.-J. Ho and J. W. Vaughan. Online task assignment in crowdsourcing markets. In AAAI, volume 12, pages 45–51, 2012.
• [8] S. L. Hui and S. D. Walter. Estimating the error rates of diagnostic tests. Biometrics, pages 167–171, 1980.
• [9] D. R. Karger, S. Oh, and D. Shah. Iterative learning for reliable crowdsourcing systems. In Advances in Neural Information Processing Systems 24, pages 1953–1961, 2011.
• [10] D. R. Karger, S. Oh, and D. Shah. Efficient crowdsourcing for multi-class labeling. ACM SIGMETRICS Performance Evaluation Review, 41(1):81–92, 2013.
• [11] D. R. Karger, S. Oh, and D. Shah. Budget-optimal task allocation for reliable crowdsourcing systems. Operations Research, 62(1):1–24, 2014.
• [12] Q. Liu, J. Peng, and A. T. Ihler. Variational inference for crowdsourcing. In F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 692–700, 2012.
• [13] Y. Liu and M. Liu. An online learning approach to improving the quality of crowd-sourcing. In Proc. of ACM SIGMETRICS, 2015.
• [14] S. Nitzan and J. Paroush. Optimal decision rules in uncertain dichotomous choice situations. International Economic Review, pages 289–297, 1982.
• [15] F. Parisi, F. Strino, B. Nadler, and Y. Kluger. Ranking and combining multiple predictors without labeled data. Proceedings of the National Academy of Sciences, 111(4):1253–1258, 2014.
• [16] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. The Journal of Machine Learning Research, 11:1297–1322, 2010.
• [17] L. Shapley and B. Grofman. Optimizing group judgmental accuracy in the presence of interdependencies. Public Choice, 43(3):329–343, 1984.
• [18] P. Smyth, U. Fayyad, M. Burl, P. Perona, and P. Baldi. Inferring ground truth from subjective labelling of Venus images. Advances in Neural Information Processing Systems, pages 1085–1092, 1995.
• [19] D. Wang, T. Abdelzaher, L. Kaplan, and C. C. Aggarwal. Recursive fact-finding: A streaming approach to truth estimation in crowdsourcing applications. In Distributed Computing Systems (ICDCS), 2013 IEEE 33rd International Conference on, pages 530–539. IEEE, 2013.
• [20] P. Welinder and P. Perona. Online crowdsourcing: rating annotators and obtaining cost-effective labels. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pages 25–32. IEEE, 2010.
• [21] J. Whitehill, T.-f. Wu, J. Bergsma, J. R. Movellan, and P. L. Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in neural information processing systems, pages 2035–2043, 2009.
• [22] Y. Zhang, X. Chen, D. Zhou, and M. I. Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In Advances in Neural Information Processing Systems 27, pages 1260–1268, 2014.
• [23] D. Zhou, Q. Liu, J. C. Platt, C. Meek, and N. B. Shah. Regularized minimax conditional entropy for crowdsourcing. arXiv preprint arXiv:1503.07240, 2015.

## Appendix A Proof of Theorem 1

We denote by and the norm and the norm in , respectively.

### A.1 Outline

The proof consists of three steps:

1. Concentration of . Using Hoeffding’s inequality, we prove a concentration inequality on .

2. Fixed-point uniqueness. From the concentration of , we deduce that concentrates around , so that the fixed-point equation has a unique solution with high probability.

3. Smooth dependency between and . When a unique fixed point exists, the mapping depends smoothly on each component of , which implies the concentration of .
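Step 1 can be illustrated with a small simulation. This is a minimal sketch, assuming that the estimated quantity is the average agreement rate $a_i$ of labeller $i$ with the other labellers, that labels take values in $\{-1,+1\}$, and that labeller $i$ errs independently with probability $p_i$. The function names, the seed and the example probabilities are hypothetical.

```python
import random


def true_agreement(p):
    """Average probability that labeller i agrees with another labeller j != i."""
    n = len(p)
    return [
        sum(p[i] * p[j] + (1 - p[i]) * (1 - p[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def empirical_agreement(p, num_tasks, seed=0):
    """Empirical agreement rates after num_tasks independent tasks."""
    rng = random.Random(seed)
    n = len(p)
    agree = [0] * n
    for _ in range(num_tasks):
        truth = rng.choice([-1, 1])
        # Labeller i flips the true label with probability p[i].
        labels = [-truth if rng.random() < pi else truth for pi in p]
        for i in range(n):
            agree[i] += sum(labels[i] == labels[j] for j in range(n) if j != i)
    return [count / (num_tasks * (n - 1)) for count in agree]


if __name__ == "__main__":
    p = [0.1, 0.2, 0.3, 0.4, 0.45]
    a = true_agreement(p)
    a_hat = empirical_agreement(p, num_tasks=20000)
    print(max(abs(x - y) for x, y in zip(a, a_hat)))
```

For each labeller, the per-task agreement fraction lies in $[0,1]$ and is independent across tasks, so Hoeffding's inequality bounds the deviation of the empirical rate after $t$ tasks by $2\exp(-2t\varepsilon^2)$ for any $\varepsilon>0$.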

### A.2 Intermediate results

Recall that (4) is a necessary and sufficient condition for the existence and uniqueness of a solution to the fixed-point equation . Proposition 4 provides a simpler, sufficient condition. For any , let

$$
v_1(u) = \frac{2}{n}\sum_{i=1}^{n}(2u_i - 1).
$$

###### Proposition 4

If then there is a unique solution to the fixed-point equation .

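Proof. By the Cauchy-Schwarz inequality,

$$
\sum_{i=1}^{n}\sqrt{\delta_i(u,v)} \le \sqrt{n\sum_{i=1}^{n}\delta_i(u,v)},
$$

so that for all ,

$$
\begin{aligned}
f(u,v) &\le \frac{n}{(n-2)^2}\sum_{i=1}^{n}\delta_i(u,v) \\
&= \frac{1}{(n-2)^2}\left(n^2 v - \frac{4(n-1)}{n}\sum_{i=1}^{n}(2u_i - 1)\right) \\
&= \frac{n^2 v - 2 v_1(u)(n-1)}{(n-2)^2}.
\end{aligned}
$$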

In particular, since $n^2 - (n-2)^2 = 4(n-1)$,

$$
f(u,v) - v \le \frac{2(n-1)}{(n-2)^2}\bigl(2v - v_1(u)\bigr).
$$

If , then and there is a unique solution to the fixed-point equation .

Proposition 5 will be used to prove that the fixed-point equation has a unique solution for any in some neighborhood of .

###### Proposition 5

We have .

Proof. By the definition of ,

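$$
\begin{aligned}
(n-1)\sum_{i=1}^{n} a_i &= \sum_{i\neq j}\bigl(p_i p_j + (1-p_i)(1-p_j)\bigr) \\
&= \left(\sum_{i=1}^{n} p_i\right)^2 + \left(n - \sum_{i=1}^{n} p_i\right)^2 - \sum_{i=1}^{n}\bigl(p_i^2 + (1-p_i)^2\bigr).
\end{aligned}
$$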
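The second equality follows from expanding the squares, since $\sum_{i\neq j} x_i x_j = (\sum_i x_i)^2 - \sum_i x_i^2$ for both $x_i = p_i$ and $x_i = 1-p_i$. As a quick numerical sanity check, here is a short Python sketch with hypothetical function names and example values:

```python
def pairwise_sum(p):
    """Left-hand side: sum over ordered pairs i != j of the agreement probability."""
    n = len(p)
    return sum(
        p[i] * p[j] + (1 - p[i]) * (1 - p[j])
        for i in range(n)
        for j in range(n)
        if i != j
    )


def closed_form(p):
    """Right-hand side: squares of the sums minus the diagonal terms."""
    n = len(p)
    s = sum(p)
    return s ** 2 + (n - s) ** 2 - sum(x ** 2 + (1 - x) ** 2 for x in p)


if __name__ == "__main__":
    p = [0.05, 0.2, 0.35, 0.4, 0.45, 0.1]
    print(pairwise_sum(p), closed_form(p))
```

The closed form needs only $O(n)$ operations, whereas the double sum needs $O(n^2)$.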

Using the fact that for all and , we obtain the lower bound:

$$
(n-1)\sum_{i=1}^{n} a_i \ge \frac{n}{2}\bigl(n(1-2q)^2 + n - 1\bigr).
$$

In particular,

$$
v_1(a) = \frac{2}{n}\sum_{i=1}^{n}(2a_i - 1) \ge \frac{2n}{n-1}(1-2q)^2 \ge 2v(a).
$$

The result follows from the fact that (see Proposition 3).

Let be the set of vectors for which there is a unique solution to the fixed-point equation . The following result shows the Lipschitz continuity of the function on .

###### Proposition 6

For all in ,

$$
|v(u) - v(u')| \le \frac{8}{n}\,\|u - u'\|_1.
$$

Proof. By definition we have for any . Since (see Proposition 2), by the implicit function theorem, is differentiable in the interior of and

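$$
\forall i = 1,\ldots,n, \qquad \frac{\partial v}{\partial u_i} = \frac{\dfrac{\partial f}{\partial u_i}}{1 - \dfrac{\partial f}{\partial v}}.
$$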

Observing that is positive in the interior of , we calculate the derivatives of , dropping the arguments for convenience:

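$$
\begin{aligned}
\frac{\partial f}{\partial v} &= \frac{1}{(n-2)^2}\left(\sum_{i=1}^{n}\sqrt{\delta_i}\right)\left(\sum_{i=1}^{n}\frac{1}{\sqrt{\delta_i}}\right), \\
\frac{\partial f}{\partial u_i} &= -\frac{8(n-1)}{n^2(n-2)^2}\sum_{j=1}^{n}\sqrt{\delta_j/\delta_i}.
\end{aligned}
$$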
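As a sanity check on these two expressions, the Python sketch below compares them with central finite differences. It assumes $f(u,v) = \frac{1}{(n-2)^2}\bigl(\sum_i \sqrt{\delta_i(u,v)}\bigr)^2$ with $\delta_i(u,v) = v - \frac{4(n-1)}{n^2}(2u_i - 1)$, definitions inferred from the bounds and derivatives in this appendix rather than quoted from the main text. The function names and test point are hypothetical.

```python
import math


def delta(u, v):
    """Assumed definition: delta_i(u, v) = v - 4(n-1)(2u_i - 1)/n^2."""
    n = len(u)
    return [v - 4 * (n - 1) * (2 * ui - 1) / n ** 2 for ui in u]


def f(u, v):
    """Assumed definition: f(u, v) = (sum_i sqrt(delta_i))^2 / (n-2)^2."""
    n = len(u)
    return sum(math.sqrt(d) for d in delta(u, v)) ** 2 / (n - 2) ** 2


def df_dv(u, v):
    """Closed-form partial derivative of f with respect to v."""
    n = len(u)
    d = delta(u, v)
    return sum(math.sqrt(x) for x in d) * sum(1 / math.sqrt(x) for x in d) / (n - 2) ** 2


def df_du(u, v, i):
    """Closed-form partial derivative of f with respect to u_i."""
    n = len(u)
    d = delta(u, v)
    return -8 * (n - 1) / (n ** 2 * (n - 2) ** 2) * sum(math.sqrt(dj / d[i]) for dj in d)


def central_diff(g, x, h=1e-6):
    """Second-order finite-difference approximation of g'(x)."""
    return (g(x + h) - g(x - h)) / (2 * h)


if __name__ == "__main__":
    u, v = [0.6, 0.7, 0.8, 0.65, 0.75], 1.0  # all delta_i are positive here
    print(df_dv(u, v), central_diff(lambda x: f(u, x), v))
    print(df_du(u, v, 2), central_diff(lambda x: f(u[:2] + [x] + u[3:], v), u[2]))
```

The check is only meaningful where all $\delta_i$ are positive, which matches the interior condition under which the derivatives are computed.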

Now for all ,

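$$
\begin{aligned}
\frac{\partial f}{\partial v} &= \frac{1}{(n-2)^2}\left[\left(\sum_{j=1}^{n}\sqrt{\delta_j}\right)\left(\sum_{j\neq i}\frac{1}{\sqrt{\delta_j}}\right) + \sum_{j=1}^{n}\sqrt{\delta_j/\delta_i}\right] \\
&\ge \frac{1}{(n-2)^2}\left[\left(\sum_{j\neq i}\sqrt{\delta_j}\right)\left(\sum_{j\neq i}\frac{1}{\sqrt{\delta_j}}\right) + \sum_{j=1}^{n}\sqrt{\delta_j/\delta_i}\right] \\
&\ge \frac{1}{(n-2)^2}\left[(n-1)^2 + \sum_{j=1}^{n}\sqrt{\delta_j/\delta_i}\right] \\
&\ge 1 + \frac{1}{(n-2)^2}\sum_{j=1}^{n}\sqrt{\delta_j/\delta_i},
\end{aligned}
$$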

where we applied Fact 1 to get the second inequality. Thus

$$
\frac{\partial f}{\partial v} - 1 \ge \frac{n^2}{8(n-1)}\left|\frac{\partial f}{\partial u_i}\right|,
$$

and

 ∣∣∣∂v∂ui∣∣∣≤8(n−<