## 1 Introduction

Crowdsourcing platforms provide labor markets in which micro-tasks are electronically distributed to any workers who are willing to complete them for a small fee. In typical crowdsourcing scenarios, such as those on Amazon’s Mechanical Turk, a requester first posts a collection of tasks, for example a set of images to be labelled. Then, from a pool of workers, whoever is willing can pick up a subset of those tasks and provide her labels for a small payment. Typically, a fixed payment per task is predetermined and agreed upon between the requester and the workers, and hence a worker is paid an amount proportional to the number of tasks she answers. Further, as verifying the correctness of the answers is difficult, and as requesters are afraid of losing reputation among the crowd, a requester typically chooses to pay for every label she gets regardless of its correctness. Hence, the total payment the requester makes to the workers is proportional to the total number of labels she collects.

One of the major issues in such crowdsourcing platforms is label quality assurance. Some workers are spammers trying to make easy money, and even those who are willing to work frequently make mistakes, as the reward is small and the tasks are tedious. To correct for these errors, a common approach is to introduce redundancy by collecting answers from multiple workers on the same task and aggregating these responses using a scheme such as majority voting. A fundamental problem of interest in such a system is how to maximize the accuracy of the aggregated answers while minimizing the cost. Collecting multiple labels per task can improve the accuracy of our estimates, but increases the budget proportionally. Given a fixed number of tasks to be labelled, a requester hopes to achieve the best trade-off between the accuracy and the budget, i.e. the total number of responses the requester collects on the crowdsourcing platform. The requester has two design choices in achieving this goal: the task assignment and the inference algorithm.
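To make the redundancy-plus-aggregation idea concrete, here is a minimal sketch of majority voting over binary labels. The label encoding (+1 for positive, -1 for negative), the tie-breaking rule, and the task names are illustrative assumptions, not part of the source.

```python
def majority_vote(labels):
    """Aggregate +1/-1 labels for one task; ties are broken toward +1 (an arbitrary choice)."""
    return 1 if sum(labels) >= 0 else -1


def aggregate(responses):
    """responses maps a task id to the list of +1/-1 labels collected for it."""
    return {task: majority_vote(labels) for task, labels in responses.items()}


# Hypothetical responses: three labels for one image, five for another.
responses = {"img_a": [1, 1, -1], "img_b": [-1, -1, -1, 1, -1]}
estimates = aggregate(responses)

# The budget is the total number of labels collected, one unit per label.
budget = sum(len(labels) for labels in responses.values())
```

Each extra label per task raises the budget by one unit, which is the accuracy-versus-cost tension described above.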

In typical crowdsourcing platforms, tasks are assigned as follows. Since the workers are fleeting, the requester has no control over who will be the next arriving worker. Workers arrive in an online fashion, complete the tasks that they are given, and leave. Each arriving worker is completely new and you may never get her back. Nevertheless, it might be possible to improve accuracy under the same budget, by designing better task assignments. The requester has the following control over the task assignment. At each point in time, we have the control over which tasks to assign to the next arriving worker. The requester is free to use all the information collected thus far, including all the task assignments to previous workers and the answers collected on those assigned tasks. By adaptively identifying tasks that are more difficult and assigning more (future) workers on those tasks, one hopes to be more efficient in the budget-accuracy trade-off. This paper makes this intuition precise, by studying a canonical crowdsourcing model and comparing the fundamental trade-offs between adaptive schemes and non-adaptive schemes. Unlike adaptive schemes, a non-adaptive scheme fixes all the task assignments before any labels are collected and does not allow future assignments to adapt to the labels collected thus far for each arriving worker. Precise definitions of adaptive and non-adaptive task assignments are provided in Section 1.1.

While adaptive task assignments handle the heterogeneity in the task difficulties by assigning more workers to difficult tasks, learning the unknown difficulty of the tasks (as well as the unknown heterogeneity of the worker reliabilities) requires inference: estimating the latent parameters and the ground truth labels from the crowdsourced responses collected thus far. Some workers are more reliable than others, but we do not know their latent reliabilities. Some tasks are more difficult than others, but we do not know their latent difficulty levels. We only get to observe the answers provided by those workers on their assigned tasks. Nevertheless, by comparing responses from multiple workers, we can estimate the true labels and the difficulties of the tasks, and use them in subsequent steps of our inference algorithm to learn the reliability of the workers. We perform such inferences at several points in time over the course of collecting all the labels we have budgeted for. The inference algorithm outputs the current estimates for the labels and difficulty levels of the tasks, which are used at subsequent times to assign tasks.

### 1.1 Model and problem formulation

We assume that the requester has $m$ binary classification tasks to be labelled by querying a crowdsourcing platform multiple times. For example, those might be image classification tasks, where the requester wants to classify $m$ images as either suitable for children ($+1$) or not ($-1$). The requester has a budget $\Gamma$ on how many responses she can collect on the crowdsourcing platform, assuming one unit of payment is made for each response collected. We use $\Gamma$ interchangeably to refer to both a target budget and the budget used by a particular task assignment scheme (as defined in (1)), and it should be clear from the context which one we mean. We want to find the true labels by querying noisy workers who arrive in an online fashion, one at a time.

Task assignment and inference. Typical crowdsourcing systems are modeled as discrete-time systems in which a new worker arrives at each time step. At time $j$, the requester chooses an action $T_j \subseteq [m]$, which is a subset of tasks to be assigned to the $j$-th arriving worker. Then, the $j$-th arriving worker provides her answer $A_{ij} \in \{+1,-1\}$ for each task $i \in T_j$. We use the index $j$ to denote both the $j$-th time step in this discrete-time system and the $j$-th arriving worker. At this point (at the end of the $j$-th time step), all previous responses are stored in a sparse matrix $A \in \{0,+1,-1\}^{m \times j}$, and this data matrix grows by one column at each time step. We let $A_{ij} = 0$ if task $i$ is not assigned to worker $j$, i.e. $i \notin T_j$, and otherwise we let $A_{ij} \in \{+1,-1\}$ be worker $j$’s response on task $i$. At the next time $j+1$, the next task assignment $T_{j+1}$ is chosen, and this process is repeated. At time $j$, the action (or the task assignment) can depend on all previously collected responses up to the current time step, stored in the sparse (growing) matrix $A \in \{0,+1,-1\}^{m \times (j-1)}$. This process is repeated until the task assignment scheme decides to stop, typically when the total number of collected responses (the number of nonzero entries in $A$) meets a certain budget constraint or when a certain target accuracy is estimated to be met.

We consider both a non-adaptive scenario and an adaptive scenario. In a non-adaptive scenario, the number $n$ of workers to be recruited is pre-determined (and hence the termination time is set to be $n$), and the task assignments $T_j$ for all $j \in [n]$ are also fixed before any response is collected. In an adaptive scenario, the requester chooses the $T_j$’s in an online fashion based on all the answers collected thus far. For both adaptive and non-adaptive scenarios, when we have determined that we have collected all the data we need, an inference algorithm is applied to the collected data $A \in \{0,+1,-1\}^{m \times n}$ to output an estimate $\hat{t}_i \in \{+1,-1\}$ of the ground truth label of the $i$-th task for each $i \in [m]$. Note that we use $n$ to denote the total number of workers recruited, which is a random variable under the adaptive scenario. Also, note that the estimated labels for all the tasks do not have to be output simultaneously at the end, and we can choose to output estimated labels on some of the tasks in the middle of the process before termination. The average accuracy of our estimates is measured by the average probability of error

$$ \frac{1}{m} \sum_{i=1}^{m} \mathbb{P}[\, t_i \neq \hat{t}_i \,] $$

under a probabilistic model to be defined later in this section in Eq. (2).

The total budget used in one instance of such a process is measured by the total number of responses collected, which is equal to the number of nonzero entries in $A$. This inherently assumes that there is a prefixed fee of one unit for each response that is agreed upon, and the requester pays this constant fee for every label that is collected. The expected budget used by a particular task assignment scheme will be denoted by

$$ \Gamma \equiv \mathbb{E}[\,\text{number of nonzero entries in } A\,], \tag{1} $$

where the expectation is over all the randomness in the model (the problem parameters representing the quality of the tasks and the quality of the workers, and the noisy responses from workers) and any randomness used in the task assignment. We are interested in designing task assignment schemes and inference algorithms that achieve the best accuracy within a target expected budget, under the following canonical model of how workers respond to tasks.

Worker responses. We assume that when a task is assigned to a worker, the response follows a probabilistic model introduced by [37], which is a recent generalization of the Dawid-Skene model originally introduced by [6]. Precisely, each newly arriving worker is parametrized by a latent worker quality parameter $p_j \in [0,1]$ (for the $j$-th arriving worker). Each task is parametrized by a latent task quality parameter $q_i \in [0,1]$ (for the $i$-th task). When worker $j$ is assigned task $i$, the generalized Dawid-Skene model assumes that the response $A_{ij} \in \{+1,-1\}$ is a random variable distributed as

$$ A_{ij} = \begin{cases} +1 & \text{with probability } q_i p_j + \bar{q}_i \bar{p}_j, \\ -1 & \text{with probability } \bar{q}_i p_j + q_i \bar{p}_j, \end{cases} \tag{2} $$

conditioned on the parameters $q_i$ and $p_j$, where $\bar{q}_i = 1 - q_i$ and $\bar{p}_j = 1 - p_j$. The task parameter $q_i$ represents the probability that a task is perceived as a positive task by a worker, and the worker parameter $p_j$ represents the probability that the worker reports the label as she perceives it, i.e. does not make a mistake in labelling the task. Concretely, when a task is presented to any worker, the task is perceived as a positive task with probability $q_i$ or as a negative task otherwise, independent of any other events. Let $\tilde{t}_{ij} \in \{+1,-1\}$ denote this perceived label of task $i$ as seen by worker $j$. Conditioned on this perceived label of the task, a worker with parameter $p_j$ makes a mistake with probability $1 - p_j$. She provides a ‘correct’ label as she perceives it with probability $p_j$, or provides an ‘incorrect’ label with probability $1 - p_j$. Hence, the response $A_{ij}$ follows the distribution in (2). The response is, for example, a positive label if the task is perceived as a positive task and the worker does not make a mistake (which happens with probability $q_i p_j$), or if the task is perceived as a negative task and the worker makes a mistake (which happens with probability $\bar{q}_i \bar{p}_j$). Alternatively, the task parameter $q_i$ represents the probability that a task is labeled as a positive task by a perfect worker, i.e. a worker with parameter $p_j = 1$. That is, $q_i$ represents the inherent ambiguity of the task being labeled positive. The strengths and weaknesses of this model are discussed in comparison to related work in Section 1.2.
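The two-stage description above (perceive the task, then report truthfully or not) can be simulated directly. The sketch below writes `q` for the task parameter and `p` for the worker parameter; the function names and the specific parameter values are assumptions chosen for illustration.

```python
import random


def sample_response(q, p, rng):
    """Draw one response in {+1, -1} for task parameter q and worker parameter p."""
    perceived = 1 if rng.random() < q else -1   # task is perceived as positive w.p. q
    truthful = rng.random() < p                 # worker reports her perception w.p. p
    return perceived if truthful else -perceived


def prob_positive(q, p):
    """Closed-form probability of a +1 response: q*p + (1-q)*(1-p)."""
    return q * p + (1 - q) * (1 - p)


rng = random.Random(0)
q, p = 0.8, 0.9   # illustrative values only
draws = [sample_response(q, p, rng) for _ in range(20000)]
empirical = sum(1 for a in draws if a == 1) / len(draws)
```

With these values the closed form gives 0.8 * 0.9 + 0.2 * 0.1 = 0.74, and the empirical frequency should land close to it. Note that a task with `q = 0.5` yields a fair coin flip for every worker, which is why such tasks carry no recoverable signal.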

Prior distribution on worker reliability. We assume that the worker parameters $p_j$’s are drawn i.i.d. according to some prior distribution $\mathcal{P}$. For example, each arriving worker might be sampled with replacement from a pool of workers, and $\mathcal{P}$ denotes the discrete distribution of the quality parameters of the pool. The individual reliabilities $p_j$’s are hidden from us, and the prior distribution $\mathcal{P}$ is also unknown. We assume we only know some statistics of the prior distribution $\mathcal{P}$, namely

$$ \mu \equiv \mathbb{E}_{\mathcal{P}}[\, 2p_j - 1 \,], \qquad \sigma^2 \equiv \mathbb{E}_{\mathcal{P}}[\, (2p_j - 1)^2 \,], \tag{3} $$

where $p_j$ is a random variable distributed as $\mathcal{P}$, $\mu \in [-1,1]$ is the (shifted and scaled) average reliability of the crowd, and $\sigma^2 \in [0,1]$ is the key quantity capturing the collective quality of the crowd as a whole. Intuitively, when all workers are truthful and have $p_j$ close to one, the collective reliability $\sigma^2$ will be close to its maximum value of one. On the other hand, if most of the workers are giving completely random answers with $p_j$’s close to a half, then $\sigma^2$ will be close to its minimum value of zero. The fundamental trade-off between the accuracy and the budget will primarily depend on the distribution of the crowd via $\sigma^2$. We do not impose any conditions on the distribution $\mathcal{P}$.
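As a worked example of the crowd statistics, the sketch below computes them for a discrete worker pool, assuming the average reliability is the mean of 2p - 1 and the collective reliability is the mean of (2p - 1)^2, which is the reading suggested by the limiting cases described above. The pool, the weights, and the function name are hypothetical.

```python
def crowd_statistics(pool):
    """pool is a list of (p, weight) pairs describing a discrete worker prior.

    Returns (mu, sigma2): the weighted means of (2p - 1) and (2p - 1)**2.
    """
    total = sum(weight for _, weight in pool)
    mu = sum(weight * (2 * p - 1) for p, weight in pool) / total
    sigma2 = sum(weight * (2 * p - 1) ** 2 for p, weight in pool) / total
    return mu, sigma2


# Hypothetical "spammer-hammer" crowd: half perfect workers, half random guessers.
pool = [(1.0, 0.5), (0.5, 0.5)]
mu, sigma2 = crowd_statistics(pool)

# A crowd of pure guessers carries no signal at all.
mu_spam, sigma2_spam = crowd_statistics([(0.5, 1.0)])
```

For the mixed pool both statistics equal 0.5, while the all-guesser pool gives zero for both, matching the intuition that the collective reliability ranges from zero (useless crowd) to one (perfect crowd).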

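Prior distribution on task quality. We assume that the task parameters $q_i$’s are drawn i.i.d. according to some prior distribution $\mathcal{Q}$. The individual difficulty of a task with a quality parameter $q_i$ is naturally captured by

$$ \lambda_i \equiv (2q_i - 1)^2, \tag{4} $$

as tasks with $q_i$ close to a half are confusing and ambiguous tasks and hence difficult to correctly label ($\lambda_i$ close to zero), whereas tasks with $q_i$ close to zero or one are unambiguous tasks and easy to correctly label ($\lambda_i$ close to one). The average difficulty and the collective difficulty of tasks drawn from a prior distribution $\mathcal{Q}$ are captured by the quantities $\rho \in [0,1]$ and $\lambda \in [0,1]$, defined as

$$ \rho \equiv \mathbb{E}_{\mathcal{Q}}[\, (2q_i - 1)^2 \,], \qquad \lambda \equiv \Big( \mathbb{E}_{\mathcal{Q}}\Big[ \frac{1}{(2q_i - 1)^2} \Big] \Big)^{-1}, \tag{5} $$

where $q_i$ is distributed as $\mathcal{Q}$. The fundamental budget-accuracy trade-off depends on $\mathcal{Q}$ primarily via this $\lambda$. Other quantities that will show up in our main results are the worst-case and best-case difficulties in the given set of tasks (conditioned on all the $q_i$’s), defined as

$$ \lambda_{\min} \equiv \min_{i \in [m]} \lambda_i, \qquad \lambda_{\max} \equiv \max_{i \in [m]} \lambda_i. \tag{6} $$

When we refer to similar quantities from the population distributed as $\mathcal{Q}$, we abuse the notation and denote $\lambda_{\min} = \min_{q_i \in \mathrm{supp}(\mathcal{Q})} (2q_i - 1)^2$ and $\lambda_{\max} = \max_{q_i \in \mathrm{supp}(\mathcal{Q})} (2q_i - 1)^2$. The individual task parameters $q_i$’s are hidden from us. We do not have access to the prior distribution $\mathcal{Q}$ on the task qualities $q_i$’s, but we assume we know the statistics $\rho$, $\lambda$, $\lambda_{\min}$, and $\lambda_{\max}$, and we assume we also know a quantized version of the prior distribution on the task difficulties $\lambda_i$’s, which we explain below.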
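The task-side statistics can be illustrated the same way. The sketch below assumes the individual difficulty of a task with parameter q is (2q - 1)^2, the average difficulty is its mean, and the collective difficulty is its harmonic mean (the inverse of the mean inverse difficulty); these forms are inferred from the surrounding description and should be treated as assumptions. The task set and function name are hypothetical.

```python
def task_statistics(qs):
    """Return difficulty statistics for a list of task parameters qs.

    A value near 1 means an easy (unambiguous) task; near 0 means a hard one.
    """
    lam = [(2 * q - 1) ** 2 for q in qs]
    average = sum(lam) / len(lam)
    if min(lam) == 0:
        collective = 0.0   # a single fully ambiguous task drives the harmonic mean to zero
    else:
        collective = 1.0 / (sum(1.0 / value for value in lam) / len(lam))
    return {"average": average, "collective": collective,
            "worst": min(lam), "best": max(lam)}


# Hypothetical task set: two unambiguous tasks and two moderately ambiguous ones.
stats = task_statistics([1.0, 0.0, 0.75, 0.25])
```

Here the individual difficulties are 1, 1, 0.25, 0.25, so the average is 0.625 while the harmonic-mean-style collective difficulty is only 0.4: the harder tasks dominate it, which is consistent with the budget being driven by the tasks that need the most labels.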

Quantized prior distribution on task difficulty. Given a distribution on ’s, let be the induced distribution on ’s. For example, if , then the induced distribution on is . Our approach requires only the knowledge of a quantized version of the distribution , namely . This quantized distribution has support at discrete values , where

(7) |

such that . We denote these values by such that for each . Then the quantized distribution is , where the probability mass for the -th partition is

which is the fraction of tasks whose difficulty is in . We use the closed interval for the last partition. In the above example, we have , , and . For notational convenience, we eliminate those partitions with zero probability mass, and re-index the quantization to get , for , such that for all . We define to be the re-indexed quantized distribution . In the above example, we finally have .

We denote the maximum and minimum probability mass in as

(8) |

Similar to the collective quality defined for the distribution in (5), we define , collective quality for the quantized distribution , which is used in our algorithm. .

Ground truth. The ground truth label of a task is naturally defined as what the majority of the crowd would agree on if we asked all the workers to label that task, i.e. $t_i \equiv \mathrm{sign}(\mathbb{E}[A_{ij} \mid q_i])$, where the expectation is with respect to the prior distribution of $p_j$ and the randomness in the response as per the generalized Dawid-Skene model in (2). Without loss of generality, we assume that the average reliability of the workers is positive, i.e. $\mathbb{E}[2p_j - 1] > 0$, and take $\mathrm{sign}(2q_i - 1)$ as the ground truth label of task $i$ conditioned on its difficulty parameter $q_i$:

$$ t_i = \mathrm{sign}(2q_i - 1). \tag{9} $$

The latent parameters $\{q_i\}$, $\{t_i\}$, and $\{p_j\}$ are unknown, and we want to infer the true labels $t_i$’s from only the $A_{ij}$’s.

Performance measure. The accuracy of the final estimate is measured by the average probability of error:

$$ \frac{1}{m} \sum_{i=1}^{m} \mathbb{P}[\, t_i \neq \hat{t}_i \,]. \tag{10} $$

We investigate the fundamental trade-off between budget and error rate by identifying the sufficient and necessary conditions on the expected budget $\Gamma$ for achieving an average probability of error of at most $\varepsilon$. Note that we are interested in achieving the best trade-off, which in turn gives the best approach for both scenarios: when we have a fixed budget constraint and want to minimize the error rate, and when we have a target error rate and want to minimize the cost.
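The budget-versus-error trade-off can be observed numerically with a small Monte Carlo sketch. It assumes the ground truth of a task with parameter q is +1 when q > 1/2 and -1 otherwise, uses a fixed number of responses per task (a non-adaptive assignment), and aggregates by plain majority vote rather than the inference algorithm of this paper. All names and parameter values are hypothetical.

```python
import random


def ground_truth(q):
    """Majority label of the crowd, assuming workers are reliable on average."""
    return 1 if q > 0.5 else -1


def average_error(qs, worker_pool, per_task, trials, rng):
    """Monte Carlo estimate of the average probability of error of majority voting."""
    errors = 0
    for _ in range(trials):
        for q in qs:
            votes = 0
            for _ in range(per_task):
                p = rng.choice(worker_pool)                 # a fresh worker for every response
                perceived = 1 if rng.random() < q else -1
                votes += perceived if rng.random() < p else -perceived
            estimate = 1 if votes >= 0 else -1
            errors += int(estimate != ground_truth(q))
    return errors / (trials * len(qs))


rng = random.Random(7)
qs = [0.9, 0.1, 0.7, 0.3]   # two easier and two harder tasks
pool = [0.9, 0.6]           # equally likely worker reliabilities
err_1 = average_error(qs, pool, per_task=1, trials=500, rng=rng)
err_9 = average_error(qs, pool, per_task=9, trials=500, rng=rng)
```

Spending nine responses per task instead of one lowers the estimated error, at nine times the budget. Most of the remaining error comes from the harder tasks, which is the inefficiency that an adaptive assignment aims to remove.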

### 1.2 Related work

The generalized Dawid-Skene model studied in this paper allows the tasks to be heterogeneous (having different difficulties) and the workers to be heterogeneous (having different reliabilities). The original Dawid-Skene (DS) model introduced in [6] and analyzed in [15] is a special case in which only the workers are allowed to be heterogeneous. All tasks have the same difficulty, with $\lambda_i = 1$ for all $i$, and $q_i$ can be either zero or one depending on the true label. Most existing work on the DS model assumes that tasks are randomly assigned and focuses only on the inference problem of finding the true labels. Several inference algorithms have been proposed [6, 30, 12, 29, 10, 13, 22, 38, 20, 34, 5, 14, 26, 2, 3, 23].

The most relevant work is [15]. It is shown there that in order to achieve a probability of error less than a small positive constant $\varepsilon$, it is necessary to have an expected budget scaling as $\Gamma = \Omega\big((m/\sigma^2)\log(1/\varepsilon)\big)$, even for the best possible inference algorithm together with the best possible task assignment scheme, including all possible adaptive task assignment schemes. Further, a simple randomized non-adaptive task assignment is proven to achieve this optimal trade-off with a novel spectral inference algorithm. Namely, an efficient task assignment and an inference algorithm are proposed that together guarantee an error of at most $\varepsilon$ with budget scaling as $\Gamma = O\big((m/\sigma^2)\log(1/\varepsilon)\big)$. As expected, this necessary and sufficient budget constraint scales linearly in $m$, the number of tasks to be labelled. The technical innovation of [15] is in designing a new spectral algorithm that achieves a logarithmic dependence on the target error rate $\varepsilon$, and in identifying $\sigma^2$ defined in (3) as the fundamental statistic of $\mathcal{P}$ that captures the collective quality of the crowd. The budget-accuracy trade-off mainly depends on the prior distribution of the crowd via the single parameter $\sigma^2$. When we have a reliable crowd with many workers having $p_j$’s close to one, the collective quality $\sigma^2$ is close to one and the required budget is small. When we have an unreliable crowd with many workers having $p_j$’s close to a half, the collective quality is close to zero and the required budget is large. However, perhaps one of the most surprising results of [15] is that the optimal trade-off is matched by a non-adaptive task assignment scheme. In other words, there is only a marginal gain in using adaptive task assignment schemes.

This negative result relies crucially on the fact that, under the standard DS model, all tasks are inherently equally difficult. As all tasks have $q_i$’s either zero or one, the individual difficulty of a task is $\lambda_i = (2q_i - 1)^2 = 1$, and a worker’s probability of making an error on one task is the same as on any other task. Hence, adaptively assigning more workers to relatively more ambiguous tasks has only a marginal gain. However, simple adaptive schemes are widely used in practice, where significant gains are achieved. In real-world systems, tasks are widely heterogeneous. Some images are much more difficult to classify (i.e. to find the true label of) than other images. To capture such varying difficulties in the tasks, generalizations of the DS model were proposed in [32, 31, 37, 28], and significant improvements have been reported on real datasets.

The generalized DS model serves as the missing piece in bridging the gap between practical gains of adaptivity and theoretical limitations of adaptivity (under the standard DS model). We investigate the fundamental question of “do adaptive task assignments improve accuracy?” under this generalized Dawid-Skene model of Eq. (2).

On the theoretical side, the original DS model was first studied in the dense regime, where all workers are assigned all tasks. A spectral method for finding the true labels was first analyzed in [10], and an EM approach with a spectral initialization step is analyzed in [34] and shown to achieve near-optimal performance. The minimax error rate of this problem was identified in [9] by analyzing the MAP estimator, which is computationally intractable.

In this paper, we are interested in a more challenging setting where each task is assigned only a small number of workers of . For a non-adaptive task assignment, a novel spectral algorithm based on the non-backtracking operator of the matrix $A$ has been analyzed under the original DS model by [13], which showed that the proposed spectral approach is near-optimal. Further, [15] showed that any adaptive task assignment scheme will have only a marginal improvement in the error rate under the original DS model. Hence, there is no significant gain in adaptivity under that model.

One of the main weaknesses of the DS model is that it does not capture how some tasks are more difficult than the others. To capture such heterogeneity in the tasks, several practical models have been proposed recently [12, 32, 31, 37, 11]. Although such models with more parameters can potentially better describe real-world datasets, there is no analysis on their performance under adaptive or non-adaptive task assignments. We do not have the analytical tools to understand the fundamental trade-offs involved in those models yet. In this work, we close this gap by providing a theoretical analysis of one of the generalizations of the DS model, namely the one proposed in [37]. It captures the heterogeneous difficulties in the tasks, while remaining simple enough for theoretical analyses.

### 1.3 Contributions

To investigate the gain of adaptivity, we first characterize the fundamental lower bound on the budget required to achieve a target accuracy. To match this fundamental limit, we introduce a novel adaptive task assignment scheme. The proposed adaptive task assignment is simple to apply in practice, and numerical simulations confirm its superiority over state-of-the-art non-adaptive schemes. Under certain assumptions on the choice of parameters in the algorithm, which require moderate access to an oracle, we prove that the performance of the proposed adaptive scheme matches the fundamental limit up to a constant factor. Finally, we quantify the gain of adaptivity by proving a strictly larger lower bound on the budget required for any non-adaptive scheme to achieve a desired error rate of $\varepsilon$ for some small positive $\varepsilon$.

Precisely, we show that the minimax rate on the budget required to achieve a target average error rate of $\varepsilon$ scales as $\Theta\big((m/(\lambda\sigma^2))\log(1/\varepsilon)\big)$. The dependence on the priors $\mathcal{P}$ and $\mathcal{Q}$ is captured solely by $\sigma^2$ (the quality of the crowd as a whole) and $\lambda$ (the quality of the tasks as a whole). We show that the fundamental trade-off for non-adaptive schemes is $\Theta\big((m/(\lambda_{\min}\sigma^2))\log(1/\varepsilon)\big)$, requiring a factor of $\lambda/\lambda_{\min}$ larger budget for non-adaptive schemes. This factor of $\lambda/\lambda_{\min}$ is always at least one and quantifies precisely how much we gain by adaptivity.
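The claim that the adaptivity factor is at least one can be checked in a few lines. The derivation below is a sketch under assumptions inferred from the surrounding text rather than quoted from it: the individual difficulty is $\lambda_i = (2q_i - 1)^2$, the collective difficulty is $\lambda = (\mathbb{E}[1/\lambda_i])^{-1}$, and $\lambda_{\min} > 0$ is the smallest individual difficulty in the support.

```latex
\begin{align*}
\lambda_i \ge \lambda_{\min} \ \text{for all } i
  &\;\Longrightarrow\; \frac{1}{\lambda_i} \le \frac{1}{\lambda_{\min}} \\
  &\;\Longrightarrow\; \mathbb{E}\Big[\frac{1}{\lambda_i}\Big] \le \frac{1}{\lambda_{\min}} \\
  &\;\Longrightarrow\; \lambda = \Big(\mathbb{E}\Big[\frac{1}{\lambda_i}\Big]\Big)^{-1} \ge \lambda_{\min}
  \;\Longrightarrow\; \frac{\lambda}{\lambda_{\min}} \ge 1 .
\end{align*}
```

Equality holds only when all tasks are equally difficult, which is the standard DS setting where adaptivity is known to bring no significant gain. For instance, if half of the tasks have $\lambda_i = 1$ and half have $\lambda_i = 1/4$, then $\lambda = 2/5$ and the factor is $8/5$.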

### 1.4 Outline and notations

We present a list of notations and their definitions in Table 1. In Section 2, we present the fundamental lower bound on the necessary budget to achieve a target average error rate of $\varepsilon$, together with a novel adaptive approach that achieves this lower bound up to a constant. In comparison, we provide the fundamental lower bound on the necessary budget for non-adaptive approaches in Section 3, and we present a non-adaptive approach that achieves this fundamental limit. In Section 4, we give a spectral interpretation of our approach justifying the proposed inference algorithm, leading to a parameter estimation algorithm that serves as a building block in the main approach of Algorithm 1. As our proposed sub-routine using Algorithm 2 suffers when the budget is critically limited (known as the spectral barrier in Section 4), we present in Section 5 another algorithm that can substitute for Algorithm 2 and compare their performances. The proofs of the main results are provided in Section 6. We conclude with future research directions in Section 7.

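notation | data type | definition |
---|---|---|
$m$ | | the number of tasks |
$n$ | | total number of workers recruited |
$A$ | $\{0,+1,-1\}^{m \times n}$ | labels collected from the workers |
$\Gamma$ | | budget used in collecting $A$, i.e. the number of nonzero entries in $A$ |
 | | average budget per task |
$\Gamma_\varepsilon$ | | the budget required to achieve error at most $\varepsilon$ |
$\mathcal{T}_\Gamma$ | | set of task assignment schemes using at most $\Gamma$ queries in expectation |
$i$ | $[m]$ | index for tasks |
$j$ | $[n]$ | index for workers |
 | subset of $[n]$ | a set of workers assigned to task $i$ |
$T_j$ | subset of $[m]$ | a set of tasks assigned to worker $j$ |
$q_i$ | $[0,1]$ | quality parameter of task $i$ |
$t_i$ | $\{+1,-1\}$ | ground truth label of task $i$ |
$\hat{t}_i$ | $\{+1,-1\}$ | estimated label of task $i$ |
$p_j$ | $[0,1]$ | quality parameter of worker $j$ |
$\mathcal{P}$ | | prior distribution of $p_j$ |
$\mathcal{Q}$ | | prior distribution of $q_i$ |
 | | prior distribution of $\lambda_i$ induced from $\mathcal{Q}$ |
 | | quantized version of the distribution |
$\mu$ | $[-1,1]$ | average reliability of the crowd as per $\mathcal{P}$: $\mathbb{E}_{\mathcal{P}}[2p_j - 1]$ |
$\sigma^2$ | $[0,1]$ | collective reliability of the crowd as per $\mathcal{P}$: $\mathbb{E}_{\mathcal{P}}[(2p_j - 1)^2]$ |
$\lambda_i$ | $[0,1]$ | individual difficulty level of task $i$: $(2q_i - 1)^2$ |
$\lambda_{\min}$ | $[0,1]$ | worst-case difficulty as per $\mathcal{Q}$: $\min_i \lambda_i$ |
$\lambda_{\max}$ | $[0,1]$ | best-case difficulty as per $\mathcal{Q}$: $\max_i \lambda_i$ |
$\lambda$ | $[0,1]$ | collective difficulty level of the tasks as per $\mathcal{Q}$: $(\mathbb{E}_{\mathcal{Q}}[1/(2q_i - 1)^2])^{-1}$ |
 | | collective difficulty level of the tasks as per the quantized distribution |
$\rho$ | $[0,1]$ | average difficulty of tasks as per $\mathcal{Q}$ |
 | | index for support points of the quantized distribution |
 | | difficulty level of a support point of the quantized distribution |
 | | probability mass at a support point of the quantized distribution |
 | | minimum probability mass in the quantized distribution |
 | | maximum probability mass in the quantized distribution |
 | | number of rounds in Algorithm 1 |
$t$ | | index for a round in Algorithm 1 |
 | | number of sub-rounds in round $t$ of Algorithm 1 |
 | | index for a sub-round of Algorithm 1 |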

## 2 Main Results under the Adaptive Scenario

In this section, we present our main results under the adaptive task assignment scenario.

### 2.1 Fundamental limit under the adaptive scenario

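With a slight abuse of notation, we let $\hat{t}$ be a mapping from $A \in \{0,+1,-1\}^{m \times n}$ to $\hat{t}(A) \in \{+1,-1\}^m$ representing an inference algorithm outputting the estimates of the true labels. We drop $A$ and write only $\hat{t}$ whenever it is clear from the context. We let $\mathcal{P}_{\sigma^2}$ be the set of all the prior distributions on $p_j$ such that the collective worker quality is $\sigma^2$, i.e.

$$ \mathcal{P}_{\sigma^2} \equiv \Big\{ \mathcal{P} \;\Big|\; \mathbb{E}_{\mathcal{P}}[\, (2p_j - 1)^2 \,] = \sigma^2 \Big\}. \tag{11} $$

We let $\mathcal{Q}_{\lambda}$ be the set of all the prior distributions on $q_i$ such that the collective task difficulty is $\lambda$, i.e.

$$ \mathcal{Q}_{\lambda} \equiv \Big\{ \mathcal{Q} \;\Big|\; \Big( \mathbb{E}_{\mathcal{Q}}\Big[ \frac{1}{(2q_i - 1)^2} \Big] \Big)^{-1} = \lambda \Big\}. \tag{12} $$

We consider all task assignment schemes in $\mathcal{T}_\Gamma$, the set of all task assignment schemes that make at most $\Gamma$ queries to the crowd in expectation. We prove a lower bound on the standard minimax error rate: the error that is achieved by the best inference algorithm $\hat{t}$ using the best adaptive task assignment scheme in $\mathcal{T}_\Gamma$ under a worst-case worker parameter distribution $\mathcal{P} \in \mathcal{P}_{\sigma^2}$ and a worst-case task parameter distribution $\mathcal{Q} \in \mathcal{Q}_{\lambda}$. A proof of this theorem is provided in Section 6.1.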

###### Theorem 2.1.

For , there exists a positive constant such that the average probability of error is lower bounded by

(13) |

where $m$ is the number of tasks, $\Gamma$ is the expected budget allowed in $\mathcal{T}_\Gamma$, $\lambda$ is the collective difficulty of the tasks from a prior distribution $\mathcal{Q}$ defined in (5), and $\sigma^2$ is the collective reliability of the crowd from a prior distribution $\mathcal{P}$ defined in (3).

In the proof, we provide a proof of a slightly stronger statement in Lemma 6.1, where a similar lower bound holds for not only the worst-case but for all . One caveat is that there is now an extra additive term in the error exponent in the RHS of the lower bound that depends on , which is subsumed in the constant term for the worst-case in the RHS of (13). We are assigning $\Gamma/m$ queries per task on average, and it is intuitive that the error decays exponentially in $\Gamma/m$. The novelty in the above analysis is that it characterizes how the error exponent depends on $\sigma^2$, which determines the quality of the crowd you have in your crowdsourcing platform, and on $\lambda$, which determines the quality of the tasks you have at hand. If we have easier tasks and more reliable workers, the error rate should be smaller. Eq. (13) shows that this is captured by the error exponent scaling linearly in $\lambda\sigma^2$. This gives a lower bound (i.e. a necessary condition) on the budget required to achieve error at most $\varepsilon$: there exists a constant $C'$ such that if the total budget is

$$ \Gamma \le C' \frac{m}{\lambda\sigma^2} \log\frac{1}{\varepsilon}, \tag{14} $$

then no task assignment scheme (adaptive or not) with any inference algorithm can achieve error less than $\varepsilon$. This recovers the known fundamental limit for the standard DS model of [15], where all tasks have $\lambda_i = 1$ and hence $\lambda = 1$. For this standard DS model, it is known that there exists a constant $C''$ such that if the total budget is less than

$$ C'' \frac{m}{\sigma^2} \log\frac{1}{\varepsilon}, $$
then no task assignment with any inference algorithm can achieve error rate less than . For example, consider two types of prior distributions where in one we have the original DS tasks with and in the other we have . We have under and under . Our analysis, together with the matching upper bound in the following section, shows that one needs times more budget to achieve the same accuracy under the tasks from .

### 2.2 Upper bound on the achievable error rate

We present an adaptive task assignment scheme and an iterative inference algorithm that asymptotically achieve an error rate of , when the number of tasks grows large and the expected budget is increasing as where and is a constant that only depends on . This matches the lower bound in (13) when and are . Comparing it to a fundamental lower bound in Theorem 2.1 establishes the near-optimality of our approach, and the sufficient condition to achieve average error is for the average total budget to be larger than,

(15) |

Our proposed adaptive approach in Algorithm 1 takes as input the number of tasks , a target budget , hyper parameter to be determined by our theoretical analyses in Theorem 2.2, the quantized prior distribution , the statistics and on the worker prior . The proposed scheme makes at most queries in expectation to the crowd and outputs the estimated labels ’s for all the tasks .

#### 2.2.1 The proposed adaptive approach: overview.

At a high level, our approach works in rounds indexed by , the support size of the quantized distribution , and sub-rounds at each round , where is chosen by the algorithm in line 5. In each sub-round, we perform both task assignment and inference, sequentially. Guided by the inference algorithm, we permanently label a subset of the tasks and carry over the remaining ones to subsequent sub-rounds. Inference is done in line 11 to get a confidence score ’s on the tasks , where is the set of tasks that are remaining to be labelled at the current sub-round. The adaptive task assignment of our approach is entirely managed by the choice of this set in line 21, as only those tasks in will be assigned new workers in the next sub-round in lines 9 and 10.

At each round, we choose how many responses to collect for each task present in that round as prescribed by our theoretical analysis. Given this choice of , the number of responses collected for each task at round $t$, we repeat the key inner-loop in lines 9-21 of Algorithm 1. In round the sub-round is repeated times to ensure that a sufficient number of ‘easy’ tasks are classified. Given a set of remaining tasks to be labelled, the sub-round collects response per task on those tasks in and runs an inference algorithm (Algorithm 2) to give confidence scores ’s to all . Our theoretical analysis prescribes a choice of a threshold to be used in round sub-round . All tasks in with confidence score larger than are permanently labelled as positive tasks, and those with confidence score less than are permanently labelled as negative tasks. Those permanently labelled tasks are referred to as ‘classified’ and removed from the set . The remaining tasks with confidence scores between and are carried over to the next sub-round. The confidence scores are designed such that the sign of provides the estimated true label, and we are more confident about this estimated label if the absolute value of the score is larger. The art is in choosing the appropriate number of responses to be collected for each task and the threshold , and our theoretical analyses, together with the provided statistics of the prior distribution , and the prior quantized distribution , allow us to choose the ones that achieve a near-optimal performance.

Note that we are mixing inference steps and task assignment steps. Within each sub-round, we are performing both task assignment and inference. Further, the inner-loop within itself uses a non-adaptive task assignment, and hence our approach is a series of non-adaptive task assignments with inference in each sub-round. However, Algorithm 1 is an adaptive scheme, where the adaptivity is fully controlled by the set of remaining unclassified tasks . We are adaptively choosing which tasks to carry over in the set based on all the responses we have collected thus far, and we are assigning more workers to only those tasks in .

Since difficulty levels vary across tasks, it is intuitive to assign fewer workers to easy tasks and more workers to hard tasks. Supposing that we know the difficulty levels ’s, we could choose to assign the ideal number of workers to each task according to ’s. However, the difficulty levels are not known. The proposed approach starts with a smaller budget in the first round, classifying easier tasks, and carries over the more difficult tasks to later rounds, where more budget per task is assigned.
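The overall round structure can be illustrated with a toy simulation. This is a sketch under stated assumptions, not Algorithm 1: the confidence score is the plain sum of the ±1 responses (a majority-vote style score swapped in for the message-passing scores of Algorithm 2), every worker has the same hypothetical reliability `worker_acc`, a response is assumed to be +1 with probability `worker_acc * q + (1 - worker_acc) * (1 - q)` for a task with parameter `q`, and the per-round budgets, thresholds and sub-round counts are arbitrary illustrative values rather than the choices prescribed by the analysis.

```python
import numpy as np

def run_adaptive(q, responses_per_round, thresholds, sub_rounds,
                 worker_acc=0.8, seed=0):
    """Toy adaptive labelling loop (illustrative sketch only).

    q : per-task probability that a perfectly reliable worker answers +1
        (assumed model; tasks with q near 1/2 are the hard ones).
    Returns (labels, total_queries); a label of 0 means never classified.
    """
    rng = np.random.default_rng(seed)
    q = np.asarray(q, dtype=float)
    labels = np.zeros(q.size, dtype=int)
    remaining = np.arange(q.size)        # indices of unclassified tasks
    queries = 0
    for ell, thr, s in zip(responses_per_round, thresholds, sub_rounds):
        for _ in range(s):
            if remaining.size == 0:
                break
            # Fresh responses are collected only for the remaining tasks.
            p_plus = worker_acc * q[remaining] + (1 - worker_acc) * (1 - q[remaining])
            n_plus = rng.binomial(ell, p_plus)
            scores = 2 * n_plus - ell    # sum of ell responses in {+1, -1}
            queries += ell * remaining.size
            decided = np.abs(scores) > thr
            labels[remaining[decided]] = np.sign(scores[decided])
            remaining = remaining[~decided]
    return labels, queries

# Half easy tasks, half hard tasks; the last round uses threshold zero
# and an odd number of responses, so every task ends up classified.
q = np.array([0.95] * 50 + [0.6] * 50)
labels, queries = run_adaptive(q, [3, 9, 21], [2, 4, 0], [2, 2, 1])
```

In this toy run the easy tasks tend to leave the pool in the cheap early rounds, so the larger per-task budgets of the later rounds are spent mostly on the hard tasks.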

#### 2.2.2 The proposed adaptive approach: precise.

More precisely, given a budget and the statistics of , and the known quantized distribution we know what target probability of error to aim for, say , from Theorem 2.2. The main idea behind our approach is to allocate the given budget over multiple rounds appropriately, and at each round get an estimate of the labels of the remaining tasks in and also the confidence scores, such that with an appropriate choice of the threshold those tasks we choose to classify in the current round achieve the desired target error rate of . As long as this guarantee holds at each round for all classified tasks, then the average error rate will also be bounded by when the process terminates eventually. The only remaining issue is how many queries are made in total when this process terminates. We guarantee that in expectation at most queries are made under our proposed choices of ’s and ’s in the algorithm.

At round zero, we initially put all the tasks in . A fraction of tasks are permanently labelled in each round and the un-labelled ones are taken to the next round. At round , our goal is to classify a sufficient fraction of those tasks in the -th difficulty group with the desired level of accuracy. The art is in choosing the right number of responses to be collected per task for that round and also the right threshold on the confidence score, to be used in the inner-loop in lines 9-21 of Algorithm 1. If is too low and/or the threshold is too small, then the misclassification rate will be too large. If is too large and/or is too large, we are wasting our budget and achieving unnecessarily high accuracy on those tasks classified in the current round, and not enough tasks will be classified in that round. We choose and appropriately to ensure that the misclassification probability is at most based on our analysis (see (54)) of the inner-loop. We run the identical sub-rounds times to ensure that a sufficient fraction of tasks with difficulty are classified. Precisely, the choice of ensures that the expected number of tasks with difficulty remaining unclassified after the -th round is at most equal to the number of tasks in the next group, i.e., difficulty level .
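To illustrate how repeating identical sub-rounds shrinks a difficulty group, consider the following back-of-the-envelope sketch. It is not the formula used in Algorithm 1: we assume, purely for illustration, that each sub-round independently classifies any remaining task of the current group with probability at least `p_classify`, so the expected number left after `s` sub-rounds is at most `n_current * (1 - p_classify)**s`. The function name and all arguments are hypothetical.

```python
def sub_rounds_needed(n_current, n_next, p_classify):
    """Smallest s with n_current * (1 - p_classify)**s <= n_next.

    n_current  : number of tasks in the current difficulty group
    n_next     : number of tasks in the next (harder) group, must be > 0
    p_classify : assumed lower bound on the per-sub-round probability
                 that a remaining task of the current group is classified
    """
    if n_next <= 0:
        raise ValueError("n_next must be positive")
    if not 0.0 < p_classify <= 1.0:
        raise ValueError("p_classify must lie in (0, 1]")
    expected_left = float(n_current)
    s = 0
    while expected_left > n_next:
        expected_left *= 1.0 - p_classify   # geometric decay per sub-round
        s += 1
    return s

# 800 tasks in the current group, 100 in the next, and a one-half chance
# of being classified per sub-round: 800 -> 400 -> 200 -> 100.
s = sub_rounds_needed(800, 100, 0.5)
```

Under this assumption the number of sub-rounds grows only logarithmically in the ratio of the two group sizes.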

Note that statistically, the fraction of the -th group (i.e. tasks with difficulty ) that get classified before the -th round is very small as the threshold set in these rounds is more than their absolute mean message. Most tasks in the -th group will get classified in round . Further, the proposed pre-processing step of binning the tasks ensures that . This ensures that the total extraneous budget spent on the -th group of tasks is not more than a constant times the allocated budget on those tasks.

The main algorithmic component is the inner-loop in lines 9-21 of Algorithm 1. For a choice of the (per task) budget , we collect responses according to a -regular random graph on tasks and workers. The leading eigen-vector of the non-backtracking operator on this bipartite graph, weighted by the responses, reveals a noisy observation of the true class and the difficulty levels of the tasks. Let denote this top left eigenvector, computed as per the message-passing algorithm of Algorithm 2. Then the -th entry asymptotically converges in the large number of tasks limit to a Gaussian random variable with mean proportional to the difficulty level, with mean and variance specified in Lemma 6.3. This non-backtracking operator approach to crowdsourcing was first introduced in [13] for the standard DS model. We generalize their analysis to this generalized DS model in Theorem 3.1 for the finite sample regime, and further give a sharper characterization based on the central limit theorem in the asymptotic regime (Lemma 6.3). For a detailed explanation of Algorithm 2 and its analyses, we refer to Section 3.

#### 2.2.3 Justification of the choice of and .

The main idea behind our approach is to allocate a target budget to each -th task according to its quantized difficulty where is such that . Given a total budget and the quantized distribution which gives the collective difficulty of tasks (line , Algorithm 1), we aim to assign workers to a task of quantized difficulty . This choice of the target budget is motivated by the proof of the lower bound in Theorem 2.1. If we had identified the tasks with respect to their difficulty then the near-optimal choice of the budget that achieves the lower bound is given in (43). Our target budget is a simplified form of the near-optimal choice and ignores the constant part that does not depend upon the total budget. This choice of the budget would give an equal probability of misclassification for the tasks of varying difficulties. We refer to this error rate as the desired probability of misclassification. As we do not know which tasks belong to which quantized difficulty group , a factor of is needed to compensate for the extra budget needed to infer those difficulty levels. This justifies our choice of budget in line of Algorithm 1.

From our theoretical analysis of the inner loop, we know the probability of misclassification for a task that belongs to difficulty group as a function of the classification threshold and the budget that is assigned to it. Therefore, in each round we set the classification threshold such that even the possibly most difficult task achieves the desired probability of misclassification. This choice of is provided in line of Algorithm 1.

#### 2.2.4 Numerical experiments.

In Figure 1, we compare the performance of our algorithm with majority voting and also a non-adaptive version of our Algorithm 1, where we assign to each task number of workers in one round and set classification threshold so as to classify all the tasks (choosing and ). Since this performs the non-adaptive inner-loop once, this is a non-adaptive algorithm, and has been introduced for the standard DS model in [15].
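For reference, the majority voting baseline can be sketched in a few lines. This is a generic illustration rather than the exact experimental code: the function name `majority_vote`, the encoding of responses as +1 and -1 with 0 for a missing response, and the tie-breaking rule toward +1 are all assumptions.

```python
import numpy as np

def majority_vote(responses):
    """Majority-vote labels from a (tasks x workers) response matrix.

    Entries are +1 or -1, with 0 meaning that the worker was not
    assigned the task (assumed encoding). Ties are broken toward +1,
    which is an arbitrary choice.
    """
    totals = np.asarray(responses).sum(axis=1)   # net vote per task
    return np.where(totals >= 0, 1, -1)

votes = majority_vote([[1, 1, -1],
                       [-1, -1, 1],
                       [1, -1, 0]])
```

Majority voting weighs every response equally, so it cannot discount workers who answer at random.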

For numerical experiments, we make a slight modification to our proposed Algorithm 1. In the final round, when the classification threshold is set to zero, we include all the responses collected thus far when running the message passing Algorithm 2, and not just the fresh samples collected in that round. This creates dependencies between rounds, which makes the analysis challenging. However, in practice we see improved performance and it allows us to use the given fixed budget efficiently.

We run synthetic experiments with and fix for the non-adaptive version. The crowds are generated from the spammer-hammer model where a worker is a hammer () with probability and a spammer () otherwise. In the left panel, we take difficulty level to be uniformly distributed over , that gives . In the right panel, we take with probability , otherwise we take it to be or with equal probability, that gives . Our adaptive algorithm improves significantly over its non-adaptive version, and our main results in Theorems 2.2 and 3.1 predict such a gain of adaptivity. In particular, for the left panel, the non-adaptive algorithm’s error scaling depends on the smallest , that is , while for the adaptive algorithm it scales with . In the left figure, it can be seen that the adaptive algorithm requires approximately a factor of fewer queries to achieve the same error as the non-adaptive scheme. For example, the non-adaptive version of Algorithm 1 requires to achieve error rate , whereas the adaptive approach only requires . Quantifying such a gap is one of our main results in Theorems 2.2 and 3.1. This gap widens in the right panel to approximately as predicted. For a fair comparison with the non-adaptive version, we fix the total budget to be and assign workers in each round until the budget is exhausted, such that we are strictly using budget at most deterministically.

#### 2.2.5 Performance Guarantee

Algorithm 1 is designed in such a way that we are not wasting any budget on any of the tasks; we are not getting unnecessarily high accuracy on easier tasks, which is the root cause of inefficiency for non-adaptive schemes. In order to achieve this goal, the internal parameter computed in line 12 of Algorithm 1 has to satisfy , which is the average difficulty of the remaining tasks. Such a choice is important in choosing the right threshold .

As the set of remaining tasks is changing over the course of the algorithm, we need to estimate this value in each sub-routine. We provide an estimator of in Algorithm 3 that only uses the sampled responses that are already collected. All numerical results are based on this estimator. However, analyzing the sensitivity of the performance with respect to the estimation error in is quite challenging, and for a theoretical analysis, we assume we have access to an oracle that provides the exact value of , replacing Algorithm 3.

###### Theorem 2.2.

Suppose Algorithm 3 returns the exact value of . With the choice of , for any given quantized prior distribution of task difficulty such that and , and the budget , the expected number of queries made by Algorithm 1 is asymptotically bounded by

where is the number of tasks remaining unclassified in the sub-round, and is the pre-determined number of workers assigned to each of these tasks in that round. Further, Algorithm 1 returns estimates that asymptotically achieve,

(16) |

if , where , and

(17) |

if .

A proof of this theorem is provided in Section 6.4. In this theoretical analysis, we are considering a family of problem parameters in an increasing number of tasks . All the problem parameters , , and can vary as functions of . For example, consider a family of independent of and . As grows, most of the workers are spammers giving completely random answers. In this setting, we can ask how the budget should grow with , in order to achieve a target accuracy of, say, . We have and , indicating that the collective difficulty is constant but the collective quality of the workers is decreasing in . It is a simple calculation to show that and in this case, and the above theorem proves that is sufficient to achieve the desired error rate. Further, such dependence of the budget in is also necessary, as follows from our lower bound in Theorem 2.1.

Consider now a scenario where we have tasks with increasing difficulties in . For example, and . We have and . It follows from simple calculations that and . It follows that it is sufficient and necessary to have budget scaling in this case as