Inference Aided Reinforcement Learning for Incentive Mechanism Design in Crowdsourcing

06/01/2018 ∙ by Zehong Hu, et al. ∙ Harvard University ∙ Nanyang Technological University

Incentive mechanisms for crowdsourcing are designed to incentivize financially self-interested workers to generate and report high-quality labels. Existing mechanisms are often developed as one-shot static solutions, assuming a certain level of knowledge about worker models (expertise levels, costs of exerting efforts, etc.). In this paper, we propose a novel inference aided reinforcement mechanism that learns to incentivize high-quality data sequentially and requires no such prior assumptions. Specifically, we first design a Gibbs sampling augmented Bayesian inference algorithm to estimate workers' labeling strategies from the collected labels at each step. Then we propose a reinforcement incentive learning (RIL) method, building on top of the above estimates, to uncover how workers respond to different payments. RIL dynamically determines the payment without accessing any ground-truth labels. We theoretically prove that RIL is able to incentivize rational workers to provide high-quality labels. Empirical results show that our mechanism performs consistently well under both rational and non-fully rational (adaptive learning) worker models. Besides, the payments offered by RIL are more robust and have lower variances compared to the existing one-shot mechanisms.




1 Introduction

The ability to quickly collect large-scale and high-quality labeled datasets is crucial for Machine Learning (ML). Among all proposed solutions, one of the most promising options is crowdsourcing Howe2006 ; slivkins2014online ; difallah2015dynamics ; simpson2015language . Nonetheless, it has been noted that crowdsourced data often suffers from quality issues, because workers’ contributions are neither monitored nor verified against ground truth. This quality control challenge has been addressed by two relatively disconnected research communities. On the ML side, quite a few inference techniques have been developed to infer true labels from crowdsourced and potentially noisy labels raykar2010learning ; liu2012variational ; zhou2014aggregating ; zheng2017truth . These solutions often work as one-shot, post-processing procedures facing a static set of workers, whose labeling accuracy is fixed and informative. Despite their empirical success, the aforementioned methods ignore the effects of incentives when dealing with human inputs. It has been observed both in theory and practice that, without appropriate incentives, selfish and rational workers tend to contribute low-quality, uninformative, if not malicious data sheng2008get ; liu2017sequential . Existing inference algorithms are very vulnerable in these cases: either many more redundant labels are needed (low-quality inputs), or the methods simply fail to work (uninformative or malicious inputs).

On the mechanism design side, the above quality control question has been studied in the context of incentive mechanism design. In particular, a family of mechanisms, jointly referred to as peer prediction, has been proposed prelec2004bayesian ; jurca2009mechanisms ; witkowski2012peer ; dasgupta2013crowdsourced . Existing peer prediction mechanisms focus on achieving incentive compatibility (IC), which means that truthfully reporting private data, or reporting high-quality data, maximizes workers’ expected utilities. These mechanisms achieve IC by comparing the reports from the to-be-scored worker against those from a randomly selected reference worker, which bypasses the challenge of no ground-truth verification. However, we note several undesirable properties of these methods. Firstly, from a learning perspective, collected labels contain rich information about the ground-truth labels and workers’ labeling accuracy. Existing peer prediction mechanisms often rely on reported data from a small subset of reference workers, which only represents a limited share of the overall collected information. In consequence, the mechanism designer misses the opportunity to leverage learning methods to generate a more credible and informative reference answer for the purpose of evaluation. Secondly, existing peer prediction mechanisms often require a certain level of prior knowledge about workers’ models, such as the cost of exerting efforts, and their labeling accuracy when exerting different levels of efforts. However, this prior knowledge is difficult to obtain in real environments. Thirdly, they often assume workers are all fully rational and always follow the utility-maximizing strategy. In practice, however, workers may adapt their strategies dynamically.

In this paper, we propose an inference-aided reinforcement mechanism, aiming to merge and extend the techniques from both inference and incentive design communities to address the caveats when they are employed alone, as discussed above. The high level idea is as follows: we collect data in a sequential fashion. At each step, we assign workers a certain number of tasks and estimate the true labels and workers’ strategies from their labels. Relying on the above estimates, a reinforcement learning (RL) algorithm is proposed to uncover how workers respond to different levels of offered payments. The RL algorithm determines the payments for the workers based on the information collected to date. By doing so, our mechanism not only incentivizes (non-)rational workers to provide high-quality labels but also dynamically adjusts the payments according to workers’ responses to maximize the data requester’s cumulative utility. Applying standard RL solutions here is challenging because both the states (workers’ labeling strategies) and the reward (the aggregated label accuracy) are unobservable, which in turn stems from the lack of ground-truth labels. Leveraging standard inference methods seems to be a plausible solution at first sight (for the purposes of estimating both the states and reward), but we observe that existing methods tend to over-estimate the aggregated label accuracy, which would mislead the RL algorithm built on top of them.

We address the above challenges and make the following contributions: (1) We propose a Gibbs sampling augmented Bayesian inference algorithm, which estimates workers’ labeling strategies and the aggregated label accuracy, as done in most existing inference algorithms, but significantly lowers the estimation bias of labeling accuracy. This lays a strong foundation for constructing correct reward signals, which are extremely important if one wants to leverage reinforcement learning techniques. (2) A reinforcement incentive learning (RIL) algorithm is developed to maximize the data requester’s cumulative utility by dynamically adjusting incentive levels according to workers’ responses to payments. (3) We prove that our Bayesian inference algorithm and RIL algorithm are incentive compatible (IC) at each step and in the long run, respectively. (4) Experiments are conducted to test our mechanism, and they show that it performs consistently well under different worker models. Meanwhile, compared with the state-of-the-art peer prediction solutions, our Bayesian inference aided mechanism can improve the robustness and lower the variances of payments.

2 Related Work

Our work is inspired by the following three lines of literature:

Peer Prediction: This line of work, addressing the incentive issues of eliciting high quality data without verification, starts roughly with the seminal ones prelec2004bayesian ; gneiting2007strictly . A series of follow-ups have relaxed various assumptions that have been made jurca2009mechanisms ; witkowski2012peer ; radanovic2013robust ; dasgupta2013crowdsourced .

Inference method: Recently, inference methods have been applied to crowdsourcing settings, aiming to uncover the true labels from multiple noisy copies. Notable successes include the EM method dawid1979maximum ; raykar2010learning ; zhang2014spectral , Variational Inference liu2012variational ; chen2015statistical and Minimax Entropy Inference zhou2012learning ; zhou2014aggregating . Besides, Zheng et al. zheng2017truth provide a good survey of the existing methods.

Reinforcement Learning: Over the past two decades, reinforcement learning (RL) algorithms have been proposed to iteratively improve the acting agent’s learned policy Watkins92 ; Tesauro95 ; Sutton98 ; Gordon00 ; Szepesvari10 . More recently, with the help of advances in feature extraction and state representation, RL has made several breakthroughs in achieving human-level performance in challenging domains Mnih15 ; Liang16 ; Hasselt2016DeepRL ; Silver17 . Meanwhile, many studies successfully deploy RL to address some societal problems Yu2013EmotionalMR ; Leibo2017 . RL has also helped make progress in human-agent collaboration engel2005reinforcement ; gasic2014gaussian ; Sadhu2016ArgusSH ; Wang2017 .

Our work differs from the above literature in the connection between incentive mechanisms and ML. Only a few recent studies share a similar research focus. For example, to improve the data requester’s utility in crowdsourcing settings, Liu and Chen liu2017sequential develop a multi-armed bandit algorithm to adjust the state-of-the-art peer prediction mechanism DG13 dasgupta2013crowdsourced to a prior-free setting. Nonetheless, the results in that work require workers to follow a Nash Equilibrium at each step in a sequential setting, which is hard to achieve in practice. Instead of randomly choosing a reference worker as commonly done in peer prediction, Liu and Chen liu2017machine propose to use supervised learning algorithms to generate the reference reports and derive the corresponding IC conditions. However, these reports need to be based on the contextual information of the tasks. By contrast, in this paper, without assuming the contextual information about the tasks, we use Bayesian inference to learn workers’ states and true labels, which leads to an unsupervised-learning solution.

3 Problem Formulation

This paper considers the following data acquisition problem via crowdsourcing: at each discrete time step , a data requester assigns tasks with binary answer space to candidate workers to label. Workers receive payments for submitting a label for each task. We use to denote the label worker generates for task at time . For simplicity of computation, we reserve if is not assigned to . Furthermore, we use and to denote the set of ground-truth labels and the set of all collected labels respectively.

The generated label depends both on the latent ground-truth and worker ’s strategy, which is mainly determined by two factors: exerted effort level (high or low) and reporting strategy (truthful or deceitful). Accommodating the notation commonly used in reinforcement learning, we also refer to worker ’s strategy as his/her internal state. At any given time, workers may at will adopt an arbitrary combination of effort level and reporting strategy. Specifically, we define and as worker ’s probability of exerting high efforts and reporting truthfully for task , respectively. Furthermore, we use and to denote worker ’s probability of observing the true label when exerting high and low efforts, respectively. Correspondingly, we denote worker ’s cost of exerting high and low efforts by and , respectively. For the simplicity of analysis, we assume that and . All the above parameters and workers’ actions stay unknown to our mechanism. In other words, we regard workers as black-boxes, which distinguishes our mechanism from the existing peer prediction mechanisms.

Worker ’s probability of being correct (PoBC) at time for any given task is given as
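The PoBC can be sketched directly from the definitions above by conditioning on the effort level and the reporting strategy. The following is a minimal sketch, not the paper's own notation: the function and argument names are illustrative, and it assumes that a deceitful report on a binary task simply flips the observed label.

```python
def prob_being_correct(p_effort, p_truthful, acc_high, acc_low):
    """Probability that a worker's reported binary label matches the truth.

    p_effort   -- probability of exerting high effort
    p_truthful -- probability of reporting the observed label truthfully
    acc_high   -- probability of observing the true label under high effort
    acc_low    -- probability of observing the true label under low effort
    (All names are illustrative assumptions.)
    """
    # Probability that the worker observes the true label.
    p_observe = p_effort * acc_high + (1.0 - p_effort) * acc_low
    # A truthful report is correct when the observation is correct; a flipped
    # report is correct when the observation is wrong (binary labels only).
    return p_truthful * p_observe + (1.0 - p_truthful) * (1.0 - p_observe)
```

For instance, a worker who always exerts high effort with observation accuracy 0.9 has a PoBC of 0.9 when always truthful and 0.1 when always deceitful.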


Suppose we assign tasks to worker at step . Then, a risk-neutral worker’s utility satisfies:


where denotes our payment to worker for task at time (see Section 4 for more details).
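As a rough illustration of the risk-neutral utility, one can subtract the expected effort cost from the total payment received in a step. This sketch assumes per-task effort costs and an effort probability that is the same for every task in the step; the names and the default zero cost of low effort are assumptions made for the example.

```python
def worker_utility(payments, p_effort, cost_high, cost_low=0.0):
    """Expected utility of a risk-neutral worker for one time step.

    payments  -- per-task payments received in this step
    p_effort  -- probability of exerting high effort on each task
    cost_high -- cost of high effort per task
    cost_low  -- cost of low effort per task (zero by default, an assumption)
    """
    n_tasks = len(payments)
    expected_cost = n_tasks * (p_effort * cost_high + (1.0 - p_effort) * cost_low)
    return sum(payments) - expected_cost
```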

At the beginning of each step, the data requester and workers agree to a certain rule of payment, which is not changed until the next time step. The workers are self-interested and may choose their strategies in labeling and reporting according to the expected utility he/she can get. After collecting the generated labels, the data requester infers the true labels by running a certain inference algorithm. The aggregated label accuracy and the data requester’s utility are defined as follows:


where is a non-decreasing monotonic function mapping accuracy to utility and is a tunable parameter balancing label quality and costs.
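The two quantities can be sketched as follows, assuming that accuracy is the share of tasks whose inferred label equals the ground truth and that the requester's utility trades the accuracy-derived utility off against the total payment through the tunable weight. The names `utility_fn` and `eta` and the identity default are illustrative assumptions.

```python
def label_accuracy(inferred, truth):
    """Share of tasks whose inferred label equals the ground-truth label."""
    matches = sum(1 for a, b in zip(inferred, truth) if a == b)
    return matches / len(truth)


def requester_utility(accuracy, payments, eta, utility_fn=lambda a: a):
    """Utility of accuracy minus eta times the total payment.

    utility_fn must be non-decreasing in accuracy; the identity default is
    only a placeholder.
    """
    return utility_fn(accuracy) - eta * sum(payments)
```

In the real mechanism the ground truth is not observable, so the accuracy passed to `requester_utility` would have to be an estimate.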

4 Inference-Aided Reinforcement Mechanism for Crowdsourcing

Figure 1: Overview of our incentive mechanism.

Our mechanism mainly consists of three components: the payment rule, Bayesian inference and reinforcement incentive learning (RIL); see Figure 1 for an overview, where estimated values are denoted with tildes. The payment rule computes the payment to worker for his/her label on task


where denotes the scaling factor, determined by RIL at the beginning of every step and shared by all workers. denotes worker ’s score on task , which will be computed by the Bayesian inference algorithm. is a constant representing the fixed base payment. The Bayesian inference algorithm is also responsible for estimating the true labels, workers’ PoBCs and the aggregated label accuracy at each time step, preparing the necessary inputs to RIL. Based on these estimates, RIL seeks to maximize the cumulative utility of the data requester by optimally balancing the utility (accuracy in labels) and the payments.
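A payment rule of this shape can be sketched as an affine function of the score. The sketch below assumes the score is centred by an offset before scaling; the offset of 0.5 (the expected score of a random guess on a binary task) and all names are assumptions made for illustration.

```python
def payment(score, scaling_factor, base_payment=0.0, offset=0.5):
    """Affine payment: scaling_factor * (score - offset) + base_payment.

    score          -- worker's score on one task, in [0, 1]
    scaling_factor -- incentive level chosen for the current step
    base_payment   -- fixed payment per task
    offset         -- centring constant (assumed, not from the source)
    """
    return scaling_factor * (score - offset) + base_payment
```

With this form, a larger scaling factor widens the payment gap between high and low scores while leaving the payment for a chance-level score at the base payment.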

4.1 Bayesian Inference

For the simplicity of notation, we omit the superscript in this subsection. The motivation for designing our own Bayesian inference algorithm is as follows. We ran several preliminary experiments using popular inference algorithms, for example, EM dawid1979maximum ; raykar2010learning ; zhang2014spectral and Variational Inference liu2012variational ; chen2015statistical . Our empirical studies reveal that those methods tend to heavily over-estimate the aggregated label accuracy when the quality of labels is low (see Section 6.1 for detailed experiment results and analysis). This leads to biased estimation of the data requester’s utility (as it cannot be observed directly), and this estimated utility is used as the reward signal in RIL, which will be detailed later. Since the reward signal plays the core role in guiding the reinforcement learning process, the heavy bias will severely mislead our mechanism.

To reduce the estimation bias, we develop a Bayesian inference algorithm by introducing soft Dirichlet priors to both the distribution of true labels , where and denote that of label and , respectively, and workers’ PoBCs . Then, we derive the conditional distribution of true labels given collected labels as (see Appendix A) where denotes the beta function, , , , and , where and .

1:  Input: the collected labels , the number of samples
2:  Output: the sample sequence
3:  Initialize with the uniform distribution

4:  for  to  do
5:     for  to  do
6:          and compute
7:          and compute
8:          Sample with
9:     Append to the sample sequence
Algorithm 1 Gibbs sampling for crowdsourcing
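To make the sampling loop concrete, here is a minimal pure-Python sketch of a collapsed Gibbs sampler for binary crowdsourced labels. It is not the authors' implementation: it assumes labels encoded as +1/-1 with 0 for unassigned tasks, a Beta prior on each worker's accuracy (`prior_acc`) and on the class distribution (`prior_cls`), and illustrative prior values. The joint is recomputed from scratch for each candidate label, which is simple but slow.

```python
import math
import random


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_joint(labels, z, prior_acc, prior_cls):
    """Unnormalised log-probability of the true-label vector z, with the
    class distribution and each worker's accuracy integrated out."""
    n_pos = sum(1 for v in z if v == 1)
    total = log_beta(prior_cls[0] + n_pos, prior_cls[1] + len(z) - n_pos)
    for row in labels:  # one row of collected labels per worker
        right = sum(1 for l, t in zip(row, z) if l != 0 and l == t)
        wrong = sum(1 for l, t in zip(row, z) if l != 0 and l != t)
        total += log_beta(prior_acc[0] + right, prior_acc[1] + wrong)
    return total


def gibbs_true_labels(labels, num_samples, prior_acc=(2.0, 1.0),
                      prior_cls=(1.0, 1.0), seed=0):
    """labels[i][j] is +1 or -1, or 0 if task j was not assigned to worker i.
    Returns num_samples sampled true-label vectors (one per sweep)."""
    rng = random.Random(seed)
    n_tasks = len(labels[0])
    z = [rng.choice((1, -1)) for _ in range(n_tasks)]  # uniform initialisation
    samples = []
    for _ in range(num_samples):
        for j in range(n_tasks):
            log_w = {}
            for cand in (1, -1):
                z[j] = cand
                log_w[cand] = log_joint(labels, z, prior_acc, prior_cls)
            diff = min(log_w[-1] - log_w[1], 700.0)  # guard against overflow
            p_pos = 1.0 / (1.0 + math.exp(diff))     # P(z_j = +1 | rest)
            z[j] = 1 if rng.random() < p_pos else -1
        samples.append(list(z))
    return samples
```

A prior that favours accuracies above one half breaks the symmetry between a labeling and its mirror image; with a flat prior on accuracy, a sampler of this kind can settle on either.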

Note that it is generally hard to derive an explicit formula for the posterior distribution of a specific task ’s ground-truth from the conditional distribution . We thus resort to Gibbs sampling for the inference. More specifically, according to Bayes’ theorem, we know that the conditional distribution of task ’s ground-truth satisfies , where denotes all tasks excluding . Leveraging this, we generate samples of the true label vector following Algorithm 1. At each step of the sampling procedure (lines 6-7), Algorithm 1 first computes and then generates a new sample of to replace the old one in . After traversing through all tasks, Algorithm 1 generates a new sample of the true label vector . Repeating this process for times, we get samples, which are recorded in . Here, we write the -th sample as . Since Gibbs sampling requires a burn-in process, we discard the first samples and calculate worker ’s score on task and PoBC as


Similarly, we can obtain the estimates of the true label distribution and then derive the log-ratio of task , . Furthermore, we decide the true label estimate as if and as if . Correspondingly, the label accuracy is estimated as
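The post-processing described above can be sketched as follows, given a list of sampled true-label vectors. This is a simplified illustration under stated assumptions: labels are +1/-1 with 0 for unassigned tasks, a score is the share of retained samples that agree with the worker's label, the PoBC estimate is the plain average of a worker's scores (no prior smoothing), and the accuracy estimate averages the posterior probability of the chosen label over tasks.

```python
import math


def summarize_samples(samples, labels, burn_in):
    """Estimate scores, PoBCs, true labels and label accuracy from samples.

    samples -- list of sampled true-label vectors (entries +1 / -1)
    labels  -- labels[i][j] in {+1, -1}, or 0 if task j was not given to i
    burn_in -- number of leading samples to discard
    """
    kept = samples[burn_in:]
    n_kept, n_tasks = len(kept), len(kept[0])

    scores, pobcs = [], []
    for row in labels:
        # Score on task j: share of kept samples that agree with the label.
        row_scores = [
            sum(1 for s in kept if s[j] == row[j]) / n_kept if row[j] != 0 else None
            for j in range(n_tasks)
        ]
        assigned = [sc for sc in row_scores if sc is not None]
        scores.append(row_scores)
        pobcs.append(sum(assigned) / len(assigned))  # assumes >= 1 assigned task

    est_labels, confidences = [], []
    for j in range(n_tasks):
        p_pos = sum(1 for s in kept if s[j] == 1) / n_kept
        p_pos = min(max(p_pos, 1e-6), 1.0 - 1e-6)  # keep the log-ratio finite
        log_ratio = math.log(p_pos / (1.0 - p_pos))
        est_labels.append(1 if log_ratio > 0 else -1)  # ties go to -1 here
        confidences.append(1.0 / (1.0 + math.exp(-abs(log_ratio))))
    return scores, pobcs, est_labels, sum(confidences) / n_tasks
```

Because the accuracy estimate is built from posterior probabilities rather than from ground truth, it can be computed at every step without any verified labels.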


In our Bayesian inference algorithm, workers’ scores, PoBCs and the true label distribution are all estimated by comparing the true label samples with the collected labels. Thus, to prove the convergence of our algorithm, we need to bound the ratio of wrong samples. We introduce and to denote the number of tasks of which the true label sample in Eqn. (5) is correct () and wrong () in the -th sample, respectively. Formally, we have:

Lemma 1.

Let , and . When ,


where , and .

The proof is in Appendix B. Our main idea is to introduce a set of counts for the collected labels and then calculate and based on the distribution of these counts. Using Lemma 1, the convergence of our Bayesian inference algorithm is stated as follows:

Theorem 1 (Convergence).

When and , if most workers report truthfully (i.e. ), with probability at least , holds for any worker ’s PoBC estimate as well as the true label distribution estimate ().

The convergence of and can naturally lead to the convergence of and because the latter estimates are fully computed based on the former ones. All these convergence guarantees enable us to use the estimates computed by Bayesian inference to construct the state and reward signal in our reinforcement learning algorithm RIL.

4.2 Reinforcement Incentive Learning

In this subsection, we formally introduce our reinforcement incentive learning (RIL) algorithm, which adjusts the scaling factor to maximize the data requester’s utility accumulated in the long run. To fully understand the technical background, readers are expected to be familiar with Q-values and function approximation. Readers who are not can refer to Appendix D, where we provide background on these concepts. With a suitable transformation, our problem can be modeled as a Markov Decision Process. To be more specific, our mechanism is the agent and it interacts with workers (i.e. the environment); the scaling factor is the action; the utility of the data requester defined in Eqn. (3) is the reward. Workers’ reporting strategies are the state. After receiving payments, workers may change their strategies to, for example, increase their utilities at the next step. How workers change their strategies forms the state transition kernel.

On the other hand, the reward defined in Eqn. (3) cannot be directly used because the true accuracy cannot be observed. Thus, we use the estimated accuracy calculated by Eqn. (6) instead to approximate as in Eqn. (8). Furthermore, to achieve better generalization across different states, it is a common approach to learn a feature-based state representation Mnih15 ; Liang16 . Recall that the data requester’s implicit utility at time only depends on the aggregated PoBC averaged across all workers. This observation already points to a representation design with good generalization, namely . Further recall that, when deciding the current scaling factor , the data requester does not observe the latest workers’ PoBCs and thus cannot directly estimate the current . Due to this one-step delay, we have to build our state representation using the previous observation. Since most workers would only change their internal states after receiving a new incentive, there exists some imperfect mapping function . Utilizing this implicit function, we introduce the augmented state representation in RIL as in Eqn. (8).


Since neither nor can be perfectly inferred, it would not be a surprise to observe some noise that cannot be directly learned in our Q-function. For most crowdsourcing problems the number of tasks is large, so we can leverage the central limit theorem to justify our modeling of the noise using a Gaussian process. To be more specific, we calculate the temporal difference (TD) error as

1:  for each episode do
2:     for each step in the episode do
3:         Decide the scaling factor as (-greedy method)
4:         Assign tasks and collect labels from the workers
5:         Run Bayesian inference to get and
6:         Use to update , and in Eqn. (10)
Algorithm 2 Reinforcement Incentive Learning (RIL)

where the noise follows a Gaussian process, and denotes the current policy. By doing so, we gain two benefits. First, the approximation greatly simplifies the derivation of the update equation for the Q-function. Secondly, as shown in our empirical results later, this kind of approximation is robust against different worker models. Besides, following gasic2014gaussian we approximate Q-function as , where also follows a Gaussian process.

Under the Gaussian process approximation, all the observed rewards and the corresponding values up to the current step form a system of equations, and it can be written as , where , and denote the collection of rewards, values, and residuals. Following Gaussian process’s assumption for residuals, , where . The matrix satisfies and for . Then, by using the online Gaussian process regression algorithm engel2005reinforcement , we effectively learn the Q-function as


where and . Here, we use to denote the Gaussian kernel. Finally, we employ the classic -greedy method to decide based on the learned Q-function. To summarize, we provide a formal description about RIL in Algorithm 2. Note that, when updating , and in Line 6, we employ the sparse approximation proposed in gasic2014gaussian to discard some data so that the size of these matrices does not increase infinitely.
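Two small building blocks mentioned above, the Gaussian kernel and the greedy-with-exploration action choice, can be sketched as follows. This is only an illustration: the Q-function is passed in as a generic callable rather than the Gaussian process regression described in the text, and the function names, the `length_scale` parameter and the argument layout are assumptions.

```python
import math
import random


def gaussian_kernel(x, y, length_scale=1.0):
    """Gaussian (RBF) kernel between two feature vectors of equal length."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * length_scale ** 2))


def choose_scaling_factor(q_value, state, actions, epsilon, rng=random):
    """Pick a scaling factor with the epsilon-greedy rule.

    q_value -- callable (state, action) -> estimated long-run utility
    actions -- finite list of candidate scaling factors
    epsilon -- exploration rate in [0, 1]
    """
    if rng.random() < epsilon:
        return rng.choice(actions)  # explore: uniform random action
    return max(actions, key=lambda a: q_value(state, a))  # exploit
```

Setting `epsilon` to zero yields the purely greedy policy, which is how a learned policy would typically be evaluated once exploration is switched off.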

5 Theoretical Analysis on Incentive Compatibility

In this section, we prove the incentive compatibility of our Bayesian inference and reinforcement learning algorithms. Our main results are as follows:

Theorem 2 (One Step IC).

At any time step , when , reporting truthfully and exerting high efforts is the utility-maximizing strategy for any worker at equilibrium (if other workers all follow this strategy).

In Appendix E, we prove that when , if , any worker ’s utility-maximizing strategy would be reporting truthfully and exerting high efforts. Since Theorem 1 has provided the convergence guarantee, we can conclude Theorem 2. ∎

Theorem 3 (Long Term IC).

Suppose the conditions in Theorem 2 are satisfied and the learned -function approaches the real . When the following equation holds for ,


always reporting truthfully and exerting high efforts is the utility-maximizing strategy for any worker in the long term if other workers all follow this strategy. Here, denotes the minimal gap between two available values of the scaling factor.

In order to induce RIL to change actions, worker must let RIL learn a wrong -function. Thus, our main idea of proof is to derive the upper bounds of the effects of worker ’s reports on the -function. Besides, Theorem 3 points out that, to design reinforcement learning algorithms that are robust against manipulation by strategic agents, we should leave a certain gap between actions. This observation may be of independent interest to reinforcement learning researchers.

6 Empirical Experiments

In this section, we empirically investigate the competitiveness of our solution. To be more specific, we first show our proposed Bayesian inference algorithm can produce more accurate estimates about the aggregated label accuracy when compared with the existing inference algorithms. Then, we demonstrate that, aided by Bayesian inference, our RIL algorithm consistently manages to learn a good incentive policy under various worker models. Lastly, we show as a bonus benefit of our mechanism that, leveraging Bayesian inference to fully exploit the information contained in the collected labels leads to more robust and lower-variance payments at each step.

6.1 Empirical Analysis on Bayesian Inference

The aggregated label accuracy estimated from our Bayesian inference algorithm serves as a major component of the state representation and reward function to RIL, and thus critically affects the performance of our mechanism. Given this, we first investigate the bias of our Bayesian inference algorithm. In Figure 2(a), we compare our Bayesian inference algorithm with two popular inference algorithms in crowdsourcing, that is, the EM estimator raykar2010learning and the variational inference estimator liu2012variational . Here, we employ the well-known RTE dataset, where workers need to check whether a hypothesis sentence can be inferred from the provided sentence snow2008cheap . In order to simulate strategic behaviors of workers, we mix these data with random noise by replacing a part of real-world labels with uniformly generated ones (low quality labels).
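The noise-mixing step can be sketched as below. The function name, the flat-list layout of the labels, the +1/-1 label space and the fixed seed are assumptions for illustration; the source does not specify how the replaced positions are chosen, so this sketch picks them uniformly without replacement.

```python
import random


def inject_uniform_noise(labels, fraction, label_space=(1, -1), seed=0):
    """Replace a given fraction of labels with uniformly drawn ones.

    labels is a flat list of collected labels; a noisy copy is returned and
    the input list is left unchanged.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    n_replace = int(round(fraction * len(noisy)))
    for idx in rng.sample(range(len(noisy)), n_replace):
        noisy[idx] = rng.choice(label_space)  # may coincide with the old label
    return noisy
```

Since a uniformly drawn label matches the original about half of the time on binary tasks, the share of labels that actually change is roughly half of `fraction`.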

From the figure, we conclude that compared with EM and variational inference, our Bayesian inference algorithm can significantly lower the bias of the estimates of the aggregated label accuracy. In fact, we cannot use the estimates from the EM and variational inference as alternatives for the reward signal because the biases of their estimates even reach while the range of the label accuracy is only between . This set of experiments justifies our motivation to develop our own inference algorithm and reinforces our claim that our inference algorithm could provide fundamentals for the further development of potential learning algorithms for crowdsourcing.

Figure 2: Empirical analysis on Bayesian Inference (a) and RIL (b-c). To be more specific, (a) compares the inference bias (i.e. the difference from the inferred label accuracy to the real one) of our Bayesian inference algorithm with that of EM and variational inference, averaged over 100 runs. (b) draws the gap between the estimation of the data requester’s cumulative utility and the real one, smoothed over 5 episodes. (c) shows the learning curve of our mechanism, smoothed over 5 episodes.
Method              Rational         QR               MWU
Fixed Optimal       27.584 (.253)    21.004 (.012)    11.723 (.514)
Heuristic Optimal   27.643 (.174)    21.006 (.001)    12.304 (.515)
Adaptive Optimal    27.835 (.209)    21.314 (.011)    17.511 (.427)
RIL                 27.184 (.336)    21.016 (.018)    15.726 (.416)

Table 1: Performance comparison under three worker models. Data requester’s cumulative utility normalized over the number of tasks. Standard deviation reported in parentheses.

6.2 Empirical Analysis on RIL

We move on to investigate whether RIL consistently learns a good policy, which maximizes the data requester’s cumulative utility . For all the experiments in this subsection, we set , , , , , the set of the scaling factor , the exploration rate for RIL and , for the utility function (Eqn. (3)) and the number of time steps for an episode as . We report the averaged results over 5 runs to reduce the effect of outliers. To demonstrate our algorithm’s general applicability, we test it under three different worker models, each representing a popular family of human behavioral models. We provide a brief description of them as follows; the detailed version is deferred to Appendix H. (i) Rational workers always take the utility-maximizing strategies. (ii) QR workers mckelvey1995quantal follow strategies drawn from a (pre-determined) utility-dependent distribution. This model has been used to study agents with bounded rationality. (iii) MWU workers littlestone1994weighted update their strategies according to the celebrated multiplicative weights update algorithm. This model has been used to study adaptive learning agents.
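The two non-rational worker models can be illustrated with their textbook forms. The sketch below is an assumption-laden simplification, not the setup of Appendix H: quantal response is written as a logit (softmax) choice rule with a hypothetical `rationality` parameter, and the multiplicative weights update uses the exponential variant with a hypothetical `learning_rate`.

```python
import math


def quantal_response(utilities, rationality=1.0):
    """Logit quantal response: P(strategy k) proportional to exp(lambda * u_k)."""
    top = max(utilities)  # subtract the maximum for numerical stability
    weights = [math.exp(rationality * (u - top)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def mwu_step(probs, utilities, learning_rate=0.1):
    """One multiplicative-weights update of a mixed strategy."""
    weights = [p * math.exp(learning_rate * u) for p, u in zip(probs, utilities)]
    total = sum(weights)
    return [w / total for w in weights]
```

A rational worker corresponds to the limit of a very large `rationality` value, where all probability mass moves to the utility-maximizing strategy, whereas an MWU worker only drifts toward it gradually as updates accumulate.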

Our first set of experiments is a continuation of the last subsection. To be more specific, we first focus on the estimation bias of the data requester’s cumulative utility . This value is used as the reward in RIL and is calculated from the estimates of the aggregated label accuracy. This set of experiments aims to investigate whether our RIL module successfully leverages the label accuracy estimates and picks up the right reward signal. As Figure 2(b) shows, the estimates deviate from the real values only slightly after a few episodes of learning, regardless of which worker model the experiments run with. The results further demonstrate that our RIL module observes reliable rewards. The next set of experiments is about how quickly RIL learns. As Figure 2(c) shows, under all three worker models, RIL manages to pick up and stick to a promising policy in less than 100 episodes. This observation also demonstrates the robustness of RIL under different environments.

Our last set of experiments in this subsection aims to evaluate the competitiveness of the policy learned by RIL. In Table 1, we use the policy learned after 500 episodes with exploration rate turned off (i.e. ) and compare it with three benchmarks constructed by ourselves. To create the first one, Fixed Optimal, we try all 4 possible fixed values for the scaling factor and report the highest cumulative reward realized by any of them. To create the second one, Heuristic Optimal, we divide the value region of into five regions: , , , and . For each region, we select a fixed value for the scaling factor . We traverse all possible combinations to decide the optimal heuristic strategy. To create the third one, Adaptive Optimal, we change the scaling factor every steps and report the highest cumulative reward via traversing all possible configurations. This benchmark is infeasible to reproduce in real-world practice once the number of steps becomes large. Yet it is very close to the global optimal in the sequential setting. As Table 1 demonstrates, the benchmarks and RIL all achieve similar performance under rational and QR workers. This is because these two kinds of workers have a fixed pattern in responding to incentives and thus the optimal policy would be a fixed scaling factor throughout the whole episode. In contrast, MWU workers learn utility-maximizing strategies gradually, and the learning process is affected by the incentives. Under this worker environment, RIL manages to achieve an average utility score of 15.726, which is a significant improvement over Fixed Optimal and Heuristic Optimal (which achieve 11.723 and 12.304, respectively), considering that the unrealistic global optimal is only around 17.511. Up to this point, with three sets of experiments, we demonstrate the competitiveness of RIL and its robustness under different worker environments.
Note that, when constructing the benchmarks, we also conduct experiments on DG13, the state-of-the-art peer prediction mechanism for binary labels dasgupta2013crowdsourced , and reach the same conclusion. For example, when DG13 and MWU workers are tested for Fixed Optimal and Heuristic Optimal, the cumulative utilities are and , respectively, which also shows a large gap with RIL.
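The exhaustive search behind benchmarks of this kind can be sketched as a brute-force traversal of scaling-factor schedules. In the sketch, `simulate` is a hypothetical callable that maps a schedule (one scaling factor per block of steps) to the resulting cumulative utility; it stands in for a full run of the crowdsourcing simulation and is not part of the source.

```python
import itertools


def best_schedule(simulate, factors, n_blocks):
    """Exhaustively search schedules that pick one factor per block of steps.

    Returns the pair (schedule, cumulative_utility) with the highest utility.
    The search visits len(factors) ** n_blocks schedules.
    """
    best = None
    for schedule in itertools.product(factors, repeat=n_blocks):
        value = simulate(schedule)
        if best is None or value > best[1]:
            best = (schedule, value)
    return best
```

A fixed-factor benchmark is the special case where the same factor is repeated for every block, for example `best_schedule(lambda s: simulate(s * n_blocks), factors, 1)`. The exponential growth in the number of schedules is why a fully adaptive search becomes impractical as the number of steps grows.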

6.3 Empirical Analysis on One Step Payments

In this subsection, we compare the one step payments provided by our mechanism with the payments calculated by DG13, the state-of-the-art peer prediction mechanism for binary labels dasgupta2013crowdsourced . We fix the scaling factor and set , , , and . To set up the experiments, we generate task ’s true label following its distribution (to be specified) and worker ’s label for task based on ’s PoBC and . In Figure 3(a), we let all workers excluding report truthfully and exert high efforts (i.e. ), and increase from to . In Figure 3(b), we let , and increase other workers’ PoBCs from to . As both figures reveal, in our mechanism, the payment for worker depends almost exclusively on his/her own strategy. In contrast, in DG13, the payments are clearly affected by the distribution of true labels and the strategies of other workers. In other words, our Bayesian inference is more robust to different environments. Furthermore, in Figure 3(c), we present the standard deviation of the payment to worker . We let , and increase from to . As shown in the figure, our method manages to achieve a noticeably smaller standard deviation compared to DG13. Note that, in Figure 3(b), we implicitly assume that most workers will at least not adversarially report false labels, an assumption widely adopted in previous studies liu2012variational . For workers’ collusion attacks, we also provide some defenses in Appendix F.
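The data-generation step of this experiment can be sketched as follows. The function name, the +1/-1 label encoding and the fixed seed are assumptions for illustration: each true label is drawn from the class distribution, and each worker reports the true label with probability equal to his/her PoBC and the opposite label otherwise.

```python
import random


def simulate_labels(class_prob, pobcs, n_tasks, seed=0):
    """Draw true labels and worker labels for a set of binary tasks.

    class_prob -- probability that a task's true label is +1
    pobcs      -- list with each worker's probability of being correct
    Returns (truth, labels) where labels[i][j] is worker i's label on task j.
    """
    rng = random.Random(seed)
    truth = [1 if rng.random() < class_prob else -1 for _ in range(n_tasks)]
    labels = [[t if rng.random() < p else -t for t in truth] for p in pobcs]
    return truth, labels
```

Sweeping `class_prob` or the entries of `pobcs` over a grid and averaging payments over many seeds gives curves of the kind described in the paragraph above.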

Figure 3: Empirical analysis on our Bayesian inference algorithm, averaged over 1000 runs. (a) Average payment per task given true label’s distribution. (b) Average payment per task given PoBCs of workers excluding . (c) The standard deviation of the payment given worker ’s PoBC.
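The data-generation step of this experiment can be illustrated with a small simulator. This is a minimal sketch under stated assumptions: labels are binary and encoded as 0/1, PoBC is read as a worker's probability of reporting the true label, and the names `generate_labels`, `p_true`, and `pobc` are hypothetical. It does not reproduce the parameter values used for Figure 3, nor the payment rules of either mechanism.

```python
import random
from typing import List, Sequence, Tuple


def generate_labels(
    n_tasks: int,
    pobc: Sequence[float],
    p_true: float,
    seed: int = 0,
) -> Tuple[List[int], List[List[int]]]:
    """Draw true labels, then one noisy report per (worker, task) pair.

    p_true : probability that a task's true label is 1.
    pobc   : per-worker probability of reporting the true label.
    """
    rng = random.Random(seed)
    true_labels = [1 if rng.random() < p_true else 0 for _ in range(n_tasks)]
    reports = []
    for p in pobc:
        # Report the true label with probability p, otherwise flip it.
        reports.append([t if rng.random() < p else 1 - t for t in true_labels])
    return true_labels, reports


def empirical_accuracy(true_labels: Sequence[int], worker_reports: Sequence[int]) -> float:
    """Fraction of tasks on which a worker's report matches the true label."""
    hits = sum(t == r for t, r in zip(true_labels, worker_reports))
    return hits / len(true_labels)
```

Sweeping `p_true` with `pobc` held fixed mimics the setting of panel (a), while sweeping the entries of `pobc` for all but one worker mimics panel (b). Any payment rule can then be evaluated on the generated reports over repeated seeds to estimate its mean and standard deviation.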

7 Conclusion

In this paper, we build an inference-aided reinforcement mechanism that leverages Bayesian inference and reinforcement learning techniques to learn the optimal policy for incentivizing high-quality labels from crowdsourcing. Our mechanism is proved to be incentive compatible. Empirically, we show that our Bayesian inference algorithm helps improve the robustness and lower the variance of payments, which are favorable properties in practice. Meanwhile, our reinforcement incentive learning (RIL) algorithm enables our mechanism to perform consistently well under different worker models.


This work was conducted within Rolls-Royce@NTU Corporate Lab with support from the National Research Foundation (NRF) Singapore under the Corp Lab@University Scheme. Yitao is partially supported by NSF grants #IIS-1657613, #IIS-1633857 and DARPA XAI grant #N66001-17-2-4032. The authors also thank Anxiang Zeng from Alibaba Group for valuable discussions.


  • [1] R. Arratia and L. Gordon. Tutorial on large deviations for the binomial distribution. Bulletin of Mathematical Biology, 51(1):125–131, Jan 1989.
  • [2] Erick Chastain, Adi Livnat, Christos Papadimitriou, and Umesh Vazirani. Algorithms, games, and evolution. PNAS, 111(29):10620–10623, 2014.
  • [3] Xi Chen, Qihang Lin, and Dengyong Zhou. Statistical decision making for optimal budget allocation in crowd labeling. Journal of Machine Learning Research, 16:1–46, 2015.
  • [4] Anirban Dasgupta and Arpita Ghosh. Crowdsourced judgement elicitation with endogenous proficiency. In Proc. of WWW, 2013.
  • [5] Alexander Philip Dawid and Allan M Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pages 20–28, 1979.
  • [6] Djellel Eddine Difallah, Michele Catasta, Gianluca Demartini, Panagiotis G Ipeirotis, and Philippe Cudré-Mauroux. The dynamics of micro-task crowdsourcing: The case of Amazon MTurk. In Proc. of WWW, 2015.
  • [7] Yaakov Engel, Shie Mannor, and Ron Meir. Reinforcement learning with Gaussian processes. In Proc. of ICML, 2005.
  • [8] Milica Gasic and Steve Young. Gaussian processes for POMDP-based dialogue manager optimization. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(1):28–40, 2014.
  • [9] Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
  • [10] Geoffrey J. Gordon. Reinforcement Learning with Function Approximation Converges to a Region. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), pages 1040–1046, 2000.
  • [11] Jeff Howe. The rise of crowdsourcing. Wired Magazine, 14(6), 06 2006.
  • [12] Radu Jurca, Boi Faltings, et al. Mechanisms for making crowds truthful. Journal of Artificial Intelligence Research, 34(1):209, 2009.
  • [13] Joel Z. Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. In Proc. of AAMAS, 2017.
  • [14] Yitao Liang, Marlos C. Machado, Erik Talvitie, and Michael Bowling. State of the art control of Atari games using shallow reinforcement learning. In Proc. of AAMAS, 2016.
  • [15] Nick Littlestone and Manfred K Warmuth. The weighted majority algorithm. Information and computation, 108(2):212–261, 1994.
  • [16] Qiang Liu, Jian Peng, and Alexander T Ihler. Variational inference for crowdsourcing. In Proc. of NIPS, 2012.
  • [17] Yang Liu and Yiling Chen. Machine-learning aided peer prediction. In Proc. of ACM EC, 2017.
  • [18] Yang Liu and Yiling Chen. Sequential peer prediction: Learning to elicit effort using posted prices. In Proc. of AAAI, pages 607–613, 2017.
  • [19] Richard D McKelvey and Thomas R Palfrey. Quantal response equilibria for normal form games. Games and economic behavior, 10(1):6–38, 1995.
  • [20] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level Control through Deep Reinforcement Learning. Nature, 518(7540):529–533, 02 2015.
  • [21] Frank W. J. Olver. NIST Handbook of Mathematical Functions. Cambridge University Press, 2010.
  • [22] Dražen Prelec. A Bayesian truth serum for subjective data. Science, 306(5695):462–466, 2004.
  • [23] Goran Radanovic and Boi Faltings. A robust Bayesian truth serum for non-binary signals. In Proc. of AAAI, 2013.
  • [24] Vikas C Raykar, Shipeng Yu, Linda H Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. Journal of Machine Learning Research, 11(Apr):1297–1322, 2010.
  • [25] Vidyasagar Sadhu, Gabriel Salles-Loustau, Dario Pompili, Saman A. Zonouz, and Vincent Sritapan. Argus: Smartphone-enabled human cooperation via multi-agent reinforcement learning for disaster situational awareness. In Proc. of ICAC, 2016.
  • [26] Victor S Sheng, Foster Provost, and Panagiotis G Ipeirotis. Get another label? improving data quality and data mining using multiple, noisy labelers. In Proc. of SIGKDD, 2008.
  • [27] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550:354–359, 2017.
  • [28] Edwin D Simpson, Matteo Venanzi, Steven Reece, Pushmeet Kohli, John Guiver, Stephen J Roberts, and Nicholas R Jennings. Language understanding in the wild: Combining crowdsourcing and machine learning. In Proc. of WWW, 2015.
  • [29] Aleksandrs Slivkins and Jennifer Wortman Vaughan. Online decision making in crowdsourcing markets: Theoretical challenges. ACM SIGecom Exchanges, 12(2):4–23, 2014.
  • [30] Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Y Ng. Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks. In Proc. of EMNLP, 2008.
  • [31] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
  • [32] Csaba Szepesvári. Algorithms for Reinforcement Learning. Synthesis lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool, 2010.
  • [33] Gerald Tesauro. Temporal Difference Learning and TD-Gammon. Communications of the ACM, 38(3):58–68, March 1995.
  • [34] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proc. of AAAI, 2016.
  • [35] Yue Wang and Fumin Zhang. Trends in Control and Decision-Making for Human-Robot Collaboration Systems. Springer Publishing Company, Incorporated, 1st edition, 2017.
  • [36] Christopher J. C. H. Watkins and Peter Dayan. Technical Note: Q-Learning. Machine Learning, 8(3-4), May 1992.
  • [37] Jens Witkowski and David C Parkes. Peer prediction without a common prior. In Proc. of ACM EC, 2012.
  • [38] Chao Yu, Minjie Zhang, and Fenghui Ren. Emotional multiagent reinforcement learning in social dilemmas. In PRIMA, 2013.
  • [39] Yuchen Zhang, Xi Chen, Denny Zhou, and Michael I Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In Proc. of NIPS, 2014.
  • [40] Yudian Zheng, Guoliang Li, Yuanbing Li, Caihua Shan, and Reynold Cheng. Truth inference in crowdsourcing: is the problem solved? Proc. of the VLDB Endowment, 10(5):541–552, 2017.
  • [41] Dengyong Zhou, Qiang Liu, John Platt, and Christopher Meek. Aggregating ordinal labels from crowds by minimax conditional entropy. In Proc. of ICML, 2014.
  • [42] Denny Zhou, Sumit Basu, Yi Mao, and John C Platt. Learning from the wisdom of crowds by minimax entropy. In Proc. of NIPS, 2012.


A Derivation of Posterior Distribution

It is not hard to derive the joint distribution of the collected labels

and the true labels


where and . and denote the distribution of true label and , respectively. Besides, and . Then, the joint distribution of , , and


where denotes the beta function, and

In this case, we can conduct marginalization via integrating the joint distribution over and as


where and . Following Bayes’ theorem, we can know that


B Proof for Lemma 1

B.1 Basic Lemmas

We first present some lemmas that are used later in our proof.

Lemma 2.

If , holds for any , where is the binomial distribution.



denotes the moment generating function. ∎
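For reference, the moment generating function of a binomial variable X ~ Bin(n, p) has the standard closed form E[exp(tX)] = (1 − p + p·exp(t))^n. The sketch below only checks this textbook identity numerically by comparing it with the defining sum; the function names are illustrative and the check makes no claim about the exact form in which Lemma 2 uses the identity.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Bin(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def mgf_by_sum(t: float, n: int, p: float) -> float:
    """E[exp(t X)] computed directly from the definition."""
    return sum(math.exp(t * k) * binomial_pmf(k, n, p) for k in range(n + 1))


def mgf_closed_form(t: float, n: int, p: float) -> float:
    """Closed form (1 - p + p e^t)^n of the binomial moment generating function."""
    return (1 - p + p * math.exp(t)) ** n
```

Setting t = 0 gives 1 in both expressions, which is a quick sanity check that the probabilities sum to one.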

Lemma 3.

For given , if , we can have

By the definition of the beta function [21],


we can have


where we regard and . Thus, according to Lemma 2, we can obtain


For the integral operation, substituting with at first and then with , we can conclude Lemma 3. ∎
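The proof above relies on the beta function, whose standard definition is B(x, y) = ∫₀¹ u^(x−1) (1−u)^(y−1) du = Γ(x)Γ(y)/Γ(x+y). As a hedged illustration, the sketch below compares the integral and gamma-function forms numerically and checks one well-known link to binomial terms, C(n, k)·B(k+1, n−k+1) = 1/(n+1). The helper names are hypothetical, and the midpoint rule is used only for simplicity; it is assumed that x, y ≥ 1 so that the integrand is bounded.

```python
import math


def beta_via_gamma(x: float, y: float) -> float:
    """B(x, y) = Gamma(x) Gamma(y) / Gamma(x + y), computed in log space."""
    return math.exp(math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y))


def beta_via_integral(x: float, y: float, n_steps: int = 20000) -> float:
    """Midpoint-rule approximation of the defining integral (assumes x, y >= 1)."""
    h = 1.0 / n_steps
    total = 0.0
    for i in range(n_steps):
        u = (i + 0.5) * h
        total += u ** (x - 1) * (1 - u) ** (y - 1)
    return total * h


def binomial_term_integral(n: int, k: int) -> float:
    """Integral over p in [0, 1] of C(n, k) p^k (1 - p)^(n - k)."""
    return math.comb(n, k) * beta_via_gamma(k + 1, n - k + 1)
```

The last helper shows that a binomial probability integrated over a uniform success rate equals 1/(n+1) regardless of k, which is the kind of relation that lets beta-function ratios be treated with binomial tools.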

Lemma 4.


Lemma 5.


Lemma 6.


Lemma 7.


Lemma 8.


Lemma 9.

If , we can have

where .

To prove the lemmas above, we first define


Then, Lemma 4 can be obtained by expanding . Lemma 5 can be proved as follows


Lemma 6 can be obtained as follows


For Lemma 7, we can have


Thus, we can have


which concludes Lemma 7. Then, Lemma 8 can be obtained by considering Eqn. (25).


For Lemma 9, we can have


where . Let , we can have


Since , and . Considering Hoeffding’s inequality, we can get


which concludes the first inequality in Lemma 9. Similarly, for the second inequality, we can have


where . Suppose , we can have


Considering Hoeffding’s inequality, we can also get


which concludes the second inequality in Lemma 9. ∎
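Both parts of the proof above invoke Hoeffding's inequality. For a binomial variable X ~ Bin(n, p), its standard one-sided form reads P(X − np ≥ nε) ≤ exp(−2nε²). The following minimal sketch compares the exact binomial tail with this bound; it illustrates the generic inequality only, and the function names and the parameter grid are assumptions rather than quantities from the lemma.

```python
import math


def binomial_upper_tail(n: int, p: float, threshold: float) -> float:
    """Exact P(X >= threshold) for X ~ Bin(n, p)."""
    k_min = max(0, math.ceil(threshold))
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1)
    )


def hoeffding_bound(n: int, eps: float) -> float:
    """Hoeffding upper bound on P(X - n p >= n eps) for a sum of n variables in [0, 1]."""
    return math.exp(-2 * n * eps**2)


def check_grid():
    """Return (n, p, eps, exact tail, bound) for a small illustrative grid."""
    rows = []
    for n in (10, 50, 200):
        for p in (0.2, 0.5, 0.8):
            for eps in (0.05, 0.1, 0.2):
                tail = binomial_upper_tail(n, p, n * (p + eps))
                rows.append((n, p, eps, tail, hoeffding_bound(n, eps)))
    return rows
```

The bound is distribution-free, so it is typically loose for any particular p, but it decays exponentially in n, which is the property the concentration arguments in this appendix depend on.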

Lemma 10.

For any , we can have

First, we know . Let . Then, we have and . Thus, and we can conclude Lemma 10 by substituting this inequality into the equality. ∎

Lemma 11.

is a concave function when .

, where . when . Thus, is monotonically decreasing when , which concludes Lemma 11. ∎

Lemma 12.

For ,


When , we can have


When , we can have


Lemma 13.

If and , then

where .

For the first inequality, we can have


According to the inequality in [1], we can have


where , which concludes the first inequality in Lemma 13.

For the second inequality, we can have