1 Introduction
The ability to quickly collect large-scale and high-quality labeled datasets is crucial for Machine Learning (ML). Among all proposed solutions, one of the most promising options is crowdsourcing
Howe2006 ; slivkins2014online ; difallah2015dynamics ; simpson2015language . Nonetheless, it has been noted that crowdsourced data often suffers from quality issues, because workers’ contributions are neither monitored nor verified against the ground truth. This quality control challenge has been tackled by two relatively disconnected research communities. On the ML side, quite a few inference techniques have been developed to infer true labels from crowdsourced and potentially noisy labels raykar2010learning ; liu2012variational ; zhou2014aggregating ; zheng2017truth . These solutions often work as one-shot, post-processing procedures facing a static set of workers, whose labeling accuracy is fixed and informative. Despite their empirical success, these methods ignore the effects of incentives when dealing with human inputs. It has been observed both in theory and in practice that, without appropriate incentives, selfish and rational workers tend to contribute low-quality, uninformative, if not malicious data sheng2008get ; liu2017sequential . Existing inference algorithms are very vulnerable to these cases: either many more redundant labels are needed (low-quality inputs), or the methods simply fail to work (uninformative or malicious inputs).
On the less ML-focused side, the same quality control question has been studied in the context of incentive mechanism design. In particular, a family of mechanisms, jointly referred to as peer prediction, has been proposed prelec2004bayesian ; jurca2009mechanisms ; witkowski2012peer ; dasgupta2013crowdsourced . Existing peer prediction mechanisms focus on achieving incentive compatibility (IC), which means that truthfully reporting private data, or reporting high-quality data, maximizes workers’ expected utilities.
These mechanisms achieve IC by comparing the reports from the to-be-scored worker against those from a randomly selected reference worker, which bypasses the lack of ground-truth verification. However, we note several undesirable properties of these methods. Firstly, from a learning perspective, the collected labels contain rich information about the ground-truth labels and workers’ labeling accuracy. Existing peer prediction mechanisms often rely on reported data from a small subset of reference workers, which represents only a limited share of the overall collected information. As a consequence, the mechanism designer forgoes the opportunity to leverage learning methods to generate a more credible and informative reference answer for the purpose of evaluation. Secondly, existing peer prediction mechanisms often require a certain level of prior knowledge about workers’ models, such as the cost of exerting effort and their labeling accuracy at different effort levels. However, this prior knowledge is difficult to obtain in real-world environments. Thirdly, they often assume that workers are all fully rational and always follow the utility-maximizing strategy. In practice, workers may instead adapt their strategies dynamically.
In this paper, we propose an inference-aided reinforcement mechanism, aiming to merge and extend the techniques from both the inference and incentive design communities to address the caveats that arise when they are employed alone, as discussed above. The high-level idea is as follows: we collect data in a sequential fashion. At each step, we assign workers a certain number of tasks and estimate the true labels and workers’ strategies from their labels. Relying on these estimates, a reinforcement learning (RL) algorithm is proposed to uncover how workers respond to different levels of offered payments. The RL algorithm determines the payments for the workers based on the information collected up to date. By doing so, our mechanism not only incentivizes (non-)rational workers to provide high-quality labels but also dynamically adjusts the payments according to workers’ responses to maximize the data requester’s cumulative utility. Applying standard RL solutions here is challenging because both the states (workers’ labeling strategies) and the reward (the aggregated label accuracy) are unobservable, the latter due to the lack of ground-truth labels. Leveraging standard inference methods to estimate both the states and the reward seems a plausible solution at first sight, but we observe that existing methods tend to overestimate the aggregated label accuracy, which would mislead the RL algorithm built on top of them.
We address the above challenges and make the following contributions: (1) We propose a Gibbs sampling augmented Bayesian inference algorithm, which estimates workers’ labeling strategies and the aggregated label accuracy, as done in most existing inference algorithms, but significantly lowers the estimation bias of labeling accuracy. This lays a strong foundation for constructing correct reward signals, which are extremely important if one wants to leverage reinforcement learning techniques. (2) A reinforcement incentive learning (RIL) algorithm is developed to maximize the data requester’s cumulative utility by dynamically adjusting incentive levels according to workers’ responses to payments. (3) We prove that our Bayesian inference algorithm and RIL algorithm are incentive compatible (IC) at each step and in the long run, respectively. (4) Experiments are conducted to test our mechanism, and they show that it performs consistently well under different worker models. Meanwhile, compared with the state-of-the-art peer prediction solutions, our Bayesian inference aided mechanism can improve the robustness and lower the variance of payments.
2 Related Work
Our work is inspired by the following three lines of literature:
Peer Prediction: This line of work, addressing the incentive issues of eliciting high-quality data without verification, started roughly with the seminal works prelec2004bayesian ; gneiting2007strictly . A series of follow-ups have relaxed various assumptions that have been made jurca2009mechanisms ; witkowski2012peer ; radanovic2013robust ; dasgupta2013crowdsourced .
Inference method: Recently, inference methods have been applied to crowdsourcing settings, aiming to uncover the true labels from multiple noisy copies. Notable successes include the EM method dawid1979maximum ; raykar2010learning ; zhang2014spectral , Variational Inference liu2012variational ; chen2015statistical and Minimax Entropy Inference zhou2012learning ; zhou2014aggregating . In addition, Zheng et al. zheng2017truth provide a good survey of the existing methods.
Reinforcement Learning: Over the past two decades, reinforcement learning (RL) algorithms have been proposed to iteratively improve the acting agent’s learned policy Watkins92 ; Tesauro95 ; Sutton98 ; Gordon00 ; Szepesvari10 . More recently, with the help of advances in feature extraction and state representation, RL has made several breakthroughs in achieving human-level performance in challenging domains Mnih15 ; Liang16 ; Hasselt2016DeepRL ; Silver17 . Meanwhile, many studies successfully deploy RL to address some societal problems Yu2013EmotionalMR ; Leibo2017 . RL has also helped make progress in human-agent collaboration engel2005reinforcement ; gasic2014gaussian ; Sadhu2016ArgusSH ; Wang2017 .
Our work differs from the above literature in that it connects incentive mechanisms with ML. Very few recent studies share a similar research direction. For example, to improve the data requester’s utility in crowdsourcing settings, Liu and Chen liu2017sequential develop a multi-armed bandit algorithm to adapt the state-of-the-art peer prediction mechanism DG13 dasgupta2013crowdsourced to a prior-free setting. Nonetheless, the results in that work require workers to follow a Nash Equilibrium at each step in a sequential setting, which is hard to achieve in practice. Instead of randomly choosing a reference worker as commonly done in peer prediction, Liu and Chen liu2017machine propose to use supervised learning algorithms to generate the reference reports and derive the corresponding IC conditions. However, these reports need to be based on the contextual information of the tasks. By contrast, in this paper, without assuming any contextual information about the tasks, we use Bayesian inference to learn workers’ states and true labels, which leads to an unsupervised-learning solution.
3 Problem Formulation
This paper considers the following data acquisition problem via crowdsourcing: at each discrete time step , a data requester assigns tasks with binary answer space to candidate workers to label. Workers receive payments for submitting a label for each task. We use to denote the label worker generates for task at time . For simplicity of computation, we reserve if is not assigned to . Furthermore, we use and to denote the set of ground-truth labels and the set of all collected labels respectively.
The generated label depends both on the latent ground truth and worker ’s strategy, which is mainly determined by two factors: exerted effort level (high or low) and reporting strategy (truthful or deceitful). Following the notation commonly used in reinforcement learning, we also refer to worker ’s strategy as his/her internal state. At any given time, workers may adopt an arbitrary combination of effort level and reporting strategy at will. Specifically, we define and as worker ’s probability of exerting high efforts and reporting truthfully for task , respectively. Furthermore, we use and to denote worker ’s probability of observing the true label when exerting high and low efforts, respectively. Correspondingly, we denote worker ’s cost of exerting high and low efforts by and , respectively. For simplicity of analysis, we assume that and . All the above parameters and workers’ actions remain unknown to our mechanism. In other words, we regard workers as black boxes, which distinguishes our mechanism from the existing peer prediction mechanisms.
Worker ’s probability of being correct (PoBC) at time for any given task is given as
(1) 
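To make the PoBC concrete, the following is a minimal Python sketch rather than the paper's exact formula. It assumes that a deceitful report simply flips the observed binary label, so a report is correct either when the worker observes the true label and reports truthfully, or observes the wrong label and flips it. The function name `pobc` and the parameter names `eft`, `rpt`, `p_high` and `p_low` are hypothetical stand-ins for the effort probability, the truthful-reporting probability, and the two observation accuracies defined above.

```python
def pobc(eft: float, rpt: float, p_high: float, p_low: float) -> float:
    """Probability that a reported binary label matches the ground truth.

    Assumption (not stated in the text): a deceitful report flips the
    observed label.
    """
    # Probability of observing the true label, averaged over effort levels.
    p_observe = eft * p_high + (1.0 - eft) * p_low
    # Correct if (observed correctly and truthful) or (observed wrongly and flipped).
    return rpt * p_observe + (1.0 - rpt) * (1.0 - p_observe)
```

Under this reading, a worker who always exerts high effort and always reports truthfully has a PoBC equal to the high-effort observation accuracy, while a worker who always flips the label has the complementary value.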
Suppose we assign tasks to worker at step . Then, a risk-neutral worker’s utility satisfies:
(2) 
where denotes our payment to worker for task at time (see Section 4 for more details).
At the beginning of each step, the data requester and workers agree on a payment rule, which is not changed until the next time step. The workers are self-interested and may choose their labeling and reporting strategies according to the expected utility they can obtain. After collecting the generated labels, the data requester infers the true labels by running a certain inference algorithm. The aggregated label accuracy and the data requester’s utility are defined as follows:
(3) 
where is a non-decreasing monotonic function mapping accuracy to utility and is a tunable parameter balancing label quality and costs.
4 Inference-Aided Reinforcement Mechanism for Crowdsourcing
Our mechanism mainly consists of three components: the payment rule, Bayesian inference and reinforcement incentive learning (RIL); see Figure 1 for an overview, where estimated values are denoted with tildes. The payment rule computes the payment to worker for his/her label on task
(4) 
where denotes the scaling factor, determined by RIL at the beginning of every step and shared by all workers. denotes worker ’s score on task , which will be computed by the Bayesian inference algorithm. is a constant representing the fixed base payment. The Bayesian inference algorithm is also responsible for estimating the true labels, workers’ PoBCs and the aggregated label accuracy at each time step, preparing the necessary inputs to RIL. Based on these estimates, RIL seeks to maximize the cumulative utility of the data requester by optimally balancing the utility (accuracy in labels) and the payments.
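The payment rule and the requester's utility can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the exact form of Eqn. (3) or Eqn. (4): the payment is taken to be affine in the score (scaling factor times score plus base payment, with any centering of the score omitted), and the utility is taken to be the value of the accuracy minus the weighted total payment. All names (`payment`, `requester_utility`, `scale`, `base`, `eta`, `value_fn`) are hypothetical.

```python
from typing import Callable, Sequence


def payment(scale: float, score: float, base: float) -> float:
    """Assumed affine payment: scaling factor times score plus base payment."""
    return scale * score + base


def requester_utility(
    accuracy: float,
    payments: Sequence[float],
    eta: float,
    value_fn: Callable[[float], float] = lambda a: a,
) -> float:
    """Assumed utility: value of label accuracy minus eta-weighted payments.

    value_fn stands in for the non-decreasing accuracy-to-utility mapping;
    the identity default is only a placeholder.
    """
    return value_fn(accuracy) - eta * sum(payments)
```

Because the scaling factor is shared by all workers, raising it increases every payment at once, which is the trade-off between label quality and cost that RIL has to balance.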
4.1 Bayesian Inference
For simplicity of notation, we omit the superscript in this subsection. The motivation for designing our own Bayesian inference algorithm is as follows. We ran several preliminary experiments using popular inference algorithms (for example, EM dawid1979maximum ; raykar2010learning ; zhang2014spectral and Variational Inference liu2012variational ; chen2015statistical ). Our empirical studies reveal that those methods tend to heavily overestimate the aggregated label accuracy when the quality of labels is low (see Section 6.1 for detailed experimental results and analysis). This leads to biased estimation of the data requester’s utility (as it cannot be observed directly), and this estimated utility is used as the reward signal in RIL, which will be detailed later. Since the reward signal plays the core role in guiding the reinforcement learning process, the heavy bias will severely mislead our mechanism.
To reduce the estimation bias, we develop a Bayesian inference algorithm by introducing soft Dirichlet priors to both the distribution of true labels , where and denote that of label and , respectively, and workers’ PoBCs . Then, we derive the conditional distribution of true labels given collected labels as (see Appendix A) where denotes the beta function, , , , and , where and .
Note that it is generally hard to derive an explicit formula for the posterior distribution of a specific task ’s ground truth from the conditional distribution . We thus resort to Gibbs sampling for the inference. More specifically, according to Bayes’ theorem, we know that the conditional distribution of task ’s ground truth satisfies , where denotes all tasks excluding . Leveraging this, we generate samples of the true label vector following Algorithm 1. At each step of the sampling procedure (lines 6-7), Algorithm 1 first computes and then generates a new sample of to replace the old one in . After traversing all tasks, Algorithm 1 generates a new sample of the true label vector . Repeating this process for times, we get samples, which are recorded in . Here, we write the th sample as . Since Gibbs sampling requires a burn-in process, we discard the first samples and calculate worker ’s score on task and PoBC as
(5)
Similarly, we can obtain the estimates of the true label distribution and then derive the log-ratio of task , . Furthermore, we decide the true label estimate as if and as if . Correspondingly, the label accuracy is estimated as
(6) 
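The sampling procedure described above can be illustrated with a small collapsed Gibbs sampler. This is a sketch under assumptions, not a reproduction of Algorithm 1: it uses a one-coin worker model with a Beta prior on each worker's PoBC and a symmetric Beta prior on the label distribution, and it replaces the ratio of beta functions by the equivalent Beta-Bernoulli predictive probabilities. The function name, the prior values and the plain empirical averaging after burn-in are all hypothetical choices.

```python
import numpy as np


def gibbs_true_labels(labels, n_samples=200, burn_in=50,
                      prior_correct=2.0, prior_wrong=1.0,
                      prior_label=1.0, seed=0):
    """Collapsed Gibbs sampling of binary true labels.

    labels: (n_workers, n_tasks) array with entries in {-1, +1}; 0 = unassigned.
    Returns (P(label = +1) per task, estimated PoBC per worker).
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    n_workers, n_tasks = labels.shape
    assigned = labels != 0
    y = np.where(labels.sum(axis=0) >= 0, 1, -1)  # start from majority vote
    pos_counts = np.zeros(n_tasks)
    correct_counts = np.zeros(n_workers)
    kept = 0
    for s in range(n_samples):
        for j in range(n_tasks):
            others = np.arange(n_tasks) != j
            # Per-worker correct/wrong counts on all tasks except j.
            correct = ((labels == y) & assigned)[:, others].sum(axis=1)
            wrong = ((labels != y) & assigned)[:, others].sum(axis=1)
            log_w = {}
            for k in (1, -1):
                n_k = np.sum(y[others] == k)
                agree = labels[:, j] == k
                terms = np.log(np.where(agree, correct + prior_correct,
                                        wrong + prior_wrong))
                log_w[k] = np.log(n_k + prior_label) + terms[assigned[:, j]].sum()
            diff = np.clip(log_w[-1] - log_w[1], -50.0, 50.0)
            p_pos = 1.0 / (1.0 + np.exp(diff))
            y[j] = 1 if rng.random() < p_pos else -1
        if s >= burn_in:  # discard burn-in samples
            kept += 1
            pos_counts += (y == 1)
            correct_counts += ((labels == y) & assigned).sum(axis=1)
    n_assigned = np.maximum(assigned.sum(axis=1), 1)
    return pos_counts / kept, correct_counts / (kept * n_assigned)
```

The asymmetric prior (`prior_correct` larger than `prior_wrong`) encodes a mild belief that workers are more often right than wrong, which breaks the symmetry between a labeling and its mirror image. The counts are recomputed from scratch for clarity; an efficient implementation would update them incrementally.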
In our Bayesian inference algorithm, workers’ scores, PoBCs and the true label distribution are all estimated by comparing the true label samples with the collected labels. Thus, to prove the convergence of our algorithm, we need to bound the ratio of wrong samples. We introduce and to denote the number of tasks of which the true label sample in Eqn. (5) is correct () and wrong () in the th sample, respectively. Formally, we have:
Lemma 1.
Let , and . When ,
(7) 
where , and .
The proof is in Appendix B. Our main idea is to introduce a set of counts for the collected labels and then calculate and based on the distribution of these counts. Using Lemma 1, the convergence of our Bayesian inference algorithm is stated as follows:
Theorem 1 (Convergence).
When and , if most workers report truthfully (i.e. ), with probability at least , holds for any worker ’s PoBC estimate as well as the true label distribution estimate ().
The convergence of and can naturally lead to the convergence of and because the latter estimates are fully computed based on the former ones. All these convergence guarantees enable us to use the estimates computed by Bayesian inference to construct the state and reward signal in our reinforcement learning algorithm RIL.
4.2 Reinforcement Incentive Learning
In this subsection, we formally introduce our reinforcement incentive learning (RIL) algorithm, which adjusts the scaling factor to maximize the data requester’s utility accumulated in the long run. To fully understand the technical background, readers are expected to be familiar with Q-values and function approximation. Readers less familiar with these concepts are referred to Appendix D, where we provide the necessary background. With some transformation, our problem can be modeled as a Markov Decision Process. To be more specific, our mechanism is the agent and it interacts with workers (i.e. the environment); the scaling factor is the action; the utility of the data requester defined in Eqn. (3) is the reward. Workers’ reporting strategies are the state. After receiving payments, workers may change their strategies to, for example, increase their utilities at the next step. How workers change their strategies forms the state transition kernel.
On the other hand, the reward defined in Eqn. (3) cannot be directly used because the true accuracy cannot be observed. Thus, we use the estimated accuracy calculated by Eqn. (6) instead to approximate as in Eqn. (8). Furthermore, to achieve better generalization across different states, it is a common approach to learn a feature-based state representation Mnih15 ; Liang16 . Recall that the data requester’s implicit utility at time only depends on the aggregated PoBC averaged across all workers. This observation already points to a representation design with good generalization, namely . Further recall that, when deciding the current scaling factor , the data requester does not observe the latest workers’ PoBCs and thus cannot directly estimate the current . Due to this one-step delay, we have to build our state representation using the previous observation. Since most workers would only change their internal states after receiving a new incentive, there exists some imperfect mapping function . Utilizing this implicit function, we introduce the augmented state representation in RIL as in Eqn. (8).
(8) 
Since neither nor can be perfectly inferred, it would not be a surprise to observe some noise that cannot be directly learned in our Q-function. For most crowdsourcing problems the number of tasks is large, so we can leverage the central limit theorem to justify our modeling of the noise using a Gaussian process. To be more specific, we calculate the temporal difference (TD) error as
(9) 
where the noise follows a Gaussian process, and denotes the current policy. By doing so, we gain two benefits. First, the approximation greatly simplifies the derivation of the update equation for the Q-function. Second, as shown in our empirical results later, this kind of approximation is robust against different worker models. Besides, following gasic2014gaussian we approximate the Q-function as , where also follows a Gaussian process.
Under the Gaussian process approximation, all the observed rewards and the corresponding values up to the current step form a system of equations, and it can be written as , where , and denote the collection of rewards, values, and residuals. Following Gaussian process’s assumption for residuals, , where . The matrix satisfies and for . Then, by using the online Gaussian process regression algorithm engel2005reinforcement , we effectively learn the Qfunction as
(10) 
where and . Here, we use to denote the Gaussian kernel. Finally, we employ the classic ε-greedy method to decide based on the learned Q-function. To summarize, we provide a formal description of RIL in Algorithm 2. Note that, when updating , and in Line 6, we employ the sparse approximation proposed in gasic2014gaussian to discard some data so that the size of these matrices does not grow without bound.
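The action-selection step can be sketched as follows. This is a minimal illustration, assuming an ε-greedy rule driven by the exploration rate mentioned in Section 6.2 and a finite set of candidate scaling factors; it does not implement the Gaussian process regression that learns the Q-function. The names `gaussian_kernel`, `epsilon_greedy`, `length_scale` and `q_value` are hypothetical, and `q_value` stands in for whatever learned Q-function is available.

```python
import math
import random
from typing import Callable, Sequence


def gaussian_kernel(x: Sequence[float], y: Sequence[float],
                    length_scale: float = 1.0) -> float:
    """Squared-exponential (Gaussian) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * length_scale ** 2))


def epsilon_greedy(q_value: Callable[[Sequence[float], float], float],
                   state: Sequence[float],
                   actions: Sequence[float],
                   epsilon: float,
                   rng: random.Random) -> float:
    """Pick a scaling factor: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.choice(list(actions))
    return max(actions, key=lambda a: q_value(state, a))
```

Setting `epsilon` to zero recovers the purely greedy policy, which matches how the learned policy is evaluated once exploration is turned off.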
5 Theoretical Analysis on Incentive Compatibility
In this section, we prove the incentive compatibility of our Bayesian inference and reinforcement learning algorithms. Our main results are as follows:
Theorem 2 (One Step IC).
At any time step , when , reporting truthfully and exerting high efforts is the utility-maximizing strategy for any worker at equilibrium (if other workers all follow this strategy).
In Appendix E, we prove that when , if , any worker ’s utility-maximizing strategy would be reporting truthfully and exerting high efforts. Since Theorem 1 has provided the convergence guarantee, we can conclude Theorem 2. ∎
Theorem 3 (Long Term IC).
Suppose the conditions in Theorem 2 are satisfied and the learned function approaches the real . When the following equation holds for ,
(11) 
always reporting truthfully and exerting high efforts is the utility-maximizing strategy for any worker in the long term if other workers all follow this strategy. Here, denotes the minimal gap between two available values of the scaling factor.
In order to induce RIL to change actions, worker must make RIL learn a wrong function. Thus, the main idea of our proof is to derive upper bounds on the effects of worker ’s reports on the function. Besides, Theorem 3 points out that, to design reinforcement learning algorithms that are robust against manipulation by strategic agents, we should leave a certain gap between actions. This observation may be of independent interest to reinforcement learning researchers.
6 Empirical Experiments
In this section, we empirically investigate the competitiveness of our solution. To be more specific, we first show that our proposed Bayesian inference algorithm produces more accurate estimates of the aggregated label accuracy than existing inference algorithms. Then, we demonstrate that, aided by Bayesian inference, our RIL algorithm consistently manages to learn a good incentive policy under various worker models. Lastly, we show, as a bonus benefit of our mechanism, that leveraging Bayesian inference to fully exploit the information contained in the collected labels leads to more robust and lower-variance payments at each step.
6.1 Empirical Analysis on Bayesian Inference
The aggregated label accuracy estimated by our Bayesian inference algorithm serves as a major component of the state representation and reward function of RIL, and thus critically affects the performance of our mechanism. We therefore first investigate the bias of our Bayesian inference algorithm. In Figure 1(a), we compare our Bayesian inference algorithm with two popular inference algorithms in crowdsourcing, namely the EM estimator raykar2010learning and the variational inference estimator liu2012variational . Here, we employ the well-known RTE dataset, where workers need to check whether a hypothesis sentence can be inferred from the provided sentence snow2008cheap . In order to simulate strategic behaviors of workers, we mix these data with random noise by replacing a part of the real-world labels with uniformly generated ones (low-quality labels).
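The noise-mixing step used to simulate low-quality labels can be sketched as below. The sketch assumes labels are stored as a worker-by-task array with entries in {-1, +1} and 0 for unassigned pairs; this encoding, the function name `mix_with_uniform_noise` and the parameter `noise_fraction` are illustrative assumptions rather than details given in the text.

```python
import numpy as np


def mix_with_uniform_noise(labels, noise_fraction, seed=0):
    """Replace a fraction of the collected labels with uniform random labels.

    labels: array with entries in {-1, +1}; 0 marks an unassigned pair and
    is never touched. Returns a noisy copy; the input is left unchanged.
    """
    rng = np.random.default_rng(seed)
    noisy = np.asarray(labels).copy()
    observed = np.flatnonzero(noisy != 0)  # flat indices of collected labels
    n_replace = int(round(noise_fraction * observed.size))
    chosen = rng.choice(observed, size=n_replace, replace=False)
    flat = noisy.reshape(-1)  # view on the copy
    flat[chosen] = rng.choice([-1, 1], size=n_replace)
    return noisy
```

Note that a replaced label coincides with the original one about half of the time, so the fraction of labels that actually change is roughly half of `noise_fraction`.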
From the figure, we conclude that compared with EM and variational inference, our Bayesian inference algorithm can significantly lower the bias of the estimates of the aggregated label accuracy. In fact, we cannot use the estimates from the EM and variational inference as alternatives for the reward signal because the biases of their estimates even reach while the range of the label accuracy is only between . This set of experiments justifies our motivation to develop our own inference algorithm and reinforces our claim that our inference algorithm could provide fundamentals for the further development of potential learning algorithms for crowdsourcing.



Method             Rational        QR              MWU
Fixed Optimal      27.584 (.253)   21.004 (.012)   11.723 (.514)
Heuristic Optimal  27.643 (.174)   21.006 (.001)   12.304 (.515)
Adaptive Optimal   27.835 (.209)   21.314 (.011)   17.511 (.427)
RIL                27.184 (.336)   21.016 (.018)   15.726 (.416)
Table 1: Performance comparison under three worker models. The data requester’s cumulative utility is normalized over the number of tasks. Standard deviations are reported in parentheses.
6.2 Empirical Analysis on RIL
We move on to investigate whether RIL consistently learns a good policy, which maximizes the data requester’s cumulative utility . For all the experiments in this subsection, we set , , , , , the set of the scaling factor , the exploration rate for RIL and , for the utility function (Eqn. (3)) and the number of time steps for an episode as . We report the averaged results over 5 runs to reduce the effect of outliers. To demonstrate our algorithm’s general applicability, we test it under three different worker models, each representing a popular family of human behavioral models. We describe them briefly as follows; the detailed version is deferred to Appendix H. (i) Rational workers always take the utility-maximizing strategies. (ii) QR workers mckelvey1995quantal follow strategies drawn from a predetermined utility-dependent distribution. This model has been used to study agents with bounded rationality. (iii) MWU workers littlestone1994weighted update their strategies according to the celebrated multiplicative weights update algorithm. This model has been used to study adaptive learning agents.
Our first set of experiments continues the last subsection. To be more specific, we first focus on the estimation bias of the data requester’s cumulative utility . This value is used as the reward in RIL and is calculated from the estimates of the aggregated label accuracy. This set of experiments aims to investigate whether our RIL module successfully leverages the label accuracy estimates and picks up the right reward signal. As Figure 1(b) shows, the estimates deviate from the real values only slightly after a few episodes of learning, regardless of which worker model the experiments run with. The results further demonstrate that our RIL module observes reliable rewards. The next set of experiments is about how quickly RIL learns. As Figure 1(c) shows, under all three worker models, RIL manages to pick up and stick to a promising policy in less than 100 episodes. This observation also demonstrates the robustness of RIL under different environments.
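The QR and MWU worker models described above can be illustrated with their textbook forms. This is a sketch under assumptions, since the exact definitions are deferred to Appendix H: the quantal response is written as a softmax over strategy utilities with a hypothetical `rationality` parameter, and the multiplicative weights update uses the linear form with a hypothetical `learning_rate`.

```python
import math
from typing import List, Sequence


def quantal_response(utilities: Sequence[float],
                     rationality: float = 1.0) -> List[float]:
    """Softmax over utilities: better strategies are chosen more often."""
    top = max(utilities)  # subtract the maximum for numerical stability
    weights = [math.exp(rationality * (u - top)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def mwu_step(probs: Sequence[float], utilities: Sequence[float],
             learning_rate: float = 0.1) -> List[float]:
    """One multiplicative weights update of a mixed strategy.

    Assumes 1 + learning_rate * u > 0 for every utility u, so that all
    weights stay positive.
    """
    updated = [p * (1.0 + learning_rate * u) for p, u in zip(probs, utilities)]
    total = sum(updated)
    return [w / total for w in updated]
```

A large `rationality` value makes the QR worker nearly rational, while a value of zero yields uniformly random strategies. Repeated calls to `mwu_step` shift probability mass gradually toward higher-utility strategies, which is why the incentive history matters for MWU workers.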
Our last set of experiments in this subsection aims to evaluate the competitiveness of the policy learned by RIL. In Table 1, we use the policy learned after 500 episodes with the exploration rate turned off (i.e. ) and compare it with three benchmarks that we construct ourselves. To create the first one, Fixed Optimal, we try all 4 possible fixed values for the scaling factor and report the highest cumulative reward realized by any of them. To create the second one, Heuristic Optimal, we divide the value region of into five regions: , , , and . For each region, we select a fixed value for the scaling factor . We traverse all possible combinations to decide the optimal heuristic strategy. To create the third one, Adaptive Optimal, we change the scaling factor every steps and report the highest cumulative reward obtained by traversing all possible configurations. This benchmark is infeasible to reproduce in real-world practice once the number of steps becomes large, yet it is very close to the global optimum in the sequential setting. As Table 1 demonstrates, the three benchmarks and RIL all achieve similar performance under rational and QR workers. This is because these two kinds of workers have a fixed pattern in responding to incentives, and thus the optimal policy is a fixed scaling factor throughout the whole episode. In contrast, MWU workers learn utility-maximizing strategies gradually, and the learning process is affected by the incentives. Under this worker environment, RIL manages to achieve an average utility score of 15.726, which is a significant improvement over Fixed Optimal and Heuristic Optimal (which achieve 11.723 and 12.304, respectively), considering that the unrealistic global optimum is only around 17.511 (Table 1). Up to this point, with three sets of experiments, we have demonstrated the competitiveness of RIL and its robustness under different worker environments.
Note that, when constructing the benchmarks, we also conduct experiments on DG13, the state-of-the-art peer prediction mechanism for binary labels dasgupta2013crowdsourced , and reach the same conclusion. For example, when DG13 and MWU workers are tested for Fixed Optimal and Heuristic Optimal, the cumulative utilities are and , respectively, which also shows a large gap with RIL.
6.3 Empirical Analysis on One Step Payments
In this subsection, we compare the one step payments provided by our mechanism with the payments calculated by DG13, the state-of-the-art peer prediction mechanism for binary labels dasgupta2013crowdsourced . We fix the scaling factor and set , , , and . To set up the experiments, we generate task ’s true label following its distribution (to be specified) and worker ’s label for task based on ’s PoBC and . In Figure 2(a), we let all workers excluding report truthfully and exert high efforts (i.e. ), and increase from to . In Figure 2(b), we let , and increase other workers’ PoBCs from to . As both figures reveal, in our mechanism, the payment for worker depends almost exclusively on his/her own strategy. In contrast, in DG13, the payments are clearly affected by the distribution of true labels and the strategies of other workers. In other words, our Bayesian inference is more robust to different environments. Furthermore, in Figure 2(c), we present the standard deviation of the payment to worker . We let , and increase from to . As shown in the figure, our method manages to achieve a noticeably smaller standard deviation compared to DG13. Note that, in Figure 2(b), we implicitly assume that most workers will at least not adversarially report false labels, an assumption widely adopted in previous studies liu2012variational . For workers’ collusion attacks, we also provide some defenses in Appendix F.



7 Conclusion
In this paper, we build an inference-aided reinforcement mechanism leveraging Bayesian inference and reinforcement learning techniques to learn the optimal policy to incentivize high-quality labels from crowdsourcing. Our mechanism is proved to be incentive compatible. Empirically, we show that our Bayesian inference algorithm can help improve the robustness and lower the variance of payments, which are favorable properties in practice. Meanwhile, our reinforcement incentive learning (RIL) algorithm ensures that our mechanism performs consistently well under different worker models.
Acknowledgments
This work was conducted within the Rolls-Royce@NTU Corporate Lab with support from the National Research Foundation (NRF) Singapore under the Corp Lab@University Scheme. Yitao is partially supported by NSF grants #IIS-1657613, #IIS-1633857 and DARPA XAI grant #N66001-17-2-4032. The authors also thank Anxiang Zeng from Alibaba Group for valuable discussions.
References

[1]
R. Arratia and L. Gordon.
Tutorial on large deviations for the binomial distribution.
Bulletin of Mathematical Biology, 51(1):125–131, Jan 1989.  [2] Erick Chastain, Adi Livnat, Christos Papadimitriou, and Umesh Vazirani. Algorithms, games, and evolution. PNAS, 111(29):10620–10623, 2014.
 [3] Xi Chen, Qihang Lin, and Dengyong Zhou. Statistical decision making for optimal budget allocation in crowd labeling. Journal of Machine Learning Research, 16:1–46, 2015.
 [4] Anirban Dasgupta and Arpita Ghosh. Crowdsourced judgement elicitation with endogenous proficiency. In Proc. of WWW, 2013.
 [5] Alexander Philip Dawid and Allan M Skene. Maximum likelihood estimation of observer errorrates using the em algorithm. Applied statistics, pages 20–28, 1979.
 [6] Djellel Eddine Difallah, Michele Catasta, Gianluca Demartini, Panagiotis G Ipeirotis, and Philippe CudréMauroux. The dynamics of microtask crowdsourcing: The case of amazon mturk. In Proc. of WWW, 2015.
 [7] Yaakov Engel, Shie Mannor, and Ron Meir. Reinforcement learning with gaussian processes. In Proc. of ICML, 2005.
 [8] Milica Gasic and Steve Young. Gaussian processes for pomdpbased dialogue manager optimization. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(1):28–40, 2014.
 [9] Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
 [10] Geoffrey J. Gordon. Reinforcement Learning with Function Approximation Converges to a Region. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), pages 1040–1046, 2000.
 [11] Jeff Howe. The rise of crowdsourcing. Wired Magazine, 14(6), 06 2006.

 [12] Radu Jurca, Boi Faltings, et al. Mechanisms for making crowds truthful. Journal of Artificial Intelligence Research, 34(1):209, 2009.
 [13] Joel Z. Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. In Proc. of AAMAS, 2017.
 [14] Yitao Liang, Marlos C. Machado, Erik Talvitie, and Michael Bowling. State of the art control of Atari games using shallow reinforcement learning. In Proc. of AAMAS, 2016.
 [15] Nick Littlestone and Manfred K Warmuth. The weighted majority algorithm. Information and computation, 108(2):212–261, 1994.
 [16] Qiang Liu, Jian Peng, and Alexander T Ihler. Variational inference for crowdsourcing. In Proc. of NIPS, 2012.
 [17] Yang Liu and Yiling Chen. Machine-learning aided peer prediction. In Proc. of ACM EC, 2017.
 [18] Yang Liu and Yiling Chen. Sequential peer prediction: Learning to elicit effort using posted prices. In Proc. of AAAI, pages 607–613, 2017.
 [19] Richard D McKelvey and Thomas R Palfrey. Quantal response equilibria for normal form games. Games and economic behavior, 10(1):6–38, 1995.
 [20] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 02 2015.
 [21] Frank W. J. Olver. NIST Handbook of Mathematical Functions. Cambridge University Press, 2010.
 [22] Dražen Prelec. A Bayesian truth serum for subjective data. Science, 306(5695):462–466, 2004.
 [23] Goran Radanovic and Boi Faltings. A robust Bayesian truth serum for non-binary signals. In Proc. of AAAI, 2013.
 [24] Vikas C Raykar, Shipeng Yu, Linda H Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. Journal of Machine Learning Research, 11(Apr):1297–1322, 2010.
 [25] Vidyasagar Sadhu, Gabriel Salles-Loustau, Dario Pompili, Saman A. Zonouz, and Vincent Sritapan. Argus: Smartphone-enabled human cooperation via multi-agent reinforcement learning for disaster situational awareness. In Proc. of ICAC, 2016.
 [26] Victor S Sheng, Foster Provost, and Panagiotis G Ipeirotis. Get another label? improving data quality and data mining using multiple, noisy labelers. In Proc. of SIGKDD, 2008.
 [27] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550:354–359, 10 2017.
 [28] Edwin D Simpson, Matteo Venanzi, Steven Reece, Pushmeet Kohli, John Guiver, Stephen J Roberts, and Nicholas R Jennings. Language understanding in the wild: Combining crowdsourcing and machine learning. In Proc. of WWW, 2015.
 [29] Aleksandrs Slivkins and Jennifer Wortman Vaughan. Online decision making in crowdsourcing markets: Theoretical challenges. ACM SIGecom Exchanges, 12(2):4–23, 2014.
 [30] Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Y Ng. Cheap and fast—but is it good?: evaluating nonexpert annotations for natural language tasks. In Proc. of EMNLP, 2008.
 [31] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
 [32] Csaba Szepesvári. Algorithms for Reinforcement Learning. Synthesis lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool, 2010.
 [33] Gerald Tesauro. Temporal Difference Learning and TD-Gammon. Communications of the ACM, 38(3):58–68, March 1995.
 [34] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proc. of AAAI, 2016.
 [35] Yue Wang and Fumin Zhang. Trends in Control and Decision-Making for Human-Robot Collaboration Systems. Springer Publishing Company, Incorporated, 1st edition, 2017.
 [36] Christopher J. C. H. Watkins and Peter Dayan. Technical Note: Q-Learning. Machine Learning, 8(3-4), May 1992.
 [37] Jens Witkowski and David C Parkes. Peer prediction without a common prior. In Proc. of ACM EC, 2012.
 [38] Chao Yu, Minjie Zhang, and Fenghui Ren. Emotional multiagent reinforcement learning in social dilemmas. In PRIMA, 2013.
 [39] Yuchen Zhang, Xi Chen, Denny Zhou, and Michael I Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In Proc. of NIPS, 2014.
 [40] Yudian Zheng, Guoliang Li, Yuanbing Li, Caihua Shan, and Reynold Cheng. Truth inference in crowdsourcing: is the problem solved? Proc. of the VLDB Endowment, 10(5):541–552, 2017.
 [41] Dengyong Zhou, Qiang Liu, John Platt, and Christopher Meek. Aggregating ordinal labels from crowds by minimax conditional entropy. In Proc. of ICML, 2014.
 [42] Denny Zhou, Sumit Basu, Yi Mao, and John C Platt. Learning from the wisdom of crowds by minimax entropy. In Proc. of NIPS, 2012.
Appendix
A Derivation of Posterior Distribution
It is not hard to derive the joint distribution of the collected labels
and the true labels as
(12) 
where and . and denote the distribution of true label and , respectively. Besides, and . Then, the joint distribution of , , and
(13) 
where denotes the beta function, and
In this case, we can conduct marginalization via integrating the joint distribution over and as
(14) 
where and . By Bayes’ theorem, we obtain
(15) 
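The marginalization step above relies on the beta function to integrate out unknown probabilities. As a minimal sketch of that mechanism only, the following assumes a single worker whose unknown accuracy `p` has a Beta(`a`, `b`) prior and who gives `k` correct labels out of `n`. The symbols in Eqns. (12) to (15) did not survive in this copy, so every name here (`marginal_likelihood`, `a`, `b`, `k`, `n`) is hypothetical and the model is a generic Beta-Bernoulli one, not necessarily the exact model of the paper.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Logarithm of the beta function B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def marginal_likelihood(k: int, n: int, a: float = 1.0, b: float = 1.0) -> float:
    """Probability of one particular label sequence with k correct labels
    out of n, after integrating the accuracy p ~ Beta(a, b) out in closed
    form: B(a + k, b + n - k) / B(a, b).
    """
    return math.exp(log_beta(a + k, b + n - k) - log_beta(a, b))


def marginal_by_quadrature(k: int, n: int, a: float = 1.0, b: float = 1.0,
                           steps: int = 20000) -> float:
    """Same quantity obtained by numerically integrating
    p^k (1 - p)^(n - k) * Beta(p; a, b) over p in (0, 1) with the
    midpoint rule. Used only as a sanity check of the closed form.
    """
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        p = (i + 0.5) * h
        total += p ** (a + k - 1) * (1.0 - p) ** (b + n - k - 1)
    return total * h / math.exp(log_beta(a, b))


if __name__ == "__main__":
    # Hypothetical example: prior Beta(2, 1), 3 correct labels out of 5.
    print(marginal_likelihood(3, 5, 2.0, 1.0))
    print(marginal_by_quadrature(3, 5, 2.0, 1.0))
```

The closed form and the numerical integral should agree closely, which is the property that makes the integration over the unknown probabilities tractable.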
B Proof for Lemma 1
B.1 Basic Lemmas
We first present some lemmas that are used later in the proof.
Lemma 2.
If , holds for any , where is the binomial distribution.
Lemma 3.
For given , if , we can have
Lemma 4.
.
Lemma 5.
.
Lemma 6.
.
Lemma 7.
.
Lemma 8.
.
Lemma 9.
If , we can have
where .
To prove the lemmas above, we first define
(20) 
Then, Lemma 4 can be obtained by expanding . Lemma 5 can be proved as follows
(21) 
Lemma 6 can be obtained as follows
(22) 
For Lemma 7, we can have
(23) 
Thus, we can have
(24) 
which concludes Lemma 7. Then, Lemma 8 can be obtained by considering Eqn. (25).
(25) 
For Lemma 9, we can have
(26) 
where . Let , we can have
(27) 
Since , and . Considering Hoeffding’s inequality, we can get
(28) 
which concludes the first inequality in Lemma 9. Similarly, for the second inequality, we can have
(29) 
where . Suppose , we can have
(30) 
Considering Hoeffding’s inequality, we can also get
(31) 
which concludes the second inequality in Lemma 9. ∎
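For reference, the standard one-sided form of Hoeffding's inequality invoked in both steps above can be stated as follows. The symbols $X_i$, $a_i$, $b_i$, $S_n$ and $t$ are generic and are not the notation of this paper, whose own symbols are missing from this copy.

```latex
% Hoeffding's inequality for independent bounded random variables.
% Let X_1, ..., X_n be independent with a_i <= X_i <= b_i almost surely,
% and let S_n = X_1 + ... + X_n. Then, for any t > 0,
\[
  \Pr\bigl( S_n - \mathbb{E}[S_n] \ge t \bigr)
  \;\le\;
  \exp\!\left( - \frac{2 t^{2}}{\sum_{i=1}^{n} (b_i - a_i)^{2}} \right),
\]
% and the same bound holds for the lower tail:
\[
  \Pr\bigl( S_n - \mathbb{E}[S_n] \le -t \bigr)
  \;\le\;
  \exp\!\left( - \frac{2 t^{2}}{\sum_{i=1}^{n} (b_i - a_i)^{2}} \right).
\]
% For a sum of n independent Bernoulli variables (a_i = 0, b_i = 1),
% the exponent simplifies to -2 t^2 / n.
```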
Lemma 10.
For any , we can have
First, we know . Let . Then, we have and . Thus, and we conclude Lemma 10 by substituting this inequality into the equality. ∎
Lemma 11.
is a concave function when .
, where . when . Thus, is monotonically decreasing when , which concludes Lemma 11. ∎
Lemma 12.
For ,
satisfies
When , we can have
(32) 
When , we can have
(33) 
∎