1 Introduction and related work
Truth inference algorithms [zheng2017truth] have been heavily explored in crowdsourcing to aggregate and make sense of workers’ contributions. Most state-of-the-art algorithms (e.g., majority voting, expectation maximization [dawid1979maximum], message passing [karger2011iterative]) are computed ex post, i.e. all contributions are first collected and then aggregated, usually by means of iterative algorithms that infer the truth and estimate worker quality until convergence; this requires setting a priori the number of repeated user labels on each task, possibly collecting redundant information.
Variations of truth inference algorithms include scheduling approaches to optimize task assignment to workers, especially when the micropayment budget is an issue [karger2014budget, han2017budgeted], and assessment of workers’ skills to improve answer quality, especially when tasks are very varied or have diverse levels of difficulty [allahbakhsh2013quality, difallah2013pick, yang2016modeling]. Related investigation exists on the evaluation of repeated labeling strategies [sheng2008get] to understand when it is more convenient to stop collecting user contributions; in that work, strong assumptions are made with respect to user accuracy and task difficulty, which are considered constant across examples; however, those premises do not always hold in practical settings.
A Game with a Purpose or GWAP [von2008designing] is a well-known Human Computation approach [law2011human] to encourage users to execute tasks through entertainment. We can see GWAPs as a special crowdsourcing mechanism in which workers are actually players, rewarded with fun instead of micropayments by leveraging intrinsic motivation [ryan2000intrinsic]. Players are attracted and motivated by the game itself and often they are not even aware that their actions in the gameplay are exploited to produce the “collateral effect” of solving tasks.
Aggregating users’ contribution is a key issue also in Human Computation systems like GWAPs. Originally, aggregation was based on simple agreement: in the ESP game [von2004labeling], the very first GWAP ever released, players typed in textual labels to tag images and two agreeing users were enough to consider the label “true”. Afterwards, “ground truth” tasks, i.e. problems with known solution, were introduced to check the quality of contributions to cope with random answers or malicious players [quinn2011human, ul2013effects].
In most crowdsourcing platforms, like Amazon Mechanical Turk (cf. https://www.mturk.com/) or Figure Eight (cf. https://www.figureeight.com/), tasks are assigned in batches or Human Intelligence Tasks (HITs) and workers are required to submit their answers within a specific timeframe in order to be eligible for payment [brabham2013crowdsourcing]. In contrast, in GWAPs contributions are collected as soon as a user decides to play the game: the flow of incoming answers is therefore subject to the “appreciation” of the game by players and a long-tail effect is very often recorded, with a few players playing a lot of rounds and the majority of participants being active for a few minutes only. Therefore, it is of utmost importance to exploit every single player’s contribution and to infer truth in an incremental way, assigning the same task to the minimum sufficient number of different players.
The remainder is organized as follows: we give preliminaries and requirements for problem formulation in Section 2; we describe our approach in Section 3 with the algorithm and its qualitative assessment; Section 4 presents a quantitative evaluation in comparison to baselines; Section 5 concludes the paper.
2 Problem formulation
In this section we formulate the problem, by giving some definitions and listing the requirements that the truth inference algorithm should fulfill. For simplicity of explanation, we specifically consider the case of multinomial classification tasks (with a predefined set of labels), but the approach can be easily extended to open labelling with no loss of generality.
2.1 Definitions
We consider a Game with a Purpose aimed at solving a set of tasks. Each task is a labelling task, in which a label is assigned from a set of admissible values.
The GWAP is played by a set of users. In each game round, a player is assigned a subset of the tasks to be solved. Given a set of “ground truth” tasks for which the solution is known, in each game round the player is also given a set of control tasks. The answers to control tasks are used to estimate the reliability of the player, which is useful to “weight” contributions on unsolved tasks during truth inference.
Player contributions are collected in a matrix, initialized with null or zero values and filled with labels from the set of admissible values whenever a player completes a task. The goal of the GWAP is not to completely fill up this matrix; on the contrary, it should remain sparse, with the minimum possible number of player contributions (i.e., non-zero values) required to infer the “true” labels for the tasks.
Finally, truth inference is a function applied on players’ answers and reliability values to infer the result set, with one label for each of the tasks. The result set is computed by aggregation of users’ contributions and is an estimate of the “true” unknown labelling of the tasks. Truth inference is incremental if, at each new contribution from a GWAP player, a new estimation of the result set is computed.
To understand if a task can be considered completed, truth inference computes a set of scores representing the confidence values on the association between each task and each possible labelling value. In other words, the aggregation algorithm builds and updates a matrix of estimation scores $\sigma_{ij}$, with one row per task $i$ and one column per admissible label $j$. As in record linkage literature [fellegi1969theory], those scores start from an initial value and are incrementally increased according to user contributions. Each task is solved when the maximum of its scores (i.e. the maximum of the $i$-th row of the score matrix) exceeds some threshold $t$; the “completion” condition can therefore be formulated as follows:
$$\max_{j} \sigma_{ij} > t \qquad (1)$$
Truth inference algorithms differ in their specific approach to updating the matrix of scores when aggregating user contributions.
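The completion check can be illustrated with a minimal Python sketch. The names `is_completed`, `task_scores` and `threshold`, the example labels and the strict comparison are assumptions made here for illustration; they are not taken from Algorithm 1.

```python
from typing import Dict


def is_completed(task_scores: Dict[str, float], threshold: float) -> bool:
    """Return True when the best label score of a task exceeds the threshold.

    The strict comparison is an assumption of this sketch.
    """
    return max(task_scores.values(), default=0.0) > threshold


# Hypothetical scores of one task over three admissible labels.
task_scores = {"forest": 0.9, "urban": 0.3, "water": 0.0}
still_open = is_completed(task_scores, threshold=1.0)  # best score 0.9: not solved yet

task_scores["forest"] += 0.4  # one more agreeing contribution arrives
now_solved = is_completed(task_scores, threshold=1.0)  # best score about 1.3: solved
```

Only the row of the task that received a new contribution needs to be checked, which is what makes the condition cheap to evaluate after every single answer.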
2.2 Requirements
As mentioned in the introduction, in the case of Games with a Purpose, some specific requirements emerge that motivate the need for a new truth inference approach:
[R1] Dynamic estimate of labeling quality, by computing player reliability on control tasks: quality estimate is a usual issue in crowdsourcing, but microtask workers may solve all the assigned tasks at once; we would like to take into account that GWAP players can play the game in different moments with different levels of attention, hence their quality/reliability can change over time and cannot be computed once and for all.
[R2] Coping with varying difficulty of labeling task, including possibly multiple classification or even uncertain classification tasks, which means that we cannot make any a priori hypothesis on the number of redundant labelling actions required to solve each task.
[R3] Incremental computation of truth inference: as introduced in Section 1, in GWAPs we would like to aggregate contributions as soon as they are available, because there is no predefined timeframe for players’ input.
[R4] Dynamic minimization of the number of required repeated labeling, to avoid useless redundancy: if a task is “easy” we would like to ask fewer players to solve it, while if a task is “hard” we would like the task to remain longer in the game to be “played”.
3 Approach description
To address the above requirements, we define the framework illustrated in Figure 1. Each time a player starts a game round, we assign a set of tasks to be solved, some of which are control tasks. We collect the answers from the player and we compute his/her reliability. Then, for each unsolved task, we perform a step of truth inference, and we incrementally compute a new estimation of the task solution. If the new estimation is “good enough” (cf. exit condition of Equation 1), the task is considered solved and removed from the game and its result returned. Otherwise, the task is kept in the game and assigned to the next user/player.
3.1 Algorithm
The approach outlined above is explained in detail in the following Algorithm 1. Each time a player starts a game round (line 2), he/she is assigned a set of tasks to be solved.
The player provides answers to each task without being able to distinguish between unsolved tasks and control tasks (cf. lines 6 and 14). The answers on control tasks are used to compute the player’s reliability, which is a function of the number of mistakes (lines 5-11); reliability is computed for each game round. There are of course different ways to realize the ComputeUserReliability function of line 11: the simplest way is to use the percentage of correct labels in control tasks, i.e. the number of correct answers divided by the number of control tasks in the round. In other cases, it may be safer to strongly penalize players who submit random answers; in the games that we employ in our evaluation (cf. Section 4), to have a conservative estimation, we adopted the following formula:
(2) 
where the parameter of the formula is set (for example) so that the reliability almost halves with 1 mistake and then quickly decreases with further errors.
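The two behaviours described above can be sketched in Python as follows. The first function is the simple fraction of correct control answers mentioned in the text. The second is only a hypothetical stand-in with the stated property (almost halving at the first mistake, then decreasing quickly): the exponential form and the parameter name `alpha` are assumptions of this sketch and should not be read as Equation 2.

```python
import math


def reliability_fraction(mistakes: int, n_control: int) -> float:
    """Simplest option: share of control tasks answered correctly in the round."""
    if n_control <= 0:
        raise ValueError("at least one control task is needed")
    return (n_control - mistakes) / n_control


def reliability_exponential(mistakes: int, alpha: float = 0.7) -> float:
    """Hypothetical conservative variant: exponential decay in the number of mistakes.

    With alpha = 0.7 (an illustrative value) the reliability drops from 1.0 to
    roughly 0.50 after one mistake and to roughly 0.25 after two.
    """
    return math.exp(-alpha * mistakes)


# One mistake out of four control tasks under the two options.
simple = reliability_fraction(mistakes=1, n_control=4)  # 0.75
conservative = reliability_exponential(mistakes=1)      # about 0.50
```

The comparison shows why a conservative function can be preferable: with the plain fraction, a player who misses one control task out of four still contributes with three quarters of the full weight.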
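On the other hand, the answers on unsolved tasks are weighted with the reliability value and used to update the estimation scores (lines 14-15); for each task $i$ and for each possible label $j$, the UpdateSolutionEstimate function is implemented as follows:
$$\sigma_{ij} \leftarrow \begin{cases} \sigma_{ij} + \rho \, \delta & \text{if label } j \text{ is the label contributed by the user} \\ \sigma_{ij} & \text{otherwise} \end{cases} \qquad (3)$$
where $\sigma_{ij}$ is the estimation score of task $i$ for label $j$, $\rho$ is the reliability of the user who contributed the label and $\delta$ is an increment that depends on the minimum redundancy required for the task.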
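A minimal Python sketch of this update step is given below. It assumes that the weighting is a plain product between the player's reliability and the increment; the function name mirrors UpdateSolutionEstimate, while the parameter names, labels and numeric values are illustrative.

```python
from typing import Dict


def update_solution_estimate(scores: Dict[str, float], answer: str,
                             reliability: float, increment: float) -> None:
    """Increase only the score of the label chosen by the player.

    The increment is weighted by the player's reliability (assumed to be a
    simple product); the scores of all other labels are left untouched.
    """
    scores[answer] = scores.get(answer, 0.0) + reliability * increment


# Hypothetical scores of one task, all starting from zero.
scores = {"forest": 0.0, "urban": 0.0, "water": 0.0}
update_solution_estimate(scores, "forest", reliability=1.0, increment=0.5)
update_solution_estimate(scores, "urban", reliability=0.5, increment=0.5)
# scores is now {"forest": 0.5, "urban": 0.25, "water": 0.0}
```

A fully reliable player moves the score by the whole increment, while a player who made mistakes on the control tasks moves it by a fraction only, so more answers are needed before the task can be closed.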
At each truth inference step, the task completion condition is checked (line 16) with Equation 1 and, if it holds, the task solution is returned and the task removed from the game (lines 17-18). The algorithm iterates until all tasks are solved (line 1) and truth is inferred on all tasks (line 22).
3.2 Requirement satisfaction
Qualitatively, we now assess how the approach presented in this section addresses the requirements listed in Section 2.2 and we discuss some of its positive consequences.
Labeling quality is controlled via the updates of the estimation scores, which are incremented with players’ contributions weighted by the reliability values. This means that the proposed approach takes into consideration the quality of contributions and “measures” it at each game play, thus relying on a “local” trustworthiness value; the dynamic recomputation of reliability fulfills requirement [R1], by addressing the fact that the same player can show a different behaviour in different moments of his/her playing, e.g. being careful vs. distracted.
The estimation scores, their update function (cf. Equation 3) and the task completion condition (cf. Equation 1) also have other interesting properties. The scores are attributed to each task-label combination and updated at each user contribution.
If a task is “easy”, different players will attribute the same label and the respective score will quickly increase and exceed the threshold of the exit condition. On the contrary, if a labelling task is difficult or controversial, different GWAP players may give different solutions from the set of admissible labels to the same task, so potentially all scores of that task get updated but none of them easily exceeds the threshold.
In other words, the proposed approach fulfills requirements [R2] on task difficulty, because easy and difficult tasks are automatically detected and treated accordingly, and [R4] on repeated labelling, as the number of players asked to solve the same task is dynamically adjusted.
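The different treatment of easy and difficult tasks can be reproduced with a short, self-contained simulation. It is only a sketch under stated assumptions: scores start from zero, the weighted increment is the product of reliability and increment, the completion test is strict, and all names and numeric values are invented for the example.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def infer_incrementally(contributions: Iterable[Tuple[str, float]],
                        labels: List[str],
                        increment: float,
                        threshold: float) -> Tuple[Optional[str], int]:
    """Consume (label, reliability) pairs one at a time.

    Stops as soon as one score exceeds the threshold and returns the inferred
    label together with the number of contributions that were needed.
    """
    scores: Dict[str, float] = {label: 0.0 for label in labels}
    used = 0
    for answer, reliability in contributions:
        used += 1
        scores[answer] += reliability * increment
        if scores[answer] > threshold:
            return answer, used
    return None, used  # still unsolved: the task stays in the game


labels = ["forest", "urban", "water"]

# "Easy" task: players agree, so the fourth answer is never requested.
easy = [("forest", 1.0), ("forest", 0.9), ("forest", 1.0), ("forest", 1.0)]
easy_result = infer_incrementally(easy, labels, increment=0.4, threshold=1.0)

# "Hard" task: answers are scattered, so more players are needed.
hard = [("forest", 1.0), ("urban", 0.8), ("water", 0.6),
        ("urban", 1.0), ("forest", 0.7), ("urban", 1.0)]
hard_result = infer_incrementally(hard, labels, increment=0.4, threshold=1.0)
```

In this example the easy task is closed after three contributions, whereas the hard one needs all six, without any redundancy value being fixed in advance.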
It is worth noting that in record linkage literature [fellegi1969theory], scores are assigned to each possible pair of records, and usually the “matching” score is increased while the “non-matching” scores are decreased. In the cases of possibly multiple labeling and uncertain solutions (cf. requirement [R2]), we propose to increase the score of the user-provided solution, without decreasing the score of the alternative solutions. Of course, variations of the update function in Equation 3 can be introduced, depending on the scenario characteristics. For example, the score of each label different from the one provided by the user could be decreased by a given decrement amount.
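One possible form of such a variation is sketched below in Python. Weighting the decrement by the player's reliability, like the increment, is an assumption of this sketch, as are all names and values.

```python
from typing import Dict


def update_with_penalty(scores: Dict[str, float], answer: str, reliability: float,
                        increment: float, decrement: float) -> None:
    """Reward the chosen label and penalize every alternative label.

    Both amounts are weighted by the player's reliability (an assumption).
    Scores may become negative, so a real implementation might clamp them.
    """
    for label in scores:
        if label == answer:
            scores[label] += reliability * increment
        else:
            scores[label] -= reliability * decrement


# Hypothetical task where two labels are currently tied.
penalized = {"forest": 0.5, "urban": 0.5, "water": 0.0}
update_with_penalty(penalized, "forest", reliability=1.0, increment=0.5, decrement=0.25)
# penalized is now {"forest": 1.0, "urban": 0.25, "water": -0.25}
```

This variant separates competing labels faster, but it is less suited to tasks that admit multiple valid labels, since every vote for one label counts against the others.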
By design, Algorithm 1 fulfills requirement [R3], since each player contribution (line 14) triggers a step of the truth inference estimate (line 15) and leads to the exit condition check (line 16). This incremental approach ensures that the task is assigned to players only until an inferred “true” solution is reached, thus avoiding useless redundancy of labelling (again satisfying requirement [R4]).
The dynamically adjusted repeated labelling also has the consequence of indirectly estimating task complexity: indeed we can say that the more contributions are needed to satisfy the exit condition of Equation 1, the more difficult the task. Therefore, whenever an assessment of the task difficulty is required, the number of collected contributions can be adopted as a proxy measure. In our previous work [re2018human] we indeed demonstrated that this empirical measure of difficulty is highly correlated with the (lack of) confidence value resulting from machine learning classifiers applied to the same data.
A final note on task assignment: it is a common best practice to give each task to a crowd worker at most once and to perform answer aggregation on responses from different workers; this is also true for GWAPs, in that the same player could get bored if requested to solve the same problem over and over. This means that task assignment to a player (lines 3 and 12) takes tasks from the set of unsolved tasks and from the set of ground truth tasks respectively, among those that the player has never solved before. A pragmatic strategy to avoid using up the entire set of control tasks, which we usually adopt when implementing GWAPs, is to dynamically extend the set of ground truth tasks by adding the solved tasks (those removed from the set of unsolved tasks when the “true” solution is inferred); line 18 could therefore also move each solved task into the ground truth set.
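The assignment policy and the promotion of solved tasks to control tasks can be sketched as follows. All function names, parameters and task identifiers are hypothetical; the sketch only assumes that the game keeps, for each player, the set of tasks already seen.

```python
import random
from typing import Dict, List, Set


def assign_tasks(player_history: Set[str], unsolved: List[str],
                 ground_truth: Dict[str, str], n_tasks: int, n_control: int,
                 rng: random.Random) -> List[str]:
    """Pick tasks the player has never seen, mixing unsolved and control tasks."""
    fresh_unsolved = [t for t in unsolved if t not in player_history]
    fresh_control = [t for t in ground_truth if t not in player_history]
    chosen = (rng.sample(fresh_unsolved, min(n_tasks, len(fresh_unsolved)))
              + rng.sample(fresh_control, min(n_control, len(fresh_control))))
    rng.shuffle(chosen)  # control tasks should not be recognizable by position
    player_history.update(chosen)
    return chosen


def promote_solved_task(task: str, label: str, unsolved: List[str],
                        ground_truth: Dict[str, str]) -> None:
    """Remove a solved task from the game and reuse it as a control task."""
    unsolved.remove(task)
    ground_truth[task] = label


rng = random.Random(0)
history = {"t1"}  # this player already solved t1
unsolved = ["t1", "t2", "t3"]
ground_truth = {"g1": "forest", "g2": "urban"}

round_tasks = assign_tasks(history, unsolved, ground_truth, n_tasks=2, n_control=1, rng=rng)
promote_solved_task("t2", "water", unsolved, ground_truth)
```

A promoted task carries an inferred label rather than an expert-verified one, so a deployment may prefer to promote only tasks whose final score is well above the threshold.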
4 Evaluation
To evaluate the proposed truth inference algorithm we performed a comparative assessment with alternative solutions, on the basis of the data collected through two different GWAPs: the LCV Game [brovelli2018crowdsourcing] and Night Knights [re2018human].
The Land Cover Validation (LCV) Game (cf. http://landcover.como.polimi.it/landcover/) addresses a multinomial classification of items with 5 different labels; domain experts required a minimum of 3 different and agreeing contributions for each item classification. Night Knights (cf. https://www.nightknights.eu/) asks players to classify pictures with one of 6 admissible labels; at least 4 agreeing contributions from different users were requested by experts on the basis of domain-specific considerations.
A first evaluation of our approach is based on the total number of contributions to be collected (in line with requirement [R4]). In most crowdsourcing settings, where aggregation is computed ex post, a fixed number of contributions is collected for each task. Consider the multinomial classification of a set of tasks with a given number of admissible labels and a minimum number of agreeing labels per task. To implement an ex post aggregation with simple majority voting, the total number of needed contributions is the redundancy computed as
(4) 
Moreover, in traditional microwork/crowdsourcing settings, there is experimental evidence of 40-45% of spammers among crowd workers [shah2010spam, vuurens2011much], thus redundancy could be even higher than the one computed in Equation 4.
Table 1 shows the theoretical and empirical numbers for LCV Game and Night Knights: the incremental approach that we propose leads to a considerable “saving” in terms of redundancy, since whenever the minimum number of contributions is enough to consider the task solved, no more labels are sought.
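| GWAP | Tasks | Labels | Min. agreeing labels | Theoretical contributions (ex post) | Empirical contributions (incremental) | % diff. |
| --- | --- | --- | --- | --- | --- | --- |
| LCV Game | 1,000 | 5 | 3 | 11,500 | 6,400 | 44% |
| Night Knights | 27,700 | 6 | 4 | 525,000 | 205,000 | 61% |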
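The percentages in the last column of Table 1 can be checked with a few lines of Python, under the assumption that “% diff.” is the relative saving of the empirical number of contributions with respect to the theoretical one. The function name is illustrative; the figures are those of the table.

```python
def relative_saving(theoretical: int, empirical: int) -> float:
    """Share of contributions saved with respect to the theoretical redundancy."""
    return (theoretical - empirical) / theoretical


lcv_saving = relative_saving(11_500, 6_400)       # about 0.443
nk_saving = relative_saving(525_000, 205_000)     # about 0.610

print(f"LCV Game: {lcv_saving:.0%}")       # LCV Game: 44%
print(f"Night Knights: {nk_saving:.0%}")   # Night Knights: 61%
```

Both values match the table after rounding, which supports this reading of the column.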
Finally, to assess the ability of our incremental approach to infer the truth, we applied state-of-the-art algorithms for ex post data aggregation to the contributions collected by our GWAPs and compared the resulting classifications with ours. Namely, we ran expectation maximization [dawid1979maximum] and message passing [karger2011iterative], which are the most frequently used truth inference algorithms; then, we compared the aggregated labels with a confusion matrix. The results reported in Table 2 show that the overlap between the “truths” inferred with the compared algorithms is indeed very high, and the agreement statistics confirm it. This supports the validity and applicability of our approach.

| GWAP | Algorithm | % diff. | Accuracy | Kappa | Rand |
| --- | --- | --- | --- | --- | --- |
| LCV Game | EM | 3.9% | 96.1% | 93.4% | 88.7% |
| LCV Game | MP | 3.1% | 96.9% | 94.7% | 90.6% |
| Night Knights | EM | 0.3% | 99.7% | 99.4% | 99.4% |
| Night Knights | MP | 0.2% | 99.8% | 99.6% | 99.6% |
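For readers who wish to run a similar comparison on their own data, the following Python sketch computes three agreement statistics between two label assignments over the same tasks. It assumes that “Kappa” denotes Cohen's kappa and uses the plain (unadjusted) Rand index; the variant used for Table 2 is not stated here, so the sketch is not meant to reproduce those numbers. The toy labels are invented.

```python
from collections import Counter
from itertools import combinations
from typing import Sequence


def accuracy(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of tasks on which the two labelings coincide."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(a)
    observed = accuracy(a, b)
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[l] * count_b[l] for l in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelings use one single, identical label
    return (observed - expected) / (1.0 - expected)


def rand_index(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of task pairs grouped consistently by the two labelings."""
    pairs = list(combinations(range(len(a)), 2))
    consistent = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return consistent / len(pairs)


# Toy example: labels inferred incrementally vs. by an ex post algorithm.
incremental = ["x", "x", "y", "y", "z", "z"]
ex_post = ["x", "x", "y", "z", "z", "z"]

acc = accuracy(incremental, ex_post)
kappa = cohen_kappa(incremental, ex_post)
rand = rand_index(incremental, ex_post)
```

The pairwise Rand computation is quadratic in the number of tasks, so for datasets of the size of Night Knights a contingency-table formulation would be preferable.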
5 Conclusions
In this paper, we proposed an incremental algorithm for truth inference that satisfies the requirements emerging from the aggregation of player contributions in Games with a Purpose. We explained and described our approach in detail, highlighting the practical consequences and advantages, including the avoidance of useless redundancy with the minimization of required task solutions, and the dynamic estimation of player reliability, label quality and task difficulty.
We also presented a comparative evaluation of the approach on actual data collected through two different GWAP applications, which supports the applicability and advantages of the proposed incremental truth inference.
It is worth noting that we also released as open source the GWAP Enabler [re2018framework], a software framework to build Games with a Purpose which implements the incremental truth inference approach outlined in this paper. The two games mentioned in the evaluation section were developed on top of this framework. The interested reader can find on GitHub both the software framework (cf. https://github.com/STARS4ALL/gwapenabler) and a tutorial explaining how to use and configure it (cf. https://github.com/STARS4ALL/gwapenablertutorial).
Acknowledgments
This work is partially supported by the STARS4ALL project (H2020-688135), co-funded by the European Commission. We thank Andrea Fiano for the implementation of the Night Knights game, Alejandro Sánchez de Miguel and Lucía García for their support in the interpretation of our experimental results from a light pollution research point of view, and Esteban González Guardia for the retrieval and provision of images from the NASA repository.