Crowdsourcing has been the primary enabling mechanism behind the generation of many valuable Internet resources. Wikipedia, Freebase, and other knowledge repositories were created by volunteers who contributed knowledge about a wide variety of topics. Other human computation applications engage users in creative ways to generate interesting and useful by-products of the engagement. The ESP Game asks users to participate in a game that generates useful image tags. ReCAPTCHA verifies that users are humans by asking them to transcribe letters from a distorted image, thus helping with the digitization of books. Duolingo (http://www.duolingo.com/) teaches users a new language and as a by-product generates translations of written material in different languages. However, despite these widely known success stories, building and engaging a community of users is a challenging task. Coming up with creative engagement strategies (e.g., ESP Game) is difficult, and spawning successful crowd-powered sites such as Wikipedia often seems like the exception rather than the norm.
In order to sidestep the problem, many efforts rely on paid crowdsourcing; for example, hiring workers through platforms such as Amazon Mechanical Turk allows for direct engagement of users, with a clear monetary incentive. Unfortunately, the introduction of money as a predictable and repeatable motivator is a mixed blessing. Users who are motivated by monetary rewards are often different from the users who are unpaid and motivated by other means [22, 28, 40]. Furthermore, studies indicate that the use of monetary rewards can be highly detrimental for users who are already intrinsically motivated: the introduction of monetary compensation reveals to the users exactly how much their work is valued by the task requester, and low payments make things worse.
Finally, even if the incentive problem is solved, how does one attract the crowd that is properly qualified for a given task? The workers participating in paid crowdsourcing are typically non-expert users, and often lack the skills needed for the crowdsourcing effort. For example, if one’s task calls for Swahili speakers or maxillofacial surgeons, then most labor marketplaces either do not provide access to such users at all, or have only a few users with the required expertise.
Thus, a set of natural and challenging questions emerges. Can we replicate the predictability of paid crowdsourcing in terms of attracting participation, while engaging unpaid users? And how can we identify and incentivize experts among these users, who match the needs of the application at hand?
Here we propose to use existing Internet advertising platforms for targeting and attracting users with suitable expertise for the task at hand. Over the last decade, advertising platforms have improved their targeting capabilities to identify users who are good matches for the goals of the advertiser. In our case, we initiate the process with simple advertising campaigns but also integrate the ad campaign with the crowdsourcing application, and provide feedback to the advertising system for each ad click: the feedback indicates whether the user who clicked on the ad “converted,” and the total contribution of that user to the crowdsourcing effort. This allows the advertising platform to naturally identify websites with user communities that are good matches for the given task. For example, in our experiments with acquiring medical knowledge, we initially believed that “regular” Internet users would not have the necessary expertise. However, the advertising system automatically identified sites such as Mayo Clinic and HealthLine, which are frequented by knowledgeable consumers of health information who ended up contributing significant amounts of high-quality medical knowledge. Our idea is inspired by Hoffman et al., who used advertising to attract users to a Wikipedia-editing experiment, although they neither attempted to target users nor optimized the ad campaign by providing feedback to the advertising platform.
Once users arrive at our site, we need to engage them to contribute useful information. Our crowdsourcing platform, Quizz, invites users to test their knowledge in a variety of domains and see how they fare against other users. Figure 1 shows an example question. Our quizzes include two kinds of questions. Calibration questions have known answers, and are used to assess the expertise and reliability of the users. Collection questions, on the other hand, have no known answers and actually serve to collect new information; our platform identifies the correct answers based on the answers provided by the (competent) participants. To optimize how often to test the user, and how often to present a question with an unknown answer, we use a Markov Decision Process, which formalizes the exploration/exploitation framework and selects the optimal strategy at each point.
As our analysis shows, a key component for the success of the crowdsourcing effort is not just getting users to participate, but also keeping the good users participating for a long time, while gently discouraging low-quality users from participating. In a series of controlled experiments, involving tens of thousands of users, we show that a key advantage of attracting unpaid users through advertising is the strong self-selection of high-quality users to continue contributing, while low-quality users self-select to drop out. Furthermore, our experimental comparison with paid crowdsourcing (both paid hourly and paid piecemeal) shows that our approach dominates paid crowdsourcing both in terms of the quality of users and in terms of the total monetary cost required to complete the task.
The contributions of this paper are fourfold. First, we formulate the notion of targeted crowdsourcing, which allows one to identify crowds of users with desired expertise, and we describe a practical approach to find such users at scale by leveraging existing advertising systems. Second, we show how to optimally ask questions to the users, to leverage their knowledge. Third, we evaluate the utility of a host of different engagement mechanisms, which incentivize users to contribute more high-quality answers via the introduction of short-term goals and rewards. Finally, our empirical results confirm that the proposed approach allows us to collect and curate knowledge with accuracy that is superior to that of paid crowdsourcing mechanisms at the same or lower cost.
Figure 2 shows the overview of the system, and the various components that we discuss in the paper. Section 2 describes the use of advertising to target promising users, and how we set up the campaigns to allow for continuous, automatic optimization of the results over time. Section 3 shows the details of our information-theoretic scheme for measuring the expertise of the participants, while Section 4 gives the details of our exploration-exploitation scheme. Section 5 discusses our experiments on how to keep users engaged, and Section 6 gives the details of our experimental results. Finally, Section 7 describes related work, while Section 8 concludes.
2 Advertising for Targeting Users
A key problem of every crowdsourcing effort is soliciting users to participate. At a fundamental level, it is always preferable to attract users that have an inherent motivation for participation. Unfortunately, repeating the successes of efforts such as Wikipedia, TripAdvisor, and Yelp seems more of an art than a science, and we do not yet fully understand how to create engaging and viral crowdsourcing applications in a replicable manner. The emergence of paid crowdsourcing (e.g., Amazon Mechanical Turk) allows direct engagement of users in exchange for monetary rewards. However, the population of users who participate due to extrinsic rewards is typically different from the users who participate because of their intrinsic motivation.
Quizz uses online advertising to attract unpaid users to contribute. By running ads, we get into the middle ground between paid and unpaid crowdsourcing. Users who arrive at our site through an ad are not getting paid, and if they choose to participate they obviously do so because of their intrinsic motivation. This removes some of the wrong incentives and tends to alleviate concerns about indifferent users that “spam” the results just to get paid, or about workers that are trying to do the minimum work necessary in order to get paid. Thanks to the sheer reach of modern advertising platforms, the population of unpaid users can potentially be orders of magnitude larger than that in paid marketplaces. There are billions of users reachable through advertising, while even the biggest crowdsourcing platforms have at most a million users, many of them inactive [19, 18]. Therefore, if the need arises (and subject to budgetary constraints), our approach can elastically scale up to reach almost arbitrarily large populations of users, by simply increasing the budget allocated to the advertising campaign. At the same time, we show in Section 6 that our approach allows efficient use of the advertising budget (which is our only expenditure), and our overall costs are the same or lower than those in paid crowdsourcing installations.
A significant additional benefit of using an advertising system is its ability to target users with expertise in specific topics. For example, if we are looking for users possessing medical knowledge, we can run a simple ad like the one in Figure 3. To do so, we select keywords that describe the topic of interest and ask the advertising platform to place the ad in relevant contexts. In this study, we used Google AdWords (https://adwords.google.com), and opted into both search and display ads, while in principle we can use any other publicly available advertising system.
Selecting appropriate keywords for an ad campaign is a challenging topic in itself [13, 1, 20]. However, we believe that trying to optimize the campaign only through manually fine-tuning its keywords is of limited utility. Instead, we propose to automatically optimize the campaign by quantifying the behavior of the users that clicked on the ad. A user who clicks on the ad but does not participate in the crowdsourcing application is effectively “wasting” our advertising budget; using the advertising terminology, such a user has not “converted.” Since we are not just interested in attracting any users but are interested in attracting users who contribute, we use Google Analytics (http://www.google.com/analytics) to track user conversions. Every time a user clicks on the ad and then participates in a quiz, we record a conversion event, and send this signal back to the advertising system. This way, we are effectively asking the system to optimize the advertising campaign for maximizing the number of conversions and thus increasing our contribution yield, instead of the default optimization for the number of clicks.
Although optimizing for conversions is useful, it is even better to attract competent users (as opposed to, say, users who just go through the quiz without being knowledgeable about the topic). That is, we want to identify users who are both willing to participate and possess the relevant knowledge. In order to give this refined type of feedback to the advertising system, we need to measure both the quantity and the quality of user contributions, and for each conversion event report the true “value” of the conversion. To achieve this aim, we set up Google Analytics to treat our site as an e-commerce website, and for each conversion we also report its value. Section 3 describes in detail our approach to quantifying the values of conversions.
When the advertising system receives fine-grained feedback about conversions and their value, it can improve the ad placement and display the ad to users who are more likely to participate and contribute high-quality answers. (In our experiments, in Section 6, this optimization led to an increase in conversion rate from 20% to over 50%, within a period of one month, for a campaign that was already well-optimized.) For example, consider medical quizzes. We initially believed that identifying users with medical expertise who are willing to participate in our system would be an impossible task. However, thanks to tracking conversions and modeling the value of user contributions, AdWords started displaying our ad on websites such as Mayo Clinic and HealthLine. These websites are not frequented by medical professionals but by prosumers. These users are both competent and much more likely than professionals to participate in a quiz that assesses their medical knowledge. Often, this is exactly the type of user that a crowdsourcing application is looking for.
3 Measuring User Contributions
In order to understand the contributions of a user for each quiz, we need first to define a measurement strategy. Measuring the user contribution using just the number of answers is problematic, as it does not consider the quality of the submissions. Similarly, if we just measure the quality of the submitted answers, we do not incentivize participation. Intuitively, we want users to contribute high quality answers, and also contribute many answers. Thus, we need a metric that increases as both quality and volume increase.
Information Gain: To combine both quality and quantity into a single, principled metric, we adopt an information-theoretic approach [36, 31]. We treat each user as a “noisy channel,” and measure the total information “transmitted” by the user during her participation. The information is measured as the information gain $IG(q, n)$ contributed for each answer, multiplied by the total number of answers submitted by the user; this is the total information submitted by the user. More formally, assume that we know the probability $q$ that the user correctly answers a randomly chosen question of the quiz. Then, the information gain is defined as:
$$IG(q, n) = H\left(\tfrac{1}{n}, n\right) - H(q, n) \tag{1}$$
where $n$ is the number of multiple choices in a quiz question. We use $H(q, n)$ to define the entropy for an answer:
$$H(q, n) = -q \log(q) - (1 - q) \log\left(\frac{1 - q}{n - 1}\right) \tag{2}$$
(Note that the user can select among $n$ possible answers, and we assume that the error probabilities are uniformly distributed among the $n - 1$ incorrect answers, each being selected with probability $\frac{1 - q}{n - 1}$.)
When $q = 1$ (user always gives perfect answers), then $H(q, n) = 0$ (i.e., no uncertainty), and if $q = 1/n$ (user selects randomly from the possible answers) then $H(q, n) = \log(n)$.
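As a minimal sketch of the per-answer metric described above, the following Python functions compute the entropy and information gain for a user who answers correctly with probability q on n-choice questions, with errors spread uniformly over the n − 1 wrong options. The function names and the choice of base-2 logarithms (so results are in bits) are assumptions made for illustration, not part of the Quizz implementation.

```python
import math


def entropy_bits(q, n):
    """Entropy of one answer: correct with probability q, and each of the
    n - 1 incorrect options chosen with probability (1 - q) / (n - 1)."""
    h = 0.0
    if q > 0.0:
        h -= q * math.log2(q)
    if q < 1.0:
        h -= (1.0 - q) * math.log2((1.0 - q) / (n - 1))
    return h


def information_gain_bits(q, n):
    """Information gain of one answer relative to random guessing (q = 1/n)."""
    return math.log2(n) - entropy_bits(q, n)


def total_information_bits(q, n, num_answers):
    """Per-answer information gain multiplied by the number of answers."""
    return num_answers * information_gain_bits(q, n)


if __name__ == "__main__":
    for q in (0.25, 0.5, 0.75, 1.0):
        print(q, round(information_gain_bits(q, 4), 4))
```

Under these assumptions a perfect user contributes log2(n) bits per answer and a random guesser contributes nothing, so a user who submits many answers at chance level still ends up with a total of zero.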
Information Gain under Uncertainty: In our environment, the quality $q$ of a user is unknown. In fact, the goal of Quizz is to estimate $q$ for each user, by asking the users to answer a set of quiz questions. We can try to approximate $q$ with the ratio $\frac{a}{a+b}$, where $a$ is the number of correct and $b$ is the number of incorrect answers for the user, but we face the problem of sparse data, especially during the early stages of the quiz when $a + b$ is relatively small.
Due to the uncertainty in measuring the exact quality of each user, we introduce a Bayesian version of the information gain metric. Specifically, we explicitly acknowledge the uncertainty of our measurements, and we treat the estimate of $q$ as a distribution, and not as a point estimate. When $q$ is a random variable, the expected information gain is:
$$E[IG(q, n)] = \int_0^1 \Pr(q) \cdot IG(q, n) \, dq \tag{3}$$
In our system, we assume that $q$ is constant across questions and latent. (In the future, we can use Item Response Theory and allow each question to have its own value of $q$.) However, we observe the number of correct answers $a$ from the user; when $q$ is constant, $a$ follows a binomial distribution. We use the vanilla Bayesian estimation strategy for estimating the probability of success in a binomial distribution. We set $\Pr(q) = \mathrm{Beta}(1, 1)$ (i.e., the uniform distribution) as the conjugate prior, and then $\Pr(q \mid a, b)$ is a Beta distribution. (Alternatively, we can use a mixture of Beta priors to better encode our prior knowledge about the distribution of user qualities.) After the user submits $a$ correct and $b$ incorrect answers, $q$ follows the distribution:
$$\Pr(q \mid a, b) = \frac{q^{a} (1 - q)^{b}}{B(a + 1, b + 1)} \tag{4}$$
with $B(x, y)$ being the Beta function. After some algebraic manipulations, we have:
$$E[IG(q, n) \mid a, b] = \log(n) - \frac{b+1}{a+b+2} \log(n - 1) + \frac{a+1}{a+b+2} \Psi(a + 2) + \frac{b+1}{a+b+2} \Psi(b + 2) - \Psi(a + b + 3) \tag{5}$$
where $\Psi(x)$ is the digamma function. Figure 4 shows how $E[IG(q, n) \mid a, b]$ changes for different numbers of answers and for workers of varying competence. Following the same process, we also compute the variance of the information gain:
$$Var[IG(q, n) \mid a, b] = E[IG(q, n)^2 \mid a, b] - \left(E[IG(q, n) \mid a, b]\right)^2 \tag{6}$$
In the Appendix we list the detailed form of $E[IG(q, n)^2 \mid a, b]$. In the next section, we discuss how we use these measurements to optimally decide between assessing the user’s knowledge and collecting new judgments.
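The Bayesian estimate described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: it assumes the uniform Beta(1, 1) prior mentioned in the text, so that after a correct and b incorrect answers the posterior is Beta(a + 1, b + 1). For integer arguments the digamma function equals a harmonic number minus Euler's constant, and the constants cancel in the closed form, so only the standard library is needed. The function names are hypothetical, and a simple midpoint-rule integral is included as a cross-check of the closed form.

```python
import math


def harmonic(k):
    """k-th harmonic number; digamma(k + 1) = harmonic(k) - Euler's constant."""
    return sum(1.0 / i for i in range(1, k + 1))


def expected_information_gain_bits(a, b, n):
    """Closed-form posterior mean of the per-answer information gain, in bits."""
    alpha, beta = a + 1, b + 1  # uniform prior Beta(1, 1) plus observed counts
    total = alpha + beta
    nats = (math.log(n)
            - (beta / total) * math.log(n - 1)
            + (alpha / total) * harmonic(alpha)
            + (beta / total) * harmonic(beta)
            - harmonic(total))
    return nats / math.log(2)


def expected_information_gain_numeric(a, b, n, steps=20000):
    """Same quantity by midpoint-rule integration over q (cross-check only)."""
    log_norm = math.lgamma(a + b + 2) - math.lgamma(a + 1) - math.lgamma(b + 1)
    acc = 0.0
    for i in range(steps):
        q = (i + 0.5) / steps  # midpoints stay strictly inside (0, 1)
        density = math.exp(log_norm + a * math.log(q) + b * math.log(1.0 - q))
        entropy = (-q * math.log2(q)
                   - (1.0 - q) * math.log2((1.0 - q) / (n - 1)))
        acc += density * (math.log2(n) - entropy) / steps
    return acc


if __name__ == "__main__":
    for a, b in ((0, 0), (3, 1), (10, 0)):
        print(a, b,
              round(expected_information_gain_bits(a, b, 4), 4),
              round(expected_information_gain_numeric(a, b, 4), 4))
```

Multiplying the returned value by the number of answers a user has submitted gives a total-information score of the kind used later in the evaluation.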
4 Exploration / Exploitation
So far, we have described the setting where the user arrives and starts participating by answering quiz questions. Using the information gain metric, described in the previous section, we can estimate the amount of information that we can extract from a user if we ask a collection question, with an unknown (to us) answer. However, our goal is not just to estimate how much information we could get, but to actually acquire new knowledge from the user. This creates a natural exploration-exploitation tradeoff. We can choose to “explore” how competent the user is, asking calibration questions and gaining increasingly higher confidence about the user’s competence on a topic; or we can try to “exploit,” asking collection questions.
To formalize our decision making, we assume that the decision on whether to explore or exploit depends only on the current quiz that the user is solving and the current state of the user, which can be described by the number of correct answers $a$, the number of incorrect answers $b$, and the number of times $c$ that we asked a collection question. Given the state vector $\langle a, b, c \rangle$ of the user, we use a Markov Decision Process (MDP) to select the next action to take, based on the following considerations.
User dropping out: At any point, the user may opt to abandon the application. When the user drops out, we do not obtain any additional utility; therefore an optimal set of actions should try to steer the user towards states with high probability of “survival.” (As we will see in Section 6, the probability of abandonment increases when the user gives incorrect answers to the quiz questions, and when the user does not receive feedback about the correctness of the submitted answer.) In our application, we estimate the probability of survival based on the empirically observed “lifetimes” of users, using a non-parametric kernel-density estimator with Gaussian smoothing (we use the KernSmooth package in R; see Figure 6).
Ask a collection question: When we select to ask a collection question, there are two components for the utility that the Quizz system receives. Namely, there is the immediate utility of getting information about the potential answer for the question, and there is the utility that we will accumulate from the future actions of the user. The former utility is equal to the expected information gain for the user given her current state vector (see Equation 3). However, we want to be more pessimistic about the utility estimates and assign more value to learning about the user competency. Following the “value of learning” approach, we therefore discount the reward below the expected information gain, in order to encourage our application to learn more about the user before asking her to contribute new knowledge. The utility for the future steps is the utility of the state that has the same numbers of correct and incorrect answers and one more collection question asked.
Ask a calibration question: When we select to ask a calibration question, we are trying to learn more about the competence of the user on the specific topic of the quiz. When presented with a question, the user may give either a correct or an incorrect answer, which will lead to a revision of our estimates for that user. Although we do not directly obtain utility by asking a calibration question, the revised estimate applies to all previously asked collection questions. Therefore, the immediate utility is the change in the estimated utility of a collection question relative to its previous estimate, accumulated over all previously asked collection questions. Furthermore, the utility of future steps is the stochastic sum of two possible forward paths: the utility when the user gives the correct answer (weighted by the estimated probability of a correct answer), and the utility when the user gives an incorrect answer (weighted by the complementary probability).
The explore-exploit algorithm describes the implementation of this MDP. One immediate concern with this formulation is that the recursive algorithm definition points to states in the future, and hence the total reward could potentially be infinite. This concern is alleviated if we assume that the information gain in each step is bounded, and that the probability of survival is strictly less than 1. We know that the maximum information gain derived from a single question is $\log(n)$, and there is always a non-zero probability that the user will abandon the application. Therefore, the total utility that can be extracted from a single user is bounded by the geometric series $\sum_{t \geq 0} p^t \log(n) = \frac{\log(n)}{1 - p}$, where $p < 1$ is an upper bound on the survival probability at each step.
Another problem that arises with a recursive definition that points to future states is that the computational estimation becomes harder. Classic dynamic programming solutions assume a setting of backwards induction, where the recursion eventually leads to some initial state that has a known utility (e.g., the value for a given state depends only on the values of states closer to that initial state). However, in our setting we have a forward induction, making the recursive computation challenging. To allow the recursive computation to complete, we introduce a limited execution horizon for the recursion: once the recursion has exceeded that level of depth, we stop the computation and return. To find out whether the algorithm converges, we run the algorithm iteratively with increasing horizon, until we observe that the actions and utility calculations converge. Empirically, the algorithm converges faster when the survival probabilities become smaller.
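The limited-horizon recursion described above can be sketched as follows. This is a simplified illustration, not the authors' algorithm, and it makes several assumptions that are not in the text: the survival probability is a single constant (the paper estimates it empirically per state), the pessimistic reward is taken to be the posterior mean of the information gain minus one standard deviation, the probability of a correct answer is the posterior mean (a + 1) / (a + b + 2) under a uniform prior, and all function and parameter names are hypothetical.

```python
import math
from functools import lru_cache


def info_gain_bits(q, n):
    """Per-answer information gain in bits; only called with 0 < q < 1."""
    entropy = -q * math.log2(q) - (1.0 - q) * math.log2((1.0 - q) / (n - 1))
    return math.log2(n) - entropy


@lru_cache(maxsize=None)
def pessimistic_gain(a, b, n, steps=2000):
    """Posterior mean minus one standard deviation of the information gain
    under Beta(a + 1, b + 1). ASSUMED reward form, midpoint-rule integration."""
    log_norm = math.lgamma(a + b + 2) - math.lgamma(a + 1) - math.lgamma(b + 1)
    mean = second = 0.0
    for i in range(steps):
        q = (i + 0.5) / steps
        density = math.exp(log_norm + a * math.log(q) + b * math.log(1.0 - q))
        gain = info_gain_bits(q, n)
        mean += density * gain / steps
        second += density * gain * gain / steps
    return mean - math.sqrt(max(second - mean * mean, 0.0))


@lru_cache(maxsize=None)
def plan(a, b, c, horizon, n=4, survive=0.9):
    """Return (utility, action) for state <a, b, c> with limited look-ahead."""
    if horizon == 0:
        return 0.0, "stop"  # horizon exhausted: stop the recursion
    now = pessimistic_gain(a, b, n)

    # Exploit: immediate reward now, then continue with one more collection
    # question counted, provided the user survives.
    collect = now + survive * plan(a, b, c + 1, horizon - 1, n, survive)[0]

    # Explore: the revised estimate is credited to the c collection questions
    # already asked; the future splits on a correct or incorrect answer.
    p_correct = (a + 1) / (a + b + 2)
    if_right = (c * (pessimistic_gain(a + 1, b, n) - now)
                + survive * plan(a + 1, b, c, horizon - 1, n, survive)[0])
    if_wrong = (c * (pessimistic_gain(a, b + 1, n) - now)
                + survive * plan(a, b + 1, c, horizon - 1, n, survive)[0])
    calibrate = p_correct * if_right + (1.0 - p_correct) * if_wrong

    if collect >= calibrate:
        return collect, "collect"
    return calibrate, "calibrate"


if __name__ == "__main__":
    for horizon in range(1, 7):
        print(horizon, plan(0, 0, 0, horizon))
```

Running the loop at the bottom with increasing horizon values and checking when the chosen action and utility stop changing mirrors the convergence check described in the text. Note that with the plain posterior mean as the reward, the expected revision from a calibration question is zero, which is why some pessimistic discount is needed for exploration to be worthwhile in this sketch.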
5 Engagement Incentives
In the previous section, we discussed how we can alternate between “exploration” and “exploitation” in order to assess the user’s competence and collect new information, respectively. A key requirement for the algorithm to work effectively is to have a reasonable level of participation from the users: if a user submits just a couple of answers, we cannot effectively assess the user’s competence or reliably collect new information.
Therefore, a key component of Quizz is the ability to continuously run experiments with various incentive mechanisms that try to incentivize users to continue participating. Based on theories of intrinsic motivation, we implemented a variety of incentives, with the goal of prolonging the participation of competent users, while gently discouraging the non-knowledgeable users from submitting low-quality answers. Specifically, we tried the following options:
Feedback for submitted answer: We experimented with various types of feedback that we give back to the users. We tried giving no feedback, saying whether the answer was correct or not, and showing the correct answer. The basic hypothesis is that immediate performance feedback should motivate competent users to continue participating, and potentially incentivize low-performing users to try to improve their performance.
Displaying scores: We experimented with displaying different types of scores to the user. We displayed the percentage of correct answers, the total number of correct answers, and a score based on the information gain, and combinations thereof.
Displaying crowd performance: We experimented with showing the crowd performance on each question (i.e., how many users answered the question correctly). We hypothesized that knowing how other users perform is going to increase the user effort.
In order to evaluate our system, we used the following metrics to quantify the level of user engagement and the quality of their contributions.
Conversion rate: We define conversion rate as the fraction of users who answered at least one quiz question, after clicking one of our ads (cf. Section 2). We use this metric to measure the effectiveness of our advertising.
User lifetime: We examine the number of (correct and incorrect) answers submitted by the users. We use this metric mainly to understand the effect of the various engagement incentives.
Total information gain: We measure the expected information gain of each user using Equation 3, and multiply this value by the total number of answers submitted by the user. The result is the total information that we received from the user.
Monetary cost per correct fact: We measure the total cost required to verify a fact at 90%, 95%, and 99% estimated accuracy. To compute the cost for various levels of accuracy, we use the fact that if we know the quality of a contributor, we can estimate the required redundancy to reach the desired level of confidence [35, 15]. For example, if users are 90% accurate and we pay a cost of $0.10 per contributed answer, we need one such worker to verify a fact at 90% accuracy (i.e., cost $0.10 at 90% accuracy), and approximately two such workers to verify a fact at 99% accuracy (i.e., cost $0.20 at 99% accuracy). To evaluate the correctness of the answers submitted by the users and the corresponding capacity of the system, we used questions with answers that have been pre-validated by multiple, trusted human judges. The costs are calculated based on the total advertising expenditure for attracting the users to the Quizz site, broken down by quiz.
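The redundancy arithmetic in the example above can be reproduced with a short sketch. It rests on simplifying assumptions that are ours, not the paper's: the fact is a binary (true/false) question with a uniform prior, workers are independent and equally accurate, and all workers agree. Under those assumptions, k agreeing answers from workers with accuracy p give posterior odds (p / (1 − p))^k, so the required (fractional) redundancy is logit(target) / logit(p). The cited methods [35, 15] may compute redundancy differently; the function names here are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def required_redundancy(worker_accuracy, target_accuracy):
    """Fractional number of agreeing, independent workers (accuracy > 0.5)
    needed to reach the target posterior accuracy on a binary fact."""
    return max(1.0, logit(target_accuracy) / logit(worker_accuracy))


def cost_per_fact(worker_accuracy, target_accuracy, cost_per_answer):
    """Redundancy multiplied by the price paid per contributed answer."""
    return required_redundancy(worker_accuracy, target_accuracy) * cost_per_answer


if __name__ == "__main__":
    for target in (0.90, 0.95, 0.99):
        print(target,
              round(required_redundancy(0.9, target), 2),
              round(cost_per_fact(0.9, target, 0.10), 3))
```

For 90%-accurate workers this gives a redundancy of one at the 90% level and slightly more than two at the 99% level, in line with the "approximately two such workers" figure in the text.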
6.2 Capacity and cost analysis
| Treatment Side Effects | 605 | 5,044 | $46.38 | 1.22 | 1.57 | 2.12 | $0.13 | $0.10 | $0.07 |
| Artist and Albums | 310 | 1,548 | $21.56 | 0.88 | 1.13 | 1.52 | $0.16 | $0.13 | $0.09 |
| Artist and Song | 925 | 5,285 | $236.26 | 0.96 | 1.23 | 1.66 | $0.54 | $0.42 | $0.31 |
Based on our measurements for September 2013, the conversion rate for the Quizz application averaged 34.60%, resulting in a total of 4,091 engaged users out of 11,825 users that visited the application (having clicked an ad). The conversion rate increased steadily over time, starting at around 20% in the beginning of the month, and reaching a high of 51.25% on September 30th. (As we will discuss below, this is due to the continuous optimization from the conversion optimizer.)
In our experiments we used eight different quizzes on various topics, and we report in Table 1
6.3 User contributions and self-selection
In terms of per-user contribution, Figure 5 shows the distribution of total information gain across the participating users, and Figure 6 shows the lifetime of users as a function of their quality. As expected, many users come, submit a few answers, and then leave. These are the “head” users; although they do contribute some useful signal, they do not generate a great “return on investment.” Figure 7 further illustrates that the users who submit a large number of answers also tend to submit more correct answers than incorrect ones. This means that the users who are competent about the topic submit more and more answers, while the ones who cannot answer the quiz questions correctly drop out. This is an illustration of the benefit of unpaid users: there is little incentive for unpaid users to continue participating when there is no monetary reward and they are not good at the task. (We present a more detailed comparison with paid crowdsourcing in Section 6.7.)
6.4 The effect of targeting in advertising
A major hypothesis of our work is that the targeting system of existing advertising networks can be leveraged in order to identify competent users, who are willing to contribute new knowledge by answering quiz questions. We observe that users recruited through advertising are knowledgeable and willing to contribute. However, it is not immediately obvious whether the positive result is due to targeting, or is simply the effect of bringing more users to the application.
In order to disentangle the effects of advertising and targeting, we ran two different advertising campaigns that both directed users to the same quiz. Both campaigns had the same budget, same ad creatives, same bidding settings, and their only differences were (1) the keywords used for the bidding and (2) the use of feedback to the advertising system. In the targeted campaign, we used keywords related to the topic of the quiz; for the untargeted campaign we used the keywords from all the quizzes available in the Quizz system. Also, in the untargeted campaign, we did not send feedback about conversions to avoid providing targeting information.
Figure 8 shows the results. While the number of visitors was roughly the same for the two campaigns, the targeted campaign had 3x higher conversion rate (34.62% vs. 13.43%). Furthermore, among the participating (“converted”) users, the number of questions answered per user was 3x higher for the users who arrived from the targeted campaign. Thus, the cumulative difference was over 9.2x more answers obtained through the targeted campaign compared to the untargeted one (2866 answers vs. 279). Finally, the answers contributed by the users from the targeted campaign were of higher quality than the answers from the untargeted campaign: the total information gain for the targeted campaign was 11.4x higher than the total information gain for the untargeted one (7560 bits vs. 610 bits), indicating a higher user competence, even on a normalized, per-question basis.
6.5 The effect of using conversion optimizer
After verifying that targeting and feedback indeed improve the results, we wanted to examine the effect of using the conversion optimizer. While traditional ad campaigns usually optimize for clicks, the conversion optimizer of Google AdWords offers the option to optimize for the total “value” of the conversions (in our case, for the total information gain). To examine the usefulness of the conversion optimizer in our setting, we again ran two otherwise-identical ad campaigns: one being optimized for clicks, and the other being optimized for conversions.
The conversion rate increased by 30% when using the conversion optimizer (from 29% to 39%). In addition to that, the number of submitted answers went up by 42% (1683 vs. 1183), and the total information gain went up by 63% (4690 bits vs. 2870 bits). Furthermore, as Section 6.2 discusses, the optimization is ongoing and the conversion rate continues to go up even at the time of this writing. This automatic and continuous optimization of the process illustrates the benefits of leveraging existing, publicly available advertising platforms to improve the efficiency of crowdsourcing applications.
6.6 The effect of engagement incentives
To analyze the effect of the various incentive mechanisms, we examined how the different experimental conditions assigned to the users affected their participation and their overall contributions. To this end, we examined the effect of the various incentives on three variables of interest: the total number of submitted answers, the number of correct answers, and the (total information gain) score of the user. Since the dependent variables are always positive and behave like “count data,” we ran a Poisson regression with eight binary independent variables, where each of these variables corresponded to the presence (or absence) of an experimental condition. Specifically, we present results for the following incentive mechanisms:
showCorrect: Show the correct answer.
showCrowdAnswers: Show the percentage of other users who answered the question correctly.
showMessage: Show whether the given answer was correct.
showPercentageCorrect: Show the percentage of submitted answers (for that user) that were correct.
showTotalCorrect: Show the total number of correct answers submitted by that user.
showScore: Show the total information gain for the user (shown as a score).
showPercentageRank: Show the position of the user in the leaderboard, ranked by percentage of correct answers.
showTotalCorrectRank: Show the position of the user in the leaderboard, ranked by the total number of correct answers submitted.
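To make the interpretation of such regression coefficients concrete, here is a minimal sketch for the simplest case of a Poisson regression with a single binary indicator. With one indicator the model is saturated, so the maximum-likelihood fit reproduces the two group means and the coefficient is the log of their ratio. The counts below are made up for illustration and are not data from the study; a joint fit with all eight indicators would need an iterative solver from a statistics package, which is not shown here. The function name is hypothetical.

```python
import math


def poisson_indicator_fit(counts_without, counts_with):
    """Maximum-likelihood fit of log(mu) = b0 + b1 * x for one binary x.

    Because the model is saturated, exp(b0) is the mean count when the
    mechanism is absent and exp(b0 + b1) is the mean when it is present."""
    mean_without = sum(counts_without) / len(counts_without)
    mean_with = sum(counts_with) / len(counts_with)
    b0 = math.log(mean_without)
    b1 = math.log(mean_with) - b0
    return b0, b1


# Made-up answer counts per user (NOT data from the study).
answers_without_mechanism = [1, 2, 3]
answers_with_mechanism = [4, 6, 8]

b0, b1 = poisson_indicator_fit(answers_without_mechanism, answers_with_mechanism)
rate_ratio = math.exp(b1)  # multiplicative effect of enabling the mechanism

if __name__ == "__main__":
    print(round(b1, 4), round(rate_ratio, 4))
```

A positive coefficient therefore corresponds to a multiplicative increase in the expected count (exp(b1) greater than one), and a negative coefficient to a decrease.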
Table 2 summarizes the results, and shows the coefficients computed for each mechanism by the regression model. Showing the correct answer (showCorrect) has the strongest impact in increasing participation, as it has strong positive effect across all three dependent variables, indicating that users want to know what the correct answer is. Interestingly, knowing whether they were correct or not (showMessage) does not have a similarly strong effect. These results indicate that users may be more interested in learning about the topic rather than just knowing whether they answered correctly.
Experimenting with the performance-related incentives (i.e., showPercentageCorrect, showTotalCorrect, showScore) generated some interesting observations. Showing the percentage of correct answers did not have a statistically significant effect on answer counts, but had a slightly positive effect on the total information gain. Showing the total number of correct answers generated an interesting effect: while both total and correct answers went up, the total information gain was affected negatively. It appears that non-competent users were also encouraged to participate more, leading to a decrease in the overall answer quality. Not surprisingly, when we show the total information gain as a score to the user, this effect disappears, and we observe positive outcomes across all variables.
Finally, the competitive incentives (i.e., showCrowdAnswers, showPercentageRank, showTotalCorrectRank) revealed an interesting pattern in user behavior: knowing the performance of other users has a positive effect on participation, which indicates that users are interested in how they fare against others. However, displaying leaderboards had a generally negative effect across all variables. Interestingly, if we examine the effect of leaderboards in the early stages of the application, we see a very strong positive effect on participation. Our hypothesis to explain these contradictory observations is the following. Early on, the leaderboards are sparsely populated, and it is relatively easy for users to climb and reach some of the top positions. However, as more and more users participate, the achievements of the top users become difficult to match, effectively discouraging users from trying harder. To test our hypothesis, we ran a small experiment in which the leaderboard was computed based on participation during the last week, as opposed to showing an all-time leaderboard. The results indicated that the “last-week” leaderboard, with fewer and less impressive achievements, indeed had a positive effect on participation compared to the “all-time” leaderboard. This indicates that users are motivated by the potential for achievement, and that showing users that they can reach an achievement relatively easily can help with participation.
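The difference between the two leaderboard variants can be sketched as follows. This is a minimal illustration, assuming that each answer is logged as a (user, timestamp, is_correct) record and that users are ranked by their number of correct answers; the function name `leaderboard`, the record layout, and the sample data are hypothetical and not taken from the deployed application.

```python
from collections import Counter
from datetime import datetime, timedelta

def leaderboard(answers, now, window=None, top=3):
    """Rank users by correct answers; window=None gives the all-time board."""
    cutoff = now - window if window is not None else None
    totals = Counter()
    for user, timestamp, is_correct in answers:
        if is_correct and (cutoff is None or timestamp >= cutoff):
            totals[user] += 1
    # Sort by descending count, breaking ties alphabetically for stability.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:top]

# Hypothetical log: a long-time user with many old answers, and a newcomer.
answers = [("veteran", datetime(2013, 5, day), True) for day in range(1, 6)]
answers += [
    ("veteran", datetime(2013, 6, 28), True),
    ("newcomer", datetime(2013, 6, 27), True),
    ("newcomer", datetime(2013, 6, 29), True),
    ("newcomer", datetime(2013, 6, 29), False),
]
now = datetime(2013, 6, 30)

all_time = leaderboard(answers, now)
last_week = leaderboard(answers, now, window=timedelta(days=7))
print("all-time: ", all_time)
print("last week:", last_week)
```

In this toy log the newcomer cannot catch up on the all-time board, but leads the last-week board after only two correct answers, which is the kind of reachable achievement the hypothesis refers to.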
6.7 Comparison with paid crowdsourcing
Finally, we wanted to compare the performance of our approach against a pure paid-crowdsourcing setting. To this end, we hired workers through Mechanical Turk and paid them 5 cents per question (i.e., piecemeal payment), with an extra bonus that depended on their total score (information gain) at the end. Similarly, we hired workers via oDesk, paid them on an hourly basis (ranging from $5/hr to $15/hr, depending on their asking price), and indicated that they would receive an additional payment based on their overall score. Figure 9 summarizes the results. Our key observation is that the workers hired through paid crowdsourcing platforms are usually not experts in the topic of the quiz, and are therefore not sufficiently knowledgeable to provide high-quality answers. However, unlike the unpaid users, the paid workers have an obvious monetary incentive to continue working, and so we did not observe the self-selection dropout effect for the paid workers. The paid workers continue submitting low-quality answers, and this finding holds for both piecemeal and hourly payments.
While there are some competent workers among the paid participants, the total information gain from the competent paid workers is still significantly lower than the information gain from unpaid users, resulting in significantly lower capacities. The best paid worker had a 68% quality for the quiz and submitted 40 answers, resulting in an equivalent capacity of 13 answers at 99% accuracy, or 23 answers at 90% accuracy. To match the performance of the unpaid users, the worker should be paid 5 cents per question, or $3/hr, taking into account that the average time per question for the paid users is one minute. (For comparison, unpaid users are much faster and typically give an answer within 10 seconds, signaling that they are already knowledgeable about the topic of the quiz and do not perform research to answer the questions.) Given that all other workers demonstrated worse metrics, it is clear that unpaid, volunteer users dominate. A potential solution is to experiment with negative incentives (e.g., “you will not get paid unless you achieve this quality score”), keeping away the low performers and retaining just the top workers. However, it is not clear how we can reach these high-quality workers in a labor market, other than by posting the task and then hoping that the competent workers will participate. Potentially, labor marketplaces can employ targeting schemes similar to the ones we implemented using online advertising, but today we are not aware of any marketplace offering such functionality.
7 Related Work
Recent work [2, 11] built models of how badges and leaderboards should be designed to engage users and steer their behavior towards actions that are beneficial for the system. Our work empirically tests some of these models, and our experimental results dovetail with the suggestions of these models. Other models of user engagement have examined which metrics and measurements capture the user's level of engagement [23, 5, 6, 10, 26]. Our analysis of engagement focuses mainly on web analytics measurements, without trying to interact further with the participating users, although this is a promising direction for future work.
In our work, we explicitly assess the competence of users with calibration questions. Alternatively, we can use unsupervised techniques that estimate the competence of users through redundancy. Dawid and Skene presented an EM algorithm to estimate the quality of the participants in the absence of known ground truth, and a large number of recent papers examined the same topic [30, 39, 37], significantly improving the state of the art. Closer to our work, Kamar et al. also use a Markov Decision Process in order to decide whether the answers provided by a user are promising enough to warrant a hiring decision. In the future, we plan to use these algorithms for quality inference together with our exploration/exploitation approach, to decide optimally how to combine assessment with knowledge acquisition. A key challenge is being able to provide immediate feedback to the users when the questions have no certain answer.
Optimal acceptance sampling plans in quality control [9, 38, 7, 32] are another related line of work. The purpose of acceptance sampling is to determine how much to sample a production line in order to decide whether to accept or reject a production lot. The key difference in our setting is the limited lifetime of the users (as opposed to the significantly higher production capacity in industrial production), so our planning needs to be much more dynamic than in most use cases of acceptance sampling.
We presented a model for targeting and engaging unpaid users in a crowdsourcing application. We demonstrated how to use existing Internet advertising platforms to identify niche audiences of competent users for the task at hand, and we showed that using publicly available ad-optimization tools can result in significant improvements in the effectiveness of the process. Currently, our application has a 50% conversion rate for every ad click, and the cost per answer drops systematically over time, as the advertising system learns to identify competent users that are likely to be high contributors. The engagement of unpaid users alleviates concerns about the incentives of paid users, which are not always well-aligned with the goals of the crowdsourcing application. Furthermore, our algorithms and controlled real-life experiments with over ten thousand users illustrate how to set up incentive mechanisms in practice to engage users and extend their “lifetime” in the system. Finally, our experiments indicate that even though there are costs associated with advertising, the quality-adjusted costs are on par with those of paid crowdsourcing. (Moreover, for non-profits engaging, for example, in citizen science efforts, there are ways to get a substantial advertising budget using programs such as “Google Ad Grants for nonprofits” (http://www.google.com/grants/), which offers $10,000 per month in in-kind advertising budget.) We believe that our ad-based approach can form the foundation for more predictable deployment and engagement of unpaid users in crowdsourcing efforts, combining the advantage of engaging unpaid users with the predictability of paid crowdsourcing.
We would like to thank Kevin Murphy and Chun How Tan for helpful discussions and suggestions.
Appendix A Variance of Information Gain
When the quality of a user is uncertain, the information gain for each question is also uncertain. Under the assumption that the probability that a user answers a question correctly is the same across all questions, and that the prior is a uniform distribution, the variance of the information gain distribution is given by:
where is the digamma function, is the trigamma function, is the number of options presented to the user, is the number of correct, and is the number of incorrect answers submitted by the user .
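As a numerical cross-check for a closed-form variance of this kind, the variance can also be estimated by simulation. The sketch below is not the expression above. It assumes that the information gain per answer, for a user with accuracy q on questions with n options, is log2(n) + q log2(q) + (1 - q) log2((1 - q)/(n - 1)), and that a uniform prior with the observed correct and incorrect counts gives a Beta(correct + 1, incorrect + 1) posterior on q. The function names and the example counts are hypothetical.

```python
import math
import random

def information_gain(q, n):
    """Bits gained per answer for accuracy q and n options (assumed form)."""
    gain = math.log2(n)
    if q > 0:
        gain += q * math.log2(q)
    if q < 1:
        gain += (1 - q) * math.log2((1 - q) / (n - 1))
    return gain

def ig_mean_and_variance(correct, incorrect, n, samples=20_000, seed=0):
    """Monte Carlo mean and variance of the information gain under the posterior."""
    rng = random.Random(seed)
    # Uniform prior plus observed counts gives a Beta(correct + 1, incorrect + 1) posterior.
    draws = [
        information_gain(rng.betavariate(correct + 1, incorrect + 1), n)
        for _ in range(samples)
    ]
    mean = sum(draws) / samples
    variance = sum((d - mean) ** 2 for d in draws) / (samples - 1)
    return mean, variance

# Hypothetical user: 27 correct and 13 incorrect answers on 4-option questions.
mean, variance = ig_mean_and_variance(27, 13, 4)
print(f"mean information gain: {mean:.3f} bits, variance: {variance:.4f}")
```

Under these assumptions, the variance shrinks as more answers are observed, since the posterior on the user's accuracy becomes more concentrated.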
-  Abhishek, V., and Hosanagar, K. Keyword generation for search engine advertising using semantic similarity between terms. In Proceedings of the ninth international conference on Electronic commerce (2007), ACM, pp. 89–94.
-  Anderson, A., Huttenlocher, D., Kleinberg, J., and Leskovec, J. Steering user behavior with badges. In Proceedings of the 22nd international conference on World Wide Web (2013), International World Wide Web Conferences Steering Committee, pp. 95–106.
-  Archer, E., Park, I. M., and Pillow, J. W. Bayesian entropy estimation for countable discrete distributions. CoRR abs/1302.0328 (2013).
-  Ariely, D. Predictably irrational: The hidden forces that shape our decisions. HarperCollins, 2009.
-  Attfield, S., Kazai, G., Lalmas, M., and Piwowarski, B. Towards a science of user engagement (position paper). In WSDM Workshop on User Modelling for Web Applications (2011).
-  Baeza-Yates, R., and Lalmas, M. User engagement: the network effect matters! In Proceedings of the 21st ACM international conference on Information and knowledge management (2012), ACM, pp. 1–2.
-  Berger, R. L. Multiparameter hypothesis testing and acceptance sampling. Technometrics 24, 4 (1982), 295–300.
-  Dawid, A. P., and Skene, A. M. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics 28, 1 (Sept. 1979), 20–28.
-  Dodge, H. F. Notes on the Evolution of Acceptance Sampling. American Society for Quality Control, 1973.
-  Dupret, G., and Lalmas, M. Absence time and user engagement: evaluating ranking functions. In Proceedings of the sixth ACM international conference on Web search and data mining (2013), ACM, pp. 173–182.
-  Easley, D., and Ghosh, A. Incentives, gamification, and game theory: an economic approach to badge design. In Proceedings of the fourteenth ACM conference on Electronic commerce (2013), EC ’13, ACM, pp. 359–376.
-  Embretson, S. E., and Reise, S. P. Item response theory. Psychology Press, 2000.
-  Fuxman, A., Tsaparas, P., Achan, K., and Agrawal, R. Using the wisdom of the crowds for keyword generation. In Proceedings of the 17th international conference on World Wide Web (2008), ACM, pp. 61–70.
-  Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. Bayesian data analysis. CRC press, 2003.
-  Genest, C., and Zidek, J. V. Combining probability distributions: A critique and an annotated bibliography. Statistical Science 1, 1 (1986), 114–135.
-  Gneezy, U., and Rustichini, A. Pay enough or don’t pay at all. The Quarterly Journal of Economics 115, 3 (2000), 791–810.
-  Hoffmann, R., Amershi, S., Patel, K., Wu, F., Fogarty, J., and Weld, D. S. Amplifying community content creation with mixed initiative information extraction. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (2009), ACM, pp. 1849–1858.
-  Horton, J. Online labor markets. Internet and Network Economics (2010), 515–522.
-  Ipeirotis, P. Demographics of mechanical turk. Tech. rep., New York University, 2010. Available at http://ssrn.com/abstract=1585030.
-  Joshi, A., and Motwani, R. Keyword generation for search engine advertising. In Data Mining Workshops, 2006. ICDM Workshops 2006. Sixth IEEE International Conference on (2006), IEEE, pp. 490–496.
-  Kamar, E., Hacker, S., and Horvitz, E. Combining human and machine intelligence in large-scale crowdsourcing. In AAMAS (2012), International Foundation for Autonomous Agents and Multiagent Systems, pp. 467–474.
-  Kuznetsov, S. Motivations of contributors to wikipedia. ACM SIGCAS computers and society 36, 2 (2006), 1.
-  Lehmann, J., Lalmas, M., Yom-Tov, E., and Dupret, G. Models of user engagement. In User Modeling, Adaptation, and Personalization. Springer, 2012, pp. 164–175.
-  Li, S.-M., Mahdian, M., and McAfee, R. P. Value of learning in sponsored search auctions. In Internet and Network Economics. Springer, 2010, pp. 294–305.
-  Malone, T. W. Toward a theory of intrinsically motivating instruction. Cognitive science 5, 4 (1981), 333–369.
-  McCay-Peet, L., Lalmas, M., and Navalpakkam, V. On saliency, affect and focused attention. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (2012), ACM, pp. 541–550.
-  Murphy, K. P. Machine learning: a probabilistic perspective. The MIT Press, 2012.
-  Nov, O. What motivates wikipedians? Communications of the ACM 50, 11 (2007), 60–64.
-  Puterman, M. L. Markov decision processes: Discrete stochastic dynamic programming, vol. 414. Wiley, 2009.
-  Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin, C., Bogoni, L., and Moy, L. Learning from crowds. The Journal of Machine Learning Research 99 (2010), 1297–1322.
-  Robertson, S., Vojnovic, M., and Weber, I. Rethinking the esp game. In CHI’09 Extended Abstracts on Human Factors in Computing Systems (2009), ACM, pp. 3937–3942.
-  Schilling, E. G. Acceptance Sampling in Quality Control, vol. 42. CRC Press, 1982.
-  Von Ahn, L., and Dabbish, L. Labeling images with a computer game. In Proceedings of the SIGCHI conference on Human factors in computing systems (2004), ACM, pp. 319–326.
-  Von Ahn, L., Maurer, B., McMillen, C., Abraham, D., and Blum, M. Recaptcha: Human-based character recognition via web security measures. Science 321, 5895 (2008), 1465–1468.
-  Wang, J., Ipeirotis, P., and Provost, F. Quality-based pricing for crowdsourced workers. Tech. rep., New York University, 2013. Available at papers.ssrn.com/abstract=2283000.
-  Waterhouse, T. P. Pay by the bit: an information-theoretic metric for collective human judgment. In Proceedings of the 2013 conference on Computer supported cooperative work (2013), ACM, pp. 623–638.
-  Welinder, P., Branson, S., Belongie, S., and Perona, P. The multidimensional wisdom of crowds. Advances in Neural Information Processing Systems 23 (2010), 2424–2432.
-  Wetherill, G., and Chiu, W. A review of acceptance sampling schemes with emphasis on the economic aspect. International Statistical Review/Revue Internationale de Statistique (1975), 191–210.
-  Whitehill, J., Wu, T.-f., Bergsma, J., Movellan, J. R., and Ruvolo, P. L. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in neural information processing systems (2009), pp. 2035–2043.
-  Yang, H.-L., and Lai, C.-Y. Motivations of wikipedia content contributors. Computers in Human Behavior 26, 6 (2010), 1377–1383.