Microtask crowdsourcing is gaining popularity among corporate and research communities as a means to leverage parallel human computation for extremely large problems [thomee2015new, bernstein2010soylent, bernstein2011crowds, lasecki2011real, lasecki2012real]. These communities use crowd work to complete hundreds of thousands of tasks per day [marcuswaran], from which new datasets with over million annotations can be produced within a few months [krishnavisualgenome]. A crowdsourcing platform like Amazon’s Mechanical Turk (AMT) is a marketplace subject to human factors that affect its performance, both in terms of speed and quality [difallah2015dynamics]. Prior studies found that work division in crowdsourcing follows a Pareto principle, where a small minority of workers usually completes a great majority of the work [little8020]. If such large crowdsourced projects are being completed by a small percentage of workers, then these workers spend hours, days, or weeks executing the exact same tasks. Consequently, we pose the question:
How does a worker’s quality change over time?
Multiple arguments from previous literature in psychology suggest that quality should decrease over time. Fatigue, a temporary decline in cognitive or physical condition, can gradually result in performance drops over long periods of time [perelli1980fatigue, boksem2008mental, krueger1989sustained]. Since the microtask paradigm in large scale crowdsourcing involves monotonous sequences of repetitive tasks, fatigue buildup can pose a potential problem to the quality of submitted work over time [dai2015and]. Furthermore, workers have been noted to be “satisficers” who, as they gain familiarity with the task and its acceptance thresholds, strive to do the minimal work possible to achieve these thresholds [simon1972theories, chandler2013risks].
To study these long term effects on crowd work, we analyze worker trends over three different real-world, large-scale datasets [krishnavisualgenome] collected from microtasks on AMT: image descriptions, question answering, and binary verifications. With microtasks comprising over of the total crowd work and microtasks involving images being the most common type [pew2016], these datasets cover a large percentage of the type of crowd work most commonly seen. Specifically, we use over million image descriptions from workers over a month span, million question-answer pairs from workers over a month span, and million verifications from workers over a month span. The average worker in the largest dataset worked for an average of eight-hour work days while the top of workers worked for nearly eight-hour work days. Using these datasets, we look at temporal trends in the accuracy of annotations from workers, diversity of these annotations, and the speed of completion.
Contrary to our hypothesis that workers would exhibit glaring signs of fatigue via large declines in submission quality over time, we find that workers who complete large sets of microtasks maintain a consistent level of quality (measured as the percentage of correct annotations). Furthermore, as workers become more experienced on a task, they develop stable strategies that do not change, enabling them to complete tasks faster. But are workers generally consistent or is this consistency simply a product of the task design?
We thus perform an experiment where we hire workers from AMT to complete large-scale tasks while randomly assigning them into different task designs. These designs were varied across two factors: the acceptance threshold with which we accept or reject work, and the transparency of that threshold. If workers manipulate their quality level strategically to avoid rejection, workers with a high (difficult) threshold would perform at a noticeably better level than the ones with a low threshold who can satisfice more aggressively. However, this effect might only be easily visible if workers have transparency into how they performed on the task.
By analyzing annotations collected from workers in the experiment on AMT, we found that workers display consistent quality regardless of their assigned condition, and that lower-quality workers in the high threshold condition would often self-select out of tasks where they believe there is a high risk of rejection. Bolstered by this consistency, we ask: can we predict a worker’s future quality months after they start working on a microtask?
If individual workers indeed sustain constant correctness over time, then, intuitively, any subset of a worker’s submissions should be representative of their entire work. We demonstrate that a simple glimpse of a worker’s quality in their first few tasks is a strong predictor of their long-term quality. Simply averaging the quality of work of a worker’s first completed tasks can predict that worker’s quality during the final of their completed tasks with an average error of .
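The prediction scheme described above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the paper: the function names, the `first_k` and `final_fraction` parameters, and the sample accuracies are all hypothetical, since the exact values are not stated in this text.

```python
from statistics import mean


def predict_from_first_tasks(task_accuracies, first_k):
    """Estimate long-term quality as the mean accuracy of the first `first_k` tasks."""
    if not 1 <= first_k <= len(task_accuracies):
        raise ValueError("first_k must be between 1 and the number of tasks")
    return mean(task_accuracies[:first_k])


def prediction_error(task_accuracies, first_k, final_fraction):
    """Absolute gap between the early estimate and the mean accuracy
    over the final `final_fraction` of the worker's tasks."""
    n_final = max(1, round(len(task_accuracies) * final_fraction))
    actual = mean(task_accuracies[-n_final:])
    return abs(predict_from_first_tasks(task_accuracies, first_k) - actual)


# Hypothetical per-task accuracies for one worker, in completion order.
history = [0.9, 1.0, 0.8, 0.9, 1.0, 0.9, 0.9, 1.0, 0.8, 0.9]
print(prediction_error(history, first_k=5, final_fraction=0.1))
```

Averaging is deliberately the simplest possible estimator; if workers are as consistent as the analysis suggests, a more elaborate model should add little.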
Long-term worker consistency suggests that paying attention to easy signals of good workers can be key to collecting a large dataset of high quality annotations [mitra2015comparing, rzeszotarski2012crowdscape]. Once we have identified these workers, we can back off the gold-standard (attention check) questions to ensure good quality work, since work quality is unvarying [liueffective]. We can also be more permissive about errors from workers known to be good, reducing the rejection risk that workers face and increasing worker retention [difallah2014scaling, law2016curiosity].
2 Related Work
Our work is inspired by psychology, decision making, and workplace management literature that focuses on identifying the major factors that affect the quality of work produced. Specifically, we look at the effects of fatigue and satisficing in the workplace. We then study whether these problems transfer to the crowdsourcing domain. Next, we explore how our contributions are necessary to better understand the global ecosystem of crowdsourcing. Finally, we discuss the efficacy of existing worker quality improvement techniques.
Repeatedly completing the same task over a sustained period of time will induce fatigue, which increases reaction time, decreases production rate, and is linked to a rise in poor decision-making [krueger1989sustained, wyatt1937fatigue]. The United States Air Force found that both the cognitive performance and physical conditions of its airmen continually deteriorated during the course of long, mandatory shifts [perelli1980fatigue]. However, unlike these mandatory, sustained shifts, crowdsourcing is generally opt-in for workers: there always exists the option for workers to take a break or find another task whenever they feel tired or bored [lasecki2014using, lasecki2015effects]. Nonetheless, previous work has shown that people cannot accurately gauge how long they need to rest after working continuously, resulting in incomplete recoveries and drops in task performance after breaks [hennfng1989microbreak]. Ultimately, previous work on fatigue suggests that crowd workers who continuously complete tasks over sustained periods would exhibit significant decreases in work quality. We show that, contrary to this literature, crowd workers remain consistent throughout their time on a specific task.
Crowd workers are often regarded as “satisficers” who do the minimal work needed for their work to be accepted [simon1972theories, chandler2013risks]. Examples of satisficing in crowdsourcing occur during surveys [krosnick1991response] and when workers avoid the most difficult parts of a task [mason2010financial]. Disguised attention checks in the instructions [oppenheimer2009instructional] or rate-limiting the presentation of the questions [kapelner2010preventing] improve the detection and prevention of satisficing. Previous studies of crowd workers’ perspectives find that crowd workers believe themselves to be genuine workers, monitoring their own work and giving helpful feedback to requesters [mcinnis2016taking]. Workers have also been shown to respond well and produce high quality work if the task is designed to be effort-responsive [ho2015incentivizing]. However, workers often consider the cost-benefit of continuing to work on a particular task: if they feel that a task is too time-consuming relative to its reward, then they often drop out or compensate by satisficing (e.g. reducing quality) [mcinnis2016taking]. We observe that satisficing does occur, but it only affects a small portion of long-term workers. We also observe in our experiments that workers opt out of tasks where they feel they have a high risk of rejection.
2.3 The global crowdsourcing ecosystem
With the rapidly growing size of crowdsourcing projects, workers now have the opportunity to undertake large batches of tasks. As they progress through these tasks, questions arise and they often seek help by communicating with other workers or the task creator [martin2014being]. Furthermore, on external forums and in collectives, workers often share well-paying work opportunities, teach and learn from other workers, review requesters, and even consult with task creators to give constructive feedback [martin2014being, irani2013turkopticon, salehi2015we, mcinnis2016taking]. When considering this crowdsourcing ecosystem, crowd researchers often envision how more complex workflows can be integrated to make the overall system more efficient and fair, and to allow for a wider range of tasks [kittur2013future]. To continue the trend towards a more complex, but more powerful, crowdsourcing ecosystem, it is imperative that we study the long-term trends of how workers operate within it. Our paper seeks to identify trends that occur as workers continually complete tasks over a long period of time. We conclude that crowdsourcing workflows should include methods to identify good workers and allow them to complete tasks with a low threshold for acceptance, since good workers work consistently hard regardless of the acceptance criteria.
2.4 Improving crowdsourcing quality
External checks such as verifiable gold standards, requiring explanations, and majority voting are standard practice for reducing bad answers and quality control [kittur2008crowdsourcing, callison2009fast]. Other methods directly estimate worker quality to improve these external checks [ipeirotis2010quality, whitehill2009whose]. Giving external feedback or having crowd workers internally reflect on their prior work also has been shown to yield better results [dow2012shepherding]. Previous work directly targets the monotony of crowdsourcing, showing that by framing the task as more meaningful to workers (for example as a charitable cause), one obtains higher quality results [chandler2013breaking]. However, this framing study only had workers do each task a few times and did not observe long-term trends. We, on the other hand, explore the changes in worker quality on microtasks that are repeated by workers over long periods of time.
3 Analysis: Long-Term Crowdsourcing Trends
In this section, we perform an analysis of worker behavior over time on large-scale datasets of three machine learning labeling tasks: image descriptions, question answering, and binary verification. We examine common trends, such as worker accuracy and annotation diversity over time. We then use our results to answer whether workers are fatiguing or displaying other decreases in effectiveness over time.
We first describe the three datasets that we inspect. Each of the three tasks was priced such that workers could earn per hour and was only available to workers with a approval rating and who live in the United States. For the studies in this paper, workers were tracked by their AMT worker IDs. The tasks and interfaces used to collect the data are described in further detail in the Visual Genome paper [krishnavisualgenome].
Image descriptions. An image description is a phrase or sentence associated with a certain part of an image. To complete this task, a worker looks at an image, clicks and drags to select an area of the image, and then describes it using a short textual phrase (e.g., “The dog is jumping to catch the frisbee”). Each image description task requires a worker to create unique descriptions for one randomly selected image, averaging at least words per description. Workers were asked to keep the descriptions factual and avoid submitting any speculative phrases or sentences. We estimate that each task takes around minutes and we allotted hours such that workers did not feel pressured for time. In total, image descriptions were collected from workers over months.
Question answers. Each question answering task asks a worker to write questions and their corresponding answers per image for different, randomly selected images. Workers were instructed to begin each sentence with one of the following questions: who, what, when, where, why and how [kuhn2013political]. Furthermore, to ensure diversity of question types, workers were asked to write a minimum of of these question types. Workers were also instructed to be concise and unambiguous to avoid wordy and speculative questions. Each task takes around minutes and we allotted hours such that workers did not feel pressured for time. In total, question-answer pairs were generated by workers over months.
Binary verifications. Verification tasks were quality control tasks: given an image and a question-answer pair, workers were asked if the question was relevant to the image and if the answer accurately responded to the question. The majority decision of 3 workers was used to determine the accuracy of each question answering pair. For each verification task, a worker voted on randomly-ordered question-answer pairs. Each task takes around minutes and we allotted hour such that workers did not feel pressured for time. In total, votes were cast by workers over months.
Overall. Figure 1 shows the distribution of how many tasks workers completed over the span of the data collection period, while Table 1 outlines the total number of annotations and tasks completed. The top of workers who completed the most tasks did , , and of the total work in each of the three datasets respectively. These distributions are similar to the standard Pareto - rule [little8020], clearly demonstrating that a small but persistent minority of workers completes an extremely large number of similar tasks. We noticed that workers in the top each completed approximately of their respective datasets, with image description tasks, question answering tasks, and verification tasks completed on average. If each of these workers in the top took minutes for image descriptions and question answering tasks and minutes for verification tasks, the estimated average work time equates to , and eight-hour work days for each task respectively. This sheer workload demonstrates that workers may work for very long periods of time on the same task. Additionally, workers, on average, completed at least one task per week for weeks. By the final week of the data collection, about of the workers remained working on the tasks, suggesting that our study captures the entire lifetime of many of these workers.
We focus our attention on workers who completed at least tasks during the span of the data collection. The completion time for tasks is approximately hours for image description and question answering tasks and hours for verification tasks. We find that , , and workers completed of the image description, question answering, and verification tasks respectively. The median worker in each task type completed , , and tasks, which translates to , , and hours of continuous work. These workers also produced , and of each of the total annotations. These worker pools are relatively unique: there are shared workers between image descriptions and QA, shared workers between image description and verification, shared workers between question answering and verifications, and shared workers between all three tasks.
We reached out to the unique workers who had worked on at least tasks and asked them to complete a survey. After collecting responses, we found the gender distribution to be female, male, and other (Figure 2). Furthermore, we found that workers with ages - were the majority at of the long-term worker population. Ages -, -, and respectively comprised , and of the long-term worker population. The gender and age distribution of all workers closely aligns with previously gathered demographics on AMT [pew2016, difallah2014scaling, ross2010crowdworkers, krishnavisualgenome]. However, the distribution of long-term workers is skewed towards older and female workers.
3.2 Workers are consistent over long periods
We analyzed worker accuracy and annotation diversity over the entire period of time that they worked on these tasks. Because workers performed different numbers of tasks, we normalize time data to percentages of their total lifetime, which we define as the period from when a worker starts the task until they stop working on that task. For example, if one worker completed tasks and another completed tasks, then the halfway point in their respective lifetimes would be when they completed and tasks.
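The lifetime normalization above can be sketched as follows. This is an illustrative reading under stated assumptions: the function names, the choice of ten buckets, and the example task counts (100 and 400) are hypothetical and are not taken from the datasets.

```python
def lifetime_fraction(task_index, total_tasks):
    """Return how far through a worker's lifetime the given task falls.

    `task_index` is 1-based, so the worker's last task maps to 1.0.
    """
    if not 1 <= task_index <= total_tasks:
        raise ValueError("task_index must be between 1 and total_tasks")
    return task_index / total_tasks


def bucket_by_lifetime(values, n_buckets=10):
    """Average a worker's per-task values within equal-width lifetime buckets.

    Buckets that receive no tasks (a worker with fewer tasks than buckets)
    are reported as None.
    """
    sums = [0.0] * n_buckets
    counts = [0] * n_buckets
    total = len(values)
    for i, value in enumerate(values):
        bucket = i * n_buckets // total  # 0-based task position -> bucket index
        sums[bucket] += value
        counts[bucket] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Hypothetical workers with 100 and 400 tasks reach their halfway point
# at tasks 50 and 200 respectively.
print(lifetime_fraction(50, 100), lifetime_fraction(200, 400))
```

Bucketing by lifetime fraction lets workers with very different task counts be averaged on a common axis.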
Annotation accuracy. A straightforward metric of quality is the percentage of microtasks that are correct. To determine accuracy for an image description or question answering task, we computed the percentage of descriptions or question-answer pairs deemed true by a majority vote made by other workers. However, to use this majority vote in a metric, we need to first validate that this verification process is repeatable and accurate. Since the ground truth of verification tasks is unknown at such a large scale, we need a method to estimate the accuracy of each verification decision. We believe that comparing a worker’s vote against the majority decision is a good approximation of accuracy. To test accuracy, we randomly sampled a set of descriptions and question answers and manually compared our own verifications against the majority vote, which resulted in a match. To test repeatability, we randomly sampled a set of descriptions and question answers to be sent back to be voted on by new workers months after the initial dataset was collected. Ultimately, we found a similarity between the majority decision of this new verification process and the original decision reported in the dataset [krishnavisualgenome]. The result of this test indicates that the majority decision is both accurate and repeatable, making it a good standard to compare against.
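The two accuracy measures described above can be sketched as follows. This is a minimal illustration assuming each annotation receives an odd number of boolean votes (three in the verification task), so ties cannot occur; the function names and sample votes are hypothetical.

```python
from collections import Counter


def majority_decision(votes):
    """Return the most common vote; assumes an odd number of binary votes."""
    return Counter(votes).most_common(1)[0][0]


def annotation_accuracy(votes_per_annotation):
    """Fraction of a worker's annotations that the majority judged correct."""
    decisions = [majority_decision(v) for v in votes_per_annotation]
    return sum(decisions) / len(decisions)


def agreement_with_majority(worker_votes, votes_per_annotation):
    """Fraction of a verifier's votes that match the majority decision."""
    matches = [
        own == majority_decision(votes)
        for own, votes in zip(worker_votes, votes_per_annotation)
    ]
    return sum(matches) / len(matches)


# Hypothetical votes on three annotations written by one worker.
votes = [[True, True, False], [False, False, True], [True, True, True]]
print(annotation_accuracy(votes))
print(agreement_with_majority([True, True, True], votes))
```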
We find that workers change very little over time (Figures 3 and 4). When considering those who did at least 100 image description tasks, people on average started at accuracy and ended at , averaging an absolute change of . Workers who did at least question answering tasks started with an average of and ended at , resulting in an absolute change of . For the verification task, workers agreed with the majority on average at the start and at the end, resulting in an absolute change of .
Accuracy captures clearly correct or incorrect outcomes, but what about subtler signals of effort level? Since each image description or question answering task produces multiple phrases or questions, we examine the linguistic similarity of these phrases and questions over time. As N-grams have often been used in language processing for gauging similarity between documents [damashek1995gauging], we construct a metric of syntax diversity for a set of annotations as follows:
As the annotation set increasingly contains different words and ordering of words, this diversity metric approaches because the number of unique N-grams will approach the total possible N-grams. Conversely, if the annotation set contains increasingly similar annotations, many N-grams will be redundant, making this diversity metric approach . To account for workers reusing similar sentence structure in consecutive tasks, we track the number of unique N-grams versus total N-grams in sequential pairs of tasks.
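One way to compute a diversity score consistent with this description is the ratio of unique N-grams to total N-grams. The sketch below assumes that ratio, along with lowercased whitespace tokenization; neither detail is confirmed by the text, and the function names are hypothetical.

```python
def ngrams(text, n=2):
    """Split text on whitespace and return its word N-grams as tuples."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def diversity(annotations, n=2):
    """Ratio of unique N-grams to total N-grams across a set of annotations."""
    grams = [g for annotation in annotations for g in ngrams(annotation, n)]
    if not grams:
        return 0.0
    return len(set(grams)) / len(grams)


def sequential_pair_diversity(tasks, n=2):
    """Diversity over each consecutive pair of tasks.

    `tasks` is a list of tasks, each a list of annotation strings.
    """
    return [diversity(tasks[i] + tasks[i + 1], n) for i in range(len(tasks) - 1)]


# Repeating a phrase halves the score; distinct phrases keep it at 1.0.
print(diversity(["the dog runs", "the dog runs"]))
print(diversity(["the dog runs", "a cat sleeps"]))
```

Scoring consecutive pairs of tasks, rather than single tasks, is what exposes a worker who reuses the same sentences from one image to the next.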
Figure 5 illustrates that the percentage of unique bigrams decreases slightly over time. In the image description task, the percent of unique bigrams decreases on average from to between the start and end of a worker’s lifetime. Since there are bigrams on average per phrase, a worker writes approximately total bigrams per task. Thus, a decrease in results in a loss of unique bigrams per task. In the question answering task, the percent of unique bigrams decreases on average from to . As there are on average bigrams per question, this decrease would cost a loss of distinct bigrams per task. Ultimately, these results show that over the course of a worker’s lifetime, only a small fraction of diversity is lost, as less than a sentence or question’s contribution of bigrams is lost.
A majority of workers stay constant during their lifetime. However, a few workers decrease to an extremely low N-gram diversity, despite writing factually correct image descriptions and questions. This behavior describes a “satisficing” worker, as they repeatedly write the same types of sentences or questions that generalize to almost every image. Figure 6 demonstrates how a satisficing worker’s phrase diversity decreases from image-specific descriptions submitted in early-lifetime tasks to generic, repeated sentences submitted in late-lifetime tasks. To determine the percentage of total workers who are satisficing workers, we first compute the average diversity of submissions for each worker. We then set a threshold equal to the difference between the maximum and mean of these diversities, labeling workers whose average diversity falls below the mean by more than this threshold as satisficers. We find that approximately and of workers satisfice in the image description and question answering datasets respectively.
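The labeling rule can be read as a symmetric cutoff around the mean: the distance from the mean up to the maximum is mirrored below the mean. The sketch below follows that reading, which is an interpretation of the description rather than a confirmed implementation; the worker IDs and diversity values are hypothetical.

```python
from statistics import mean


def label_satisficers(avg_diversity_by_worker):
    """Return the set of workers whose average diversity lies below
    mean - (max - mean)."""
    values = list(avg_diversity_by_worker.values())
    mu = mean(values)
    threshold = max(values) - mu  # distance from the mean up to the maximum
    cutoff = mu - threshold       # mirror that distance below the mean
    return {w for w, d in avg_diversity_by_worker.items() if d < cutoff}


# Hypothetical average diversities: mean 0.76, max 0.90, so the cutoff is 0.62.
workers = {"w1": 0.90, "w2": 0.85, "w3": 0.88, "w4": 0.87, "w5": 0.30}
print(label_satisficers(workers))
```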
Annotation speed. We recorded the time it takes on average for workers to complete a single verification. We removed of the data points deemed as outliers from this computation, as workers will infrequently take longer times during a break or while reading the instructions. We defined outliers for each task of 50 verifications as times that fall outside 3 standard deviations of the mean time for those 50 verifications. Overall, Figure 7 demonstrates that workers indeed get faster over time. Initially, workers start off taking seconds per verification task, but end up averaging under seconds per task, resulting in an approximate speedup. Although no time data was recorded for either the image descriptions or question answering tasks, we believe that they would also exhibit similar speedups over time due to practice effects [newell1981mechanisms] and similarities in the correctness and diversity metrics.
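The outlier rule can be sketched as below. The text does not say whether the population or sample standard deviation was used; this sketch assumes the population form, and the function names and timing values are hypothetical.

```python
from statistics import mean, pstdev


def remove_outliers(times, n_sigma=3.0):
    """Keep only times within `n_sigma` standard deviations of the task mean."""
    mu = mean(times)
    sigma = pstdev(times)  # population standard deviation (an assumption)
    return [t for t in times if abs(t - mu) <= n_sigma * sigma]


def mean_time_per_verification(times):
    """Average seconds per verification after dropping outliers."""
    return mean(remove_outliers(times))


# Hypothetical task: 49 verifications at 2 seconds, one after a long break.
task_times = [2.0] * 49 + [100.0]
print(len(remove_outliers(task_times)))
print(mean_time_per_verification(task_times))
```

Because the mean and deviation are computed per task, a single long pause is dropped without affecting how other tasks are filtered.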
No significant fatigue effects are exhibited by long-term workers. Workers do not appear to suffer from long-term fatigue effects. With an insignificant average accuracy drop of for workers across their lifetime, we find that workers demonstrate little change in their submission quality. Instead of suffering from fatigue, workers may be opting out or taking breaks whenever they feel tired [dai2015and]. Furthermore, this finding agrees with previous literature that cumulative fatigue is not a major factor in quality drop [perelli1980fatigue].
Accuracy is constant within a task type, but varies across different task types. We attribute the similarity between the average accuracy of the question answering and verification tasks to their sequential relationship in the crowdsourcing pipeline. If the question-answer pairs are ambiguous or speculative, then the majority vote often becomes split, resulting in accuracy loss for both the question answering and verification tasks. Additionally, we notice the average accuracy for image descriptions is noticeably higher than the average accuracy for either the question answering or verification datasets. We believe this discrepancy stems from the question answering task’s instructions that ask workers to write at least four distinct types of W questions (e.g. “why”, “what”, “when”). Some question types such as “why” or “when” are often ambiguous for many images (e.g. “why is the man angry?”). Such questions are often marked as incorrect by other workers in the verification task. Furthermore, we also attribute the disparity between unique bigram percentage for the image description and question answering tasks to the question answering task’s instructions that asked workers to begin each question with one of the question types.
Experience translates to efficiency. Workers retain constant accuracy, and slightly reduce the complexity of their writing style. Combined, these findings suggest that workers find a general strategy that leads to acceptance and stick with it. Studies of practice effects suggest that a practiced strategy helps to increase worker throughput according to a power law [newell1981mechanisms]. This power law shape is clearly evident in the average verification speed, confirming that practice plays a crucial role in the worker speedup.
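The power law of practice models completion time as T(n) = a · n^(-b), where n counts practice trials. A simple way to check for this shape is a least-squares line fit in log-log space, sketched below. The fitting method and the synthetic timings are illustrative assumptions; the text does not state how the curve in the paper was fit.

```python
import math


def fit_power_law(times):
    """Fit T(n) = a * n**(-b) by least squares on (log n, log T).

    `times[i]` is the completion time of the (i + 1)-th trial; all times
    must be positive and at least two trials are required.
    Returns the pair (a, b).
    """
    xs = [math.log(n) for n in range(1, len(times) + 1)]
    ys = [math.log(t) for t in times]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return math.exp(intercept), -slope


# Synthetic timings generated from a known power law (a = 4.0, b = 0.3).
synthetic = [4.0 * n ** -0.3 for n in range(1, 51)]
a, b = fit_power_law(synthetic)
print(round(a, 3), round(b, 3))
```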
Overall findings. From an analysis of the three datasets, we found that fatigue effects are not significantly visible and that severe satisficing behavior only affects a very small proportion of workers. On average, workers maintain a similar quality of work over time, but also get more efficient as they gain experience with the task.
4 Experiment: Why Are Workers Consistent?
Examining the image descriptions, question answering, and verification datasets, we find that workers’ performance on a given microtask remains consistent, even if they do the task for multiple months. However, mere observation of this consistency does not give true insight into the reasons for its existence. Thus, we seek to answer the following question: do crowd workers satisfice according to the minimum quality necessary to get paid, or are they consistent regardless of this minimum quality?
To answer this question, we perform an experiment where we vary the quality threshold of work and the threshold’s visibility. If workers are stable, we would expect them to submit work that is either above or below the threshold, irrespective of what the threshold is. However, if workers satisfice according to the minimum quality expected, they would adjust the quality of their work based on the set threshold [simon1972theories, krosnick1991response].
If workers indeed satisfice, then the knowledge of this threshold and their own performance should make it easier to perfect satisficing strategies. Therefore, to adequately study the effects of satisficing, we vary the visibility of the threshold to workers as well. In one condition, we display workers’ current quality scores and the minimum quality score to be accepted, while the other condition only displays whether submitted work was accepted or rejected. To sum up, we vary the threshold and the transparency of this threshold to determine how crowd workers react to the same task, but with different acceptability criteria.
To study why workers are consistent, we designed a task where workers are presented with a series of randomly ordered binary verification questions. Each verification requires them to determine if an image description and its associated image part are correct. For example, in Figure 8, workers must decide if “the zebras have stripes” is a good description of a particular part of the image. They are asked to base their response solely on the content of the image and the semantics of the sentence. To keep the task simple, we asked workers to ignore whether the box was perfectly surrounding the image area being described. The tasks were priced such that workers could earn per hour and were available to workers with a approval rating and who lived in the United States. Each task took approximately minutes to complete, and workers were given hours to complete it to ensure they were not pressured for time.
We placed attention checks in each task. Attention checks are gold-standard verification questions whose answers were already known. Attention checks were randomly placed within the series of verifications to gauge how well a worker performed on the given task. To prevent workers from incorrectly marking an attention check due to subjective interpretation of the description, we manually marked these attention checks correct or incorrect. Examples of attention checks are shown in Figure 9. Incorrect attention checks were completely mismatched from their image; for example “A very tall sailboat” was used as an incorrect attention check matched to an image of a lady wearing a white dress. We created a total of unique attention checks to prevent workers from simply memorizing the attention checks.
Even though these attention checks were designed to be obviously correct or incorrect, we ensured that we did not reject a worker’s submission based on a single, careless mistake or an unexpectedly ambiguous attention check. After completing a task, each worker’s submission is immediately accepted or rejected based on a rating, which is calculated as the percentage of the last attention checks correctly labeled. If a worker’s rating falls below the threshold of acceptable quality, their task is rejected. However, to ensure fair payment, even if a worker’s rating is below the threshold, their task is accepted if they get all the attention checks in the current task correct. This enables workers who are below the threshold to perform carefully and improve their rating as they continue to do more tasks.
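The acceptance rule above can be sketched as a rolling window over attention-check results. Several details here are assumptions: the class name, the window size used in the example (the actual size is not stated in this text), and the choice to include the current task's checks in the rating before deciding. The thresholds of 96 and 70 are the values reported for the two experimental conditions.

```python
from collections import deque


class WorkerRating:
    """Rolling rating over a worker's most recent attention checks."""

    def __init__(self, threshold, window):
        self.threshold = threshold          # minimum acceptable rating, in percent
        self.recent = deque(maxlen=window)  # oldest results fall off automatically

    def rating(self):
        """Percentage of recent attention checks answered correctly."""
        if not self.recent:
            return 100.0  # no evidence yet (an assumption)
        return 100.0 * sum(self.recent) / len(self.recent)

    def submit(self, check_results):
        """Record one task's attention-check results and return True if accepted.

        A task is accepted when the rating meets the threshold, or when every
        attention check in the current task is correct.
        """
        self.recent.extend(check_results)
        return self.rating() >= self.threshold or all(check_results)


# Hypothetical worker in the high threshold condition with a window of 10.
worker = WorkerRating(threshold=96, window=10)
print(worker.submit([True] * 5))                       # rating 100, accepted
print(worker.submit([True, False, True, True, True]))  # rating 90, rejected
print(worker.submit([True] * 5))                       # rating 90, all correct, accepted
```

The last call shows the fairness clause at work: the rating is still below the threshold, yet a flawless task is accepted, which gives the worker a path back above it.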
4.2 Experiment Setup
Our goal is to vary the acceptance threshold to see how it impacts worker quality over time. We performed a between-subjects study where we varied threshold and transparency. We ran an initial study with a different set of workers to estimate how people performed on this verification task. We found that workers get a mean accuracy of with a median accuracy of . We chose the thresholds such that the high threshold condition asked workers to perform above the median and the low threshold was below the standard deviation, allowing workers plenty of room to make mistakes. The high threshold factor level was set at while the low threshold factor level was set at . Workers in the high threshold level could only incorrectly label at most out of of the previous attention checks to avoid rejection, while workers in the low threshold level could err on out of the past attention checks.
We used two levels of transparency: high and low. In the high factor level, workers were able to see their current rating at the beginning of every task and were also alerted of how their rating changed after submitting each task. Meanwhile, in the low factor level, workers did not see their rating, nor did they know what their assigned threshold was.
We recruited workers from AMT for the study and randomized them between conditions. We measured workers’ accuracy and total number of completed tasks under these four conditions.
4.3 Data Collected
By the end of the study, workers completed tasks. In total, binary verification questions were answered, of which were attention checks. Table 2 shows the breakdown of the number of workers who completed at least task. Not all workers who accepted tasks completed them. In the high threshold condition, and workers did not complete any tasks in the high and low transparency conditions respectively. Similarly, and workers did not complete tasks in the low threshold. This resulted in and more people in the low threshold that completed tasks. Workers completed on average a total of verifications each.
On average, the accuracy of the work submitted by workers in all four conditions remained consistent (Figure 10). In the low threshold factor level, workers averaged a rating of and in the high and low transparency factor levels. Meanwhile, when the threshold was high, workers in the low transparency factor level averaged while the workers in the high transparency factor level averaged . Overall, the high transparency factor level had a smaller standard deviation throughout the course of workers’ lifetimes. We conducted a two-way ANOVA using the two factors as independent variables on all workers who performed more than tasks. The ANOVA found that there was no significant effect of threshold (F(, )=, p=) or transparency (F(, )=, p=), and no interaction effect (F(, )=, p=). Thus, worker accuracy was unaffected by the accuracy requirement of the task.
Unlike accuracy, worker retention was influenced by our manipulation. By the task, less than of the initial worker population continued to complete tasks. This result is consistent with our observations on the Visual Genome datasets and with previous literature showing that a small percentage of workers complete most of the crowdsourced work [little8020]. We also observe that workers in the high threshold and high transparency condition have a sharper dropout rate in the beginning. To measure the effects of the four conditions on dropout, we analyzed the logarithm of the number of tasks completed per condition using an ANOVA. (Log-transforming the data ensured that it was normally distributed and thus amenable to ANOVA.) The ANOVA found that there was a significant effect of transparency (F(, )=, p<) and threshold (F(, )=, p<), and also a significant interaction effect (F(, )=, p<). A post hoc Tukey test [tukey1949comparing] showed that the (1) high transparency and high threshold condition had significantly lower retention than the (2) low transparency and high threshold condition ().
|Threshold|High: 96|High: 96|Low: 70|Low: 70|
|Transparency|High|Low|High|Low|
|# workers with 0 tasks|106|116|137|138|
|# workers with >1 tasks|267|267|300|300|
Workers are consistent in their quality level. With this experiment, we are now ready to answer whether workers are consistent or satisfice to an acceptance threshold. Given that workers’ quality was consistent across all four conditions, the evidence suggests that workers do not satisfice: they maintain the same quality regardless of the threshold at which requesters accept their work. In the low threshold and high transparency condition, workers are aware that their work will be accepted if their rating is above %, and still perform with an average rating of %. Workers are risk-averse, and seek to avoid harms to their acceptance rate [mcinnis2016taking]. Once they find a strategy that allows their work to be accepted, they stick to that strategy throughout their lifetime [mitra2015comparing]. This result is consistent with the earlier observational data analysis.
Workers minimize risk by opting out of tasks above their natural accuracy level. If workers do not adjust their quality level in response to task difficulty, the only other possibility is that workers self-select out of tasks they cannot complete effectively. Our data supports this hypothesis: workers in the high transparency and high threshold condition completed significantly fewer tasks on average. The workers self-selected out of the task when they had a higher chance of rejection. Out of workers in the high transparency and high threshold condition, workers stopped working once their rating dropped below the % threshold. Meanwhile, in the high transparency and low threshold condition, out of the workers who completed our tasks, almost all of them continued working even if their rating dropped below the % threshold, often bringing their rating back up to above %.
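The opt-out behavior described above can be made concrete with a small sketch. It is illustrative only and does not reproduce our analysis code: it assumes, for the sake of the example, that a worker's rating is the running percentage of correct answers, and the function names and toy thresholds are hypothetical.

```python
def running_ratings(outcomes):
    """Running percentage of correct answers after each task.

    `outcomes` is a list of 0/1 values (1 = correct). Treating the rating
    as a running accuracy is an assumption made for this sketch.
    """
    ratings, correct = [], 0
    for i, ok in enumerate(outcomes, start=1):
        correct += ok
        ratings.append(100.0 * correct / i)
    return ratings


def quit_below_threshold(outcomes, threshold):
    """True if the worker's final rating was below the threshold,
    i.e. they stopped working while their work was at risk of rejection."""
    ratings = running_ratings(outcomes)
    return bool(ratings) and ratings[-1] < threshold


def recovered(outcomes, threshold):
    """True if the rating dipped below the threshold at some point
    but ended at or above it, i.e. the worker kept going and recovered."""
    ratings = running_ratings(outcomes)
    return any(r < threshold for r in ratings) and ratings[-1] >= threshold


# Toy workers with arbitrary thresholds (not the study's values).
stopped_early = [1, 1, 0]        # rating falls to ~66.7, then no more work
kept_going = [0, 1, 1, 1, 1]     # rating climbs from 0 to 80
print(quit_below_threshold(stopped_early, threshold=90))
print(recovered(kept_going, threshold=75))
```

Counting the workers for whom `quit_below_threshold` holds in each condition gives the kind of comparison reported in this paragraph.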
This study illustrates that workers are consistent over very long periods of hundreds of tasks. They develop a strategy for completing the task within the first few tasks and stick with it throughout their lifetime. If their work is approved, they continue to complete the task using the same strategy. If their strategy begins to fail, instead of adapting, they self-select out of the task.
5 Predicting From Small Glimpses
The longitudinal analysis in the first section and the experimental analysis in the second section found that crowd worker quality remains consistent regardless of how many tasks the worker completes and regardless of the required acceptance criteria. Bolstered by this result, this section demonstrates the efficacy of predicting a worker’s future quality by observing a small glimpse of their initial work. The ability to predict a worker’s quality on future tasks can help requesters identify good workers and improve the quality of data collected.
5.1 Experimental Setup
To create a prediction model, we use the question answering dataset. Our aim is to predict a worker’s quality on the task towards the end of their lifetime. Since workers’ individual quality on every single task can be noisy, we estimate a worker’s future quality as the average of their accuracy on the last of their tasks in their lifetime.
We allow our model to use between the first and the first tasks completed by a worker to estimate their future quality. Therefore, we only test our model on workers who have completed at least tasks. As a baseline, we calculate the average of all workers’ performances on their last tasks. We use this value as our guess for each individual worker’s future quality. This model assumes a worker does as well as the average worker does on their final tasks.
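The target and the baseline can be sketched as follows. This is a minimal illustration under stated assumptions rather than our implementation: the function names are hypothetical, the fraction of final tasks is left as the parameter `tail_fraction`, the error is assumed to be a mean absolute error, and the worker data are toy values.

```python
from statistics import mean


def future_quality(accuracies, tail_fraction):
    """Mean accuracy over the final `tail_fraction` of a worker's tasks."""
    if not 0 < tail_fraction <= 1:
        raise ValueError("tail_fraction must be in (0, 1]")
    n_tail = max(1, round(len(accuracies) * tail_fraction))
    return mean(accuracies[-n_tail:])


def baseline_prediction(workers, tail_fraction):
    """A single global guess: the mean future quality across all workers."""
    return mean(future_quality(acc, tail_fraction) for acc in workers.values())


def mean_absolute_error(predictions, targets):
    """Assumed error measure for this sketch."""
    return mean(abs(p - t) for p, t in zip(predictions, targets))


# Toy per-task accuracies (1.0 = correct); not data from the study.
workers = {
    "w1": [1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "w2": [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0],
}
tail = 0.2  # arbitrary value chosen for the example
targets = [future_quality(acc, tail) for acc in workers.values()]
guess = baseline_prediction(workers, tail)
print(guess, mean_absolute_error([guess] * len(targets), targets))
```

Because the baseline assigns every worker the same value, its error reflects only how much workers differ from one another, which is what the per-worker models are meant to improve on.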
Besides the baseline, we use two separate models to estimate a worker’s future quality: an average model and a sigmoid model. The average model is a simple model that uses the average of the worker’s tasks as the estimate for all future quality predictions. For example, if a worker averages % accuracy on their first five tasks, the average model would predict that the worker will continue to perform at a % accuracy. However, if the worker’s quality on their last % of tasks is %, then the prediction error would be %. The sigmoid model attempts to represent a worker’s quality as a sigmoid curve with parameters to adjust for the offset of the curve. We use a sigmoid model because we find that many workers display a very brief learning curve over the first few tasks and remain consistent thereafter. The initial adjustment and future consistency closely resemble a sigmoid curve.
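The two models can be sketched as below. This is a hedged illustration, not our implementation: the sigmoid parameterization (`plateau`, `rate`, `offset`), the use of the fitted plateau as the prediction, and the coarse grid search used for fitting are all assumptions introduced for the example, as are the function names and the toy data.

```python
import math
from statistics import mean


def average_model(first_tasks):
    """Predict future quality as the mean quality of the observed tasks."""
    return mean(first_tasks)


def sigmoid(t, plateau, rate, offset):
    """Assumed curve: quality rises toward `plateau` as task index t grows."""
    return plateau / (1.0 + math.exp(-rate * (t - offset)))


def fit_sigmoid(first_tasks):
    """Coarse grid search minimizing squared error on the observed tasks.

    Returns (plateau, rate, offset). A grid search stands in for whatever
    optimizer one would normally use; it keeps the sketch dependency-free.
    """
    n = len(first_tasks)
    best, best_loss = None, float("inf")
    for plateau in [i / 20 for i in range(1, 21)]:             # 0.05 .. 1.00
        for rate in [0.25, 0.5, 1.0, 2.0, 4.0]:
            for offset in [x / 2 for x in range(-4, 2 * n + 1)]:  # -2 .. n
                loss = sum(
                    (sigmoid(t, plateau, rate, offset) - y) ** 2
                    for t, y in enumerate(first_tasks)
                )
                if loss < best_loss:
                    best, best_loss = (plateau, rate, offset), loss
    return best


def sigmoid_model(first_tasks):
    """Predict future quality as the plateau of the fitted curve."""
    plateau, _, _ = fit_sigmoid(first_tasks)
    return plateau


# Toy per-task quality scores in [0, 1] with a brief learning curve.
first_tasks = [0.5, 0.7, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
print(average_model(first_tasks), sigmoid_model(first_tasks))
```

In this toy example the average is pulled down by the first two tasks, while the fitted plateau tracks the level the worker settles at. This is the intuition behind preferring the sigmoid model once enough tasks have been observed.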
The average of all workers’ accuracy is . Using this value as a baseline model for quality yields an error of . We plot the error of the baseline as a dotted line in Figure 11. The average model performs better: even for only a glimpse of tasks, its error is . After seeing a worker’s first tasks, the model gets slightly better and has a prediction error of . The sigmoid model outperforms the baseline but underperforms the average model and achieves an error of for . As the sigmoid model incorporates more tasks, it becomes the most accurate, managing an error rate of after seeing tasks. Furthermore, the standard deviation of its error also decreases from to as increases.
Even a glimpse of five tasks can predict a worker’s future quality. Since workers are consistent over time, both the average and the sigmoid models are able to model workers’ quality with very little error. When workers initially start doing work, a simple average model is a good choice for a model to estimate how well the worker might perform in the future. However, as the worker completes more and more tasks, the sigmoid model is able to capture the initial adjustment a worker makes when starting a task. By utilizing such models, requesters can estimate which workers are most likely to produce good work and can easily qualify good workers for long-term work.
6 Implications for Crowdsourcing
Encouraging diversity. The consistent accuracy and constant diversity of worker output over time make sense from a practical perspective: workers are often acclimating to a certain style of completing work [mcinnis2016taking] and often adopt a particular strategy to get paid. However, this formulaic approach might run counter to a requester’s desire to have richly diverse responses. Checks to increase diversity, such as enforcing a high threshold for diversity, should be employed without fear of degrading worker quality, as we have observed that quality does not significantly change with varying acceptance thresholds. Therefore, designing tasks that promote diversity without affecting the annotation quality is a ripe area for future research.
Worker retention. Additional experience affects completion speeds but does not translate to higher quality data. Much work has been done to retain workers [dai2015and, difallah2014scaling, law2016curiosity], but, as shown, retention does not equate to increases in worker quality, only to more work completed. Further work should be conducted to not only retain a worker pool, but also examine methods of identifying good workers [karger2011budget] and more direct interventions for training poorly performing workers [dow2012shepherding, kittur2008crowdsourcing].
Additionally, other studies have shown that the motivation of workers is the predominant factor in the development of fatigue, rather than the total time worked [beckers2004working]. Although crowdsourcing can be intrinsically motivated [von2006games], the microtask paradigm found in the majority of crowdsourcing tasks favors a structure that is efficient [krishna2016embracing, heer2010crowdsourcing] for workers rather than being interesting for them [chandler2013breaking]. Future tasks should consider building continuity in their workflow design for both individual worker efficiency [lasecki2014using] and overall throughput and retention [dai2015and].
Person-centric versus process-centric crowdsourcing. Attaining high quality judgments from crowd workers is often seen as a challenge [rashtchian2010collecting, shaw2011designing, sorokin2008utility]. This challenge has catalyzed studies suggesting quality control measures that address the problem of noisy or low quality work [downs2010your, kittur2008crowdsourcing, mason2010financial]. Many of these investigations study various quality-control measures as standalone intervention strategies. While we explored process-centric measures like varying the acceptance threshold or the level of transparency, previous work has experimented with varying financial incentives [mitra2015comparing]. All the results support the conclusion that process-centric strategies do not produce a significant difference in the quality of work submitted. While we agree that such process-focused strategies are important to explore, our data reinforces that person-centric strategies (like utilizing worker approval ratings or worker quality on initial tasks) may be more effective [mitra2015comparing, rzeszotarski2012crowdscape] because they identify a worker’s (consistent) quality early on.
Limitations. Our analysis solely focuses on data labeling microtasks, and we have not yet studied whether our findings translate over to more complex tasks, such as designing an advertisement or editing an essay [kittur2011crowdforge, bernstein2010soylent]. Furthermore, we focus on weeks-to-months crowd worker behavior based on datasets collected over a few months, but there exist some crowdsourcing tasks [brelig2013system] that have persisted far longer than our study. Thus, we leave the analysis of crowd worker behavior spanning multiple years to future work.
Microtask crowdsourcing is rapidly being adopted to generate large datasets with millions of labels. Under the Pareto principle, a small minority of workers complete a great majority of the work. In this paper, we studied how the quality of workers’ submissions changes over extended periods of time as they complete thousands of tasks. Contrary to previous literature on fatigue and satisficing, we found that workers are extremely consistent throughout their lifetime of submitting work. They adopt a particular strategy for completing tasks and continue to use that strategy without change. To understand how workers settle upon their strategy, we conducted an experiment where we varied the required quality for large crowdsourcing tasks. We found that workers do not satisfice and consistently perform at their usual quality level. If their natural quality level is below the acceptance threshold, workers tend to opt out from completing further tasks. Due to this consistency, we demonstrated that brief glimpses of just the first five tasks can predict a worker’s long-term quality. We argue that such consistent worker behavior must be utilized to develop new crowdsourcing strategies that find good workers and collect consistently high-quality annotations.
Acknowledgements. Our work is partially funded by an ONR MURI grant.