Annotations (aka labels) provide the basis for training and testing supervised learning models. Consequently, ensuring the quality of annotations is important, especially in a crowdsourced setting with remote, inexpert annotators. While quality assurance for crowdsourcing is well-studied, relatively little work has studied variable effort annotation tasks in which the number of labels required per item can greatly vary. Examples might include labeling all faces in an image, named entities in a text, or bird calls in an audio recording. Because the number of instances to label per item can greatly vary, some items require far more effort than others to annotate. Moreover, because there is typically no natural upper bound on the number of instances present, some individual items may require enormous effort. Finally, the annotation effort required for each item is not known until after it is annotated, since determining the number of labels required is an implicit part of the annotation task itself.
In this paper, we first conceptualize the notion of variable effort tasks and how they differ from more typical annotation tasks. For example, such annotation tasks are implicitly two-step: searching the item for all instances matching a target type (e.g., “face”), then applying a labeling operation (e.g., bounding box) to each matching instance. With labeling effort proportionate to the size of search results, the variable size of search results is the key challenge. This framing also helps us relate variable effort tasks to a wider class of annotation search tasks (Kutlu et al., 2020).
Next, we empirically investigate the specific variable effort labeling task of object detection: finding and localizing human faces (via bounding boxes) in Open Images (Kuznetsova et al., 2020). Whereas many prior studies on object detection report results on simpler datasets having only a few objects per image, our dataset includes as many as 14 faces per image. Our results show that crowdsourced annotator accuracy and recall on Mechanical Turk (MTurk) drop markedly as the number of faces per image increases. We hypothesize a set of key underlying issues contributing to this reduced quality: inconsistency of worker experience, the potential for high cognitive load, and ineffective incentive design.
To address these issues, we adapt and assess a set of general best-practice methods for quality crowdsourcing: financial incentives, workflow design, and visible gold (i.e., questions that provide periodic feedback to workers on their accuracy as they work). We implement five specific approaches: variable pay per instance, post-task bonuses, task decomposition, iterative improvement, and in-task visible gold with uniform frequency. Notably, only visible gold improves quality.
Motivated by this finding, we further explore the design space for effective use of visible gold questions in variable effort labeling tasks. While prior work shows that visible gold can improve data quality (Le et al., 2010; Gadiraju et al., 2015), many questions remain. How should we present feedback for variable effort labeling tasks like object detection? What is the optimal strategy to issue visible gold questions? How can the effect of visible gold be strengthened by quality-related consequences (i.e., warnings and bonuses)? We explore different variants of visible gold task designs. We find that combining both upfront and regular testing sustains data quality significantly better than upfront or regular testing alone. Moreover, imposing quality-related consequences yields further improvement. Our final variant of visible gold integrates dynamic testing with tier-based consequences and significantly outperforms all other approaches.
Contributions. We make three primary contributions in this work:
We conceptualize a class of variable effort human annotation tasks. We identify a unique set of data quality challenges they present, along with an empirical analysis of these challenges in the context of object detection.
We systematically evaluate existing methods to address these challenges and show that providing in-task feedback through visible gold significantly outperforms various other baselines, including approaches that adjust pay according to effort or that standardize effort at constant pay.
We contribute an in-depth analysis of different visible gold variants investigating issuance patterns and consequences for workers. Based on these investigations, we propose and evaluate an improved visible gold design that significantly increases bounding box accuracy by 5.7% compared to a basic visible gold variant and by 7.5% compared to a baseline without visible gold.
2. Related Work
2.1. Financial Incentives and Crowd Work
Financial incentives can influence work in various ways: who chooses to accept work, how much work they perform, and the quality of work they produce. Vaughan (2017) presents a valuable, succinct review of related work in this area. Early work suggested quality was not impacted by payment (Mason and Watts, 2009; Buhrmester et al., 2011; Grady and Lease, 2010). In some cases (Buhrmester et al., 2011; Grady and Lease, 2010), the difference in payments may have been too low to influence behavior. Mason and Watts (2009) hypothesized an anchoring effect, with workers’ sense of fair payment anchored by whatever was offered. Ipeirotis (2011) reports a similar finding.
Whereas the early studies used crowdsourcing tasks that were relatively easy to perform, Ho et al. (2015) and Ye et al. (2017) instead studied “effort-responsive tasks” in which workers could improve output via more time or effort, and did see the quality improve with financial incentives. Yin and Chen (2016) find that while very engaged or un-engaged workers appear insensitive to price, more middling workers improve quality with financial incentives.
Horton and Chilton (Horton and Chilton, 2010) frame the issue with respect to the economic notion of reservation wage: “…the minimum wage a worker is willing to accept …for performing some task; it is the key parameter in models of labor supply.” Thus, as pay decreases, it could fail to match more workers’ reservation thresholds and potentially bias the sample of workers who choose to perform the task. However, Horton and Chilton find mixed evidence for worker behavior conforming to predictions of the rational model: “workers are clearly sensitive to price but insensitive to variations in the amount of time it takes to complete a task.”
MTurk’s pay-per-task pricing model, familiar in crowdsourcing research, encourages work efficiency but risks rushed work, since worker earnings can be increased by completing more tasks in less time. An alternative pricing model is hourly pay. Both Mankar et al. (2017) and Whiting et al. (2019) proposed technical approaches making it easier for requesters to offer hourly pay jobs on MTurk. Some commercial vendor workforces (https://aws.amazon.com/sagemaker/groundtruth/pricing/) also set fixed hourly pay rates. For example, Amazon SageMaker GroundTruth’s popular vendor iMerit (https://aws.amazon.com/marketplace/pp/B07DK37Q32) charges $6.12/hour per worker. While hourly pay has the potential to discourage rushed work, since all time worked is compensated, a vendor workforce may still operate internally on a call-center model, where workers may have productivity quotas that similarly encourage them to work efficiently (to enable the vendor to provide competitive pricing and ensure profitability).
2.2. Workflow Designs
A crowdsourcing task workflow defines how a single task is organized into a set of HITs that can be completed by one or more workers. Literature shows that we can improve the data quality by adopting a suitable workflow for each task. In this work, we are particularly interested in workflow designs that can be used to standardize the task effort in variable effort tasks.
Prior work highlights two main workflow paradigms, iterative and parallel (Goto et al., 2016). In iterative workflows, we present the same task to multiple workers in a sequential manner, where workers can see previous workers’ responses. Little et al. (Little et al., 2010) show that an iterative workflow can improve the average data quality in writing and brainstorming tasks. However, the paper highlights that work produced through parallel workflows could still yield individual responses with higher quality. In a translation task, Ambati et al. (Ambati et al., 2012) show that a 3-phased iterative workflow can achieve higher quality than a baseline that gathers individual translations from 5 workers for each item. Iterative workflows can also allow us to engage workers with different expertise levels at each iteration (Ambati et al., 2012).
Parallel workflows aim to get multiple workers to work on parts of the task at the same time (Goto et al., 2016). Parallel work can be on the same task unit (i.e., obtaining multiple answers for the same unit) or smaller sub-tasks obtained through task decomposition. Find-Fix-Verify (Bernstein et al., 2010) is a specific workflow pattern that facilitates task decomposition through the initial find step and works well for writing tasks such as proofreading, formatting, and shortening text (Bernstein et al., 2010). Prior work by Kittur et al. (Kittur et al., 2011) proposes a framework for decomposing complex crowd tasks. It shows that in a writing task, articles produced through task decomposition received higher ratings and had lower variability than individual-produced articles. Recent work has also investigated how to optimally decompose a task into atomic sub-tasks considering the desired reliability and cost (Tong et al., 2018).
Often a combination of iterative and parallel elements can be used to create a more versatile workflow. Other notable work includes tools that can help visualize and manage complex crowdsourcing workflows (Kittur et al., 2012), allow workers to create workflows (Kulkarni et al., 2012), and optimize workflows (Dai et al., 2013).
2.3. Gold Standard Questions
The use of gold standard questions (also known as control or gold questions) is a fundamental and widely used quality control mechanism in crowdsourcing (Daniel et al., 2018). By injecting gold standard questions and evaluating responses, requesters can accurately measure worker performance (Huang and Fu, 2013).
Prior work has sought to predict the optimum number of gold questions to include for estimation tasks such as estimating the price of a product. That work concludes that when using a two-stage estimation (i.e., estimating worker quality using only gold data), the number of control questions should, as a rule of thumb, equal the square root of the number of labels provided by the worker. Recent work has also explored more dynamic approaches that leverage gold standard questions to select tasks for workers such that overall accuracy is maximized (Bragg et al., 2016; Fan et al., 2015; Khan and Garcia-Molina, 2017). However, the problem of utilizing and assigning gold standard questions has been mainly investigated in the context of multiple-choice questions, and some solutions are not generalizable across different task types. In particular, much of the previous work has relied on worker accuracy estimation models that only work with binary outcomes or multiple-choice questions.
On the one hand, using a small pool of gold standard questions can lead to problems when gold questions are repeated and flagged by workers (Checco et al., 2018, 2020). On the other hand, crowdsourcing is typically used for problems for which sourcing ground truth data is not straightforward. Thus, creating good gold data at scale and at a low cost is essential for implementing gold standards. Oleson et al. (Oleson et al., 2011) propose a programmatic approach to generate gold standard data. This study indicates that programmatic gold can increase the gold per question ratio, allowing for high-quality data without increased costs.
As opposed to creating gold questions prior to the label collection, we can also iteratively validate selected answers using experts. For example, Hung et al. (Hung et al., 2015) investigate classification tasks and propose a probabilistic model that can find the most beneficial answer to validate in terms of result correctness and detection of faulty workers. Reliable and high-quality gold data can also be generated by using domain experts (Hara et al., 2013).
2.4. Visible Gold
Typically, workers cannot distinguish between a regular question and a gold standard question. Answers received for gold questions are used to estimate the worker quality in the post-processing step or during run-time. However, gold standard questions can also be used to provide training and feedback to workers (Le et al., 2010; Gadiraju et al., 2015; Doroudi et al., 2016).
Research shows that providing feedback can enhance data quality in crowdsourcing. Dow et al. (Dow et al., 2012) report that both self-assessment and external expert feedback can improve crowd work quality. The study highlights that workers who receive external assessments tend to revise their work more (Dow et al., 2012). Similarly, feedback from peers in organized worker groups can help workers achieve high output quality (Whiting et al., 2017). In a peer-review setup, the process of reviewing others’ work has also been shown to help workers elevate their own data quality (Zhu et al., 2014). While peer and expert feedback can improve data quality, it is difficult to achieve the timeliness that is critical for implementing a feedback system at scale. In addition to feedback on work, workers could also benefit from learning opportunities on how to effectively use the tools and their related metrics (Savage et al., 2020).
From prior research by Dow et al. (Dow et al., 2012), we can identify three key aspects of feedback for crowd work. ‘Timeliness’ is how quickly the worker receives the feedback in either a synchronous or asynchronous fashion. ‘Specificity’ is the level of detail in the feedback, ranging from a binary decision (e.g., approve, reject) to template-based structured feedback to detailed task-specific feedback. Finally, the ‘Source’ of the feedback could be the requester, experts, peer workers, or the worker him- or herself.
A dedicated training phase, in which workers complete several training tasks and receive feedback until they reach the desired quality level, has also been shown to be effective in crowd tasks that involve complex tools and interfaces (Park et al., 2014). However, prior work shows that training or feedback can introduce a bias due to the specific examples selected for the training/feedback step (Le et al., 2010). Other work uses feedback to clarify ambiguous task instructions as opposed to improving the quality of work. For instance, Manam and Quinn (Manam and Quinn, 2018) propose a Q&A and Edit functionality that can be used by workers to clarify and improve task instructions or questions.
Prior work shows that, in a relevance categorization task, a uniform distribution of labels in visible gold standard data produces optimal peaks when considering individual worker precision, as well as majority voting aggregated results. That study includes a dedicated pre-task training phase that workers must pass to qualify for the task. Visible gold questions are inserted based on a simple ratio where workers encounter 1 visible gold question for every 4 questions. Workers are also blocked from continuing on a task if their accuracy is low. Before being blocked, workers receive a warning that their accuracy is too low and that they should reread the instructions to correct mistakes.
Other work tests two training methods with visible gold. In implicit training, workers receive training when they provide erroneous responses to gold questions, and in explicit training, workers are required to go through a training phase before they attempt the task itself. The results indicate that training provides a 5% performance gain and 40% time gain across 4 task types (information finding, spam detection, sentiment analysis, image transcription). However, the experimental setup does not define a specific gold injection strategy for implicit training. Instead, it considers all questions as gold. Using complex web search challenges as the task, Doroudi et al. (Doroudi et al., 2016) also show that providing expert examples upfront is an effective form of training.
2.5. Bounding Box Annotation
Early improvements to object detection include improvements to the crowdsourcing task workflow. The object annotation workflow proposed by Su et al. (Su et al., 2012) entails three steps. First, a worker draws a bounding box around a single object instance. Second, another worker verifies the drawn box. Third, a different worker determines if there are additional instances of the object class that need to be annotated. The paper reports that 97.9% of images are correctly covered with bounding boxes.
Other approaches use computer vision methods to generate bounding boxes during the annotation process (Papadopoulos et al., 2016; Adhikari and Huttunen, 2020; Adhikari et al., 2018; Russakovsky et al., 2015). Prior work by Papadopoulos et al. (Papadopoulos et al., 2016) using an accept/reject decision could achieve high-quality results comparable to standard manual annotation. Similarly, Adhikari and Huttunen (Adhikari and Huttunen, 2020) propose a semi-automated batch-wise method where a subset of images is annotated and then used to train an object detection model that can generate bounding boxes for the remaining images. As the last step, generated annotations go through a manual verification where workers add/remove boxes as required. This method can reduce the manual effort by up to 75%.
Literature has also investigated how we could use different annotation strategies instead of the standard way of drawing a bounding box through click and drag interactions. For instance, bounding boxes could be auto-generated by asking workers to annotate four edge points (points belonging to the top, bottom, left-most, and right-most parts) of the object (Papadopoulos et al., 2017a). A similar approach uses a single point that corresponds to the center of the target object as opposed to four edge points (Papadopoulos et al., 2017b).
A key challenge in comparing our empirical results vs. those reported in prior studies is that they tend to report on datasets having few objects per image on average: 2.5 for PASCAL VOC 2007 (used by (Papadopoulos et al., 2016, 2017a)), 2.4 for PASCAL VOC 2012 (used by (Adhikari and Huttunen, 2020; Papadopoulos et al., 2017a)), and 1.5 for ImageNet (used by (Su et al., 2012)) datasets (Liu et al., 2020). So while Papadopoulos et al. (2017a) report 88% mIoU annotation quality on PASCAL VOC 2007, this is a much easier task than ours. In contrast, Russakovsky et al. (Russakovsky et al., 2015) report 7 objects per image on average (similar to us), but they do not report annotator mIoU.
3. The Challenge of Variable Effort Labeling Tasks
3.1. Defining Variable Effort Labeling Tasks
While crowdsourced annotation is well-studied, variable effort tasks present three key challenges vs. more typical labeling tasks: inconsistent worker experience, the potential for high cognitive load, and effective incentive design. With regard to inconsistent experience, workers may implicitly expect all task instances to require comparable effort. Highly varying effort requirements across instances would violate such an expectation and could induce surprise or frustration. Secondly, as the cognitive load becomes excessive (e.g., labeling 1000 faces in a single image of a crowd), workers may not only be frustrated but naturally struggle to complete the task accurately. As for incentive design, the typical task-based pricing model à la MTurk assumes that all task instances are compensated at the same fixed rate. Since more effortful instances require more time to complete (accurately), this equates to a lower effective earning rate for workers on those instances. These challenges, taken separately and especially together, can have various negative impacts. Workers may choose not to accept a task or quickly abandon it. They might complete easy instances but skip over more effortful ones. They may fail to deliver quality work due to demanding cognitive load or simple lack of effort.
Such tasks exemplify the applicability of rationales to a broad class of Where’s Waldo? (Wikipedia, 2020) search problems of determining whether or not a given item contains entities of interest (e.g., does Waldo appear in a given image or video clip, do we hear his voice in a given audio recording, is he discussed in a given text, etc.). The larger the item, the greater the problem searching it. For example, imagine annotating all trees in massive satellite or aerial imagery, requiring annotators to zoom and pan around images. The search problem may be explicit – e.g., does an audio clip contain a bird call? – or implicit – e.g., rate a product from its description, where the primary task is to rate the item but the annotator must search the item for evidence to support their rating decision.
Framing this search problem lets us relate variable effort annotation tasks to a large body of related work mobilizing the crowd for distributed search of large search spaces. Classic examples include the search for extraterrestrial intelligence (SETI@Home) (SetiHome, 2021), for Jim Gray’s sailboat (Vogels, 2007) or other missing people (Wang et al., 2009), for DARPA’s red balloons (Tang et al., 2011), for astronomical events of interest (Lintott et al., 2008), for endangered wildlife (Rosser and Wiggins, 2019) or bird species (Kelling et al., 2012), etc. Attenberg et al. (2011) asked the crowd to find examples on which classifiers erred. Across such examples, what is being sought must be broadly recognizable so that the crowd can accomplish the search task without the need for subject matter expertise (Kinney et al., 2008). Whereas the works above involve searching for an entity across domain instances, with variable effort labeling tasks, the challenge is searching within each instance for matching entities.
There is limited prior work that examines how crowd work quality can vary when attempting tasks that involve a variable effort. In a study where workers are asked to annotate either 5 or 10 items in each HIT, Kazai (Kazai, 2011) shows that better results can be obtained when workers are not overloaded. Similarly, crowd workers make more errors in counting tasks that include a large number of target objects (Sarma et al., 2015; Das Sarma et al., 2016). Our work intends to systematically evaluate the impact of variable effort on outcome quality by using a task that involves 14 discrete effort levels and requires individual actions for each work unit in the task.
Other work that focuses on task complexity or difficulty has implicitly explored the relationship between task effort and data quality (Cai et al., 2016; Newell and Ruths, 2016; Aipe and Gadiraju, 2018; Yang et al., 2016). For instance, research shows how task ordering can impact the data quality when deploying tasks with varying complexity (Cai et al., 2016). While these attributes are closely related, task complexity is a different abstraction from task effort. A task that requires more effort (e.g., annotating an image with 15 faces) is not necessarily more complex than a task that requires less effort (e.g., annotating an image with 2 faces).
3.2. Face Detection Task and Dataset
For our study, we chose the variable effort task of detecting human faces within an image and drawing bounding boxes around each face. Object detection is one of the most common tasks available on crowdsourcing platforms and is substantially more difficult and time-consuming than simpler tasks like multiple choice questions (Su et al., 2012).
We selected 140 images and ground truth data from the Open Images (Kuznetsova et al., 2020) dataset. The number of faces in the images we selected ranged from 1 to 14, with 10 images per face count. For each subset, we sorted images by ID and picked the first 10 images, which corresponds to a pseudo-random selection. Images with potentially ambiguous human faces, such as cartoon characters or statues, were excluded to allow for definitive quality assessments. Figure 1 shows image examples with low and high face counts respectively.
Workers completed the face detection task in a standard image annotation interface supporting basic operations such as creating, adjusting and deleting annotations, and zooming. We provided consistent base instructions for the annotation task across all experimental conditions, with some condition-specific instructions added where necessary. The instructions also included three correctly annotated example images (Doroudi et al., 2016) in each of the conditions.
Labeling tasks are typically crowdsourced based on MTurk’s pay-per-task pricing model, which assumes that all task instances are compensated at the same fixed rate. We assume this standard pricing model as our baseline, though we anticipate it may not be optimal for variable effort labeling tasks in which some tasks require much more effort than others. While a stronger baseline that adjusts payments (e.g., Fair Work (Whiting et al., 2019)) is desired, we select a more prevalent baseline to ensure the external validity of our results.
We grouped our image set into two distinct bins based on object count and defined a static payment amount for each bin. Workers received $0.16 for completing images with an object count between 1 and 7 and $0.44 for completing images with an object count between 8 and 14. This approach assumes an amortized pay of $0.04 per individual object label. A more basic alternative would have been to administer a constant pay of $0.30 per image without binning, but we discarded this design to compare interventions with a more competitive baseline. We used the standard object detection workflow offered on the Amazon Mechanical Turk platform for our baseline, and no visible gold was administered in this condition.
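The bin payments above are consistent with multiplying the mean object count of each bin by the amortized rate of $0.04 per object. The following minimal sketch reproduces that arithmetic; the function name and the use of integer cents are illustrative assumptions, not part of the study's tooling.

```python
# Illustrative sketch only: the names below are hypothetical.
# Amounts are kept in integer cents to avoid floating-point rounding.
PAY_PER_OBJECT_CENTS = 4  # $0.04 amortized pay per object label (from the text)


def binned_pay_cents(low: int, high: int) -> int:
    """Static pay for a bin covering object counts low..high (inclusive),
    assuming pay equals the bin's mean object count times the per-object rate."""
    counts = range(low, high + 1)
    mean_count = sum(counts) / len(counts)
    return round(mean_count * PAY_PER_OBJECT_CENTS)


low_bin = binned_pay_cents(1, 7)     # bin with 1 to 7 objects
high_bin = binned_pay_cents(8, 14)   # bin with 8 to 14 objects
flat_rate = binned_pay_cents(1, 14)  # single bin, i.e. constant pay per image

print(low_bin, high_bin, flat_rate)
```

Under this reading, the two bins yield 16 and 44 cents, and a single bin over all counts yields the 30 cents mentioned as the constant-pay alternative.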
3.4. Experimental Setup
We conducted our crowdsourcing experiments on the Amazon Mechanical Turk (MTurk) platform. All experiments were deployed between 2 PM and 5 PM Pacific Time. Based on our work time estimates across all conditions, workers on average received an hourly pay of $10.44, whereas the federal hourly minimum wage in the US is $7.25. Experiments were open to a subset of MTurk workers (a cohort of more than 8000 workers) who had previously qualified for the bounding box task based on a proprietary quality assurance mechanism. Each experimental condition was available to a unique worker pool created by segmenting the worker subset. Workers were free to complete as few or as many HITs as they liked. In each condition, images were presented in a random order. We employed a basic filtering step across all the experimental conditions where we removed obvious spammers by filtering out workers who completed more than 5 HITs and had an average mIoU below 25. We only ended up removing 3 workers across all conditions.
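The spammer filter described above can be sketched as follows. This is a hypothetical illustration, not the analysis code used in the study: the input format (pairs of worker ID and per-HIT mIoU) and the assumption that mIoU is expressed on a 0 to 100 scale are ours.

```python
from collections import defaultdict
from statistics import mean


def find_spammers(hits, min_hits=5, miou_threshold=25.0):
    """Return IDs of workers who completed more than `min_hits` HITs and whose
    average mIoU falls below `miou_threshold`.

    `hits` is an iterable of (worker_id, miou) pairs, one per completed HIT.
    Assumption: mIoU is on a 0-100 scale, matching the threshold of 25.
    """
    scores_by_worker = defaultdict(list)
    for worker_id, miou in hits:
        scores_by_worker[worker_id].append(miou)
    return {
        worker_id
        for worker_id, scores in scores_by_worker.items()
        if len(scores) > min_hits and mean(scores) < miou_threshold
    }


# Hypothetical example data: worker "a" is prolific and inaccurate,
# "b" is inaccurate but did only 5 HITs, "c" is prolific and accurate.
example_hits = [("a", 10.0)] * 6 + [("b", 10.0)] * 5 + [("c", 80.0)] * 6
print(find_spammers(example_hits))
```

Requiring more than 5 HITs before filtering avoids excluding a worker on the basis of a handful of possibly difficult images.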
We use mean intersection over union (mIoU) as the primary outcome to compare work quality. mIoU is a well-established quality metric for object detection tasks (Kuznetsova et al., 2020). It is computed as the average over the IoU values of all bounding boxes in the ground truth answer key. Each unmatched ground truth box (false negative) is assigned an IoU of 0. We also report task time as a secondary outcome, defined as the time from when a worker accepts a task until the task is submitted. Since we keep the average pay per bounding box constant in all conditions, task payment is not reported as an outcome measure in the paper.
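A minimal sketch of this metric is given below. The text does not specify how worker boxes are matched to ground truth boxes, so the greedy one-to-one matching by descending IoU used here is an assumption, as are the function names and the (x_min, y_min, x_max, y_max) box format.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(gt_boxes, worker_boxes):
    """Average IoU over all ground truth boxes; unmatched boxes contribute 0.

    Assumption: boxes are matched greedily, best-overlapping pair first,
    and each worker box can be matched to at most one ground truth box.
    """
    if not gt_boxes:
        return 0.0
    candidates = sorted(
        ((iou(g, w), gi, wi)
         for gi, g in enumerate(gt_boxes)
         for wi, w in enumerate(worker_boxes)),
        reverse=True,
    )
    matched = {}          # ground truth index -> IoU of its matched worker box
    used_worker = set()   # worker box indices already assigned
    for score, gi, wi in candidates:
        if score <= 0.0:
            break  # remaining pairs do not overlap at all
        if gi in matched or wi in used_worker:
            continue
        matched[gi] = score
        used_worker.add(wi)
    # Ground truth boxes without a match (false negatives) count as IoU 0.
    return sum(matched.get(gi, 0.0) for gi in range(len(gt_boxes))) / len(gt_boxes)


# Two ground truth faces, but the worker annotated only the first one (perfectly).
ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
worker = [(0, 0, 10, 10)]
print(mean_iou(ground_truth, worker))
```

Because the average is taken over ground truth boxes, a missed face lowers the score as much as a completely misplaced box, which is why the metric is sensitive to recall on images with many faces.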
3.5. Findings and Discussion
Figure 2. Variation in output quality against task effort with a naïve baseline of equal pay for each task. Shaded areas correspond to standard error.
We obtained responses from 24 unique workers for the baseline. Figure 2 shows that as the number of faces per image increases, annotation quality (as measured by accuracy and recall) declines. This reduction in data quality can be attributed either to workers producing low-quality annotations or to workers entirely missing particular target objects when many target objects are present in the image.
As noted above, the baseline assumes fixed-pay for all instances, despite the variable effort required. It could be argued that variation in required effort amortizes over multiple tasks to produce fair pay when workers complete enough work. However, this assumption is questionable. First, prior work shows that requesters often produce poor estimates of average effort per item and tend to underestimate the true effort (Cheng et al., 2015). This can hinder the administration of fair pay in a systematic manner. Second, work on crowdsourcing platforms typically follows a power-law distribution where only a few workers complete the majority of work, and the majority of workers abandon a task early (Han et al., 2019). Hence, the amortization assumption may only hold true for a small portion of the worker population while the remaining workers receive pay disproportionate to their effort. Third, even if workers are initially motivated to complete a large number of HITs, drop-out may be encouraged if the first few tasks happen to require high effort.
While high label quality is always desired, the consistency in degradation observed in proportion to the increasing effort is noteworthy. Firstly, uniform labeling quality across all items is desirable, without any consistent biases. Secondly, if we imagine training a detection model on this data, it is particularly important to have accurately annotated images with larger object counts (Shao et al., 2019).
4. Investigating Task Designs for Variable Effort Labeling Tasks
Section 3.1 suggested three key challenges with variable effort tasks: inconsistent worker experience, the potential for high cognitive load, and effective incentive design. In this section, we investigate the potential of various best practice task designs for crowdsourced annotation to address these challenges.
To investigate the general question of how standard quality improvement methods perform with variable effort annotation tasks, we picked several crowdsourcing data quality improvement methods that are generalizable and straightforward to implement. Since appropriate pay for effort is a key driver for quality in crowdsourcing (Kazai, 2011), we chose to include two data collection designs (variable pay and post-task bonus) that aim to calibrate pay according to the required effort on a per-image basis. Following popular iterative and parallel design paradigms in crowdsourcing, we included two other designs (task decomposition and iterative improvement) that aim to standardize the required effort in each task unit. Finally, we add the visible gold design, which uses gold standard questions for testing and training. For each of the data collection designs outlined below, we collected three responses per image.
4.1. Variable Pay
For better per-item calibration of payment, a more sophisticated data collection design may aim to estimate the effort for each item in advance and adjust the payment on a per-item basis accordingly. These estimates can be produced either manually (e.g., via upstream crowdsourcing workflows (Bernstein et al., 2010)) or automatically (e.g., via pre-built object detection models (Borji et al., 2019)).
However, this data collection design introduces other challenges. First, a priori estimation of effort is a non-trivial task and can be costly when manual workflows are required. Second, since HIT payment is one of the parameters used by Amazon Mechanical Turk to group HITs in the platform, tasks with different effort levels are advertised separately, allowing workers to selectively focus on tasks with high pay and ignore low paying tasks.
We instantiated this data collection design in our study by setting the payment amount for each image proportional to the exact object count as defined in the available ground truth data. In particular, we pre-calculated pay per image by multiplying the true object count with the base pay of $0.04 per object (e.g., $0.04 for images with a single object and $0.56 for images with 14 objects).
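The per-image payment rule described above can be sketched as follows. The function and constant names are illustrative assumptions, and amounts are kept in integer cents to avoid floating-point rounding; only the $0.04 rate and the two example counts come from the text.

```python
# $0.04 per object, as stated in the text (kept in cents to avoid float rounding).
BASE_PAY_CENTS_PER_OBJECT = 4


def variable_pay_cents(true_object_count: int) -> int:
    """Pay for one image, in cents, proportional to its ground-truth object count."""
    if true_object_count < 0:
        raise ValueError("object count must be non-negative")
    return true_object_count * BASE_PAY_CENTS_PER_OBJECT


# Reproduce the two examples given in the text: 1 object and 14 objects.
for count in (1, 14):
    print(f"{count} object(s): ${variable_pay_cents(count) / 100:.2f}")
```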
4.2. Post-task Bonus
An alternative to a priori effort estimation is to decide an appropriate payment amount after a task has been completed (Yin and Chen, 2015). This can be accomplished by setting up a HIT with a flat base payment and advertising a post-task bonus to compensate workers for any work completed beyond the base payment.
There are two main challenges to this data collection design pertaining to the trust relationship between workers and requesters. On the one hand, workers need to trust a requester to deliver on their promise of administering a post-hoc bonus and to choose the bonus amount fairly. On the other hand, requesters rely on good-faith execution of the task (e.g., the number of objects labeled or time spent) to produce accurate estimates of effort and fair bonus amounts. To this end, labels can be verified in a secondary process, e.g., manual verification through other workers, incurring additional cost for the requester.
We implemented this data collection design leveraging ground truth information available in our dataset. In particular, we offered workers a total payment of $0.04 for each object labeled correctly as per the ground truth data. The flat base payment was set to $0.04 for all images, and the difference between the flat base payment and the total payment was administered as the post-task bonus.
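As a rough illustration of this bonus rule, the sketch below computes the bonus as the total payment minus the flat base payment. The function name and parameters are hypothetical, and clamping the bonus at zero when no object is labeled correctly is our assumption rather than something stated in the text.

```python
def post_task_bonus_cents(correct_objects: int,
                          base_cents: int = 4,
                          rate_cents: int = 4) -> int:
    """Bonus in cents: total pay for correctly labeled objects minus the flat base pay.

    Assumption: the worker always keeps the base pay, so the bonus never goes below zero.
    """
    if correct_objects < 0:
        raise ValueError("correct_objects must be non-negative")
    total_cents = correct_objects * rate_cents
    return max(0, total_cents - base_cents)


# A worker who correctly labels all 14 faces in an image earns
# the base pay of 4 cents plus a bonus of 52 cents (56 cents in total).
bonus = post_task_bonus_cents(14)
print(f"bonus: ${bonus / 100:.2f}")
```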
4.3. Task Decomposition
The previous two data collection designs aim at adjusting payment to variable effort on a per-item basis. An alternative approach is to decompose tasks into fixed-size units with constant effort and to administer a constant pay amount per task unit.
Prior work has shown that task decomposition not only facilitates fair payment, but also aids workers in producing high-quality answers by managing cognitive load (Sarma et al., 2015). However, the process of decomposing tasks into smaller chunks of equal size is non-trivial, and prior work suggests that some tasks depend on the context, which may not be preserved during decomposition (Tong et al., 2018). Getting multiple individuals to attempt smaller portions of the same task can also help elevate the output quality (Sarma et al., 2015; Kittur et al., 2011; Teevan et al., 2016).
We implemented this data collection design using a two-step workflow. First, we identified all target objects in a given image. Second, we created sub-tasks where workers were asked to create bounding boxes with a pre-defined set of target objects indicated through point markers. In our experiment, we created HITs with a maximum of 3 target objects and a static pay of $0.08 per HIT corresponding to 2 target objects per image on average. We implemented two variants for identifying target objects in the initial step:
Oracle: In this variant, target objects were identified using the available ground truth data.
Manual: Since ground truth data is generally not available in practice, we implemented a second variant in which target objects were identified manually by workers through a separate upstream point annotation task.
In addition to the above variants, for scalability, task decomposition can also be achieved through automated estimation methods. For example, as the decomposition step does not require high accuracy, we can use a generic automated object detection model (Borji et al., 2019) to generate the object markers and decompose tasks.
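The second step of this workflow, grouping the identified point markers into HITs of at most three target objects, could look like the sketch below. The marker representation and the function name are assumptions for illustration; how markers were ordered or grouped in the actual experiment is not specified in the text.

```python
from typing import List, Sequence, Tuple

# A point marker as (x, y) image coordinates; this representation is an assumption.
Point = Tuple[float, float]


def decompose_into_hits(markers: Sequence[Point],
                        max_per_hit: int = 3) -> List[List[Point]]:
    """Split the point markers of one image into sub-tasks of at most max_per_hit objects."""
    if max_per_hit < 1:
        raise ValueError("max_per_hit must be at least 1")
    return [list(markers[i:i + max_per_hit])
            for i in range(0, len(markers), max_per_hit)]


# An image with 7 marked faces yields three HITs with 3, 3, and 1 target objects.
markers = [(10.0 * i, 5.0 * i) for i in range(7)]
hits = decompose_into_hits(markers)
print([len(hit) for hit in hits])
```

The same function applies to both variants, since the Oracle and Manual variants differ only in where the markers come from.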
4.4. Iterative Improvement
We also included a task design informed by iterative crowdsourcing workflows (Goto et al., 2016) where several workers contribute to the same task while each worker can see the results from the previous worker. Prior work shows that iteration can increase the response quality in specific tasks like writing (Little et al., 2010), brainstorming (Little et al., 2010), and translation (Ambati et al., 2012). The quality increase is typically attributed to corrective actions taken by subsequent workers. In addition, as our task involves variable effort, we also leverage an iterative workflow to regulate the amount of work that each worker needs to complete in a single iteration.
In our experimental design, multiple workers iteratively annotate the same image. In each iteration, we ask workers to either annotate a maximum of N=3 new objects, adjust existing work or mark the task as completed indicating that there is nothing left to annotate. Each iteration was deployed as a single HIT, and we set a static pay of $0.08 per HIT corresponding to an average number of two target objects per image.
4.5. Visible Gold Questions
Finally, we also explored a scenario where gold standard answers are available for a small subset of items in the dataset. These items with known answers are injected into workers’ task queues to assess annotation quality on an intermittent basis. In addition to assessing quality, gold standard answers can be used to provide near real-time feedback to workers for each object they labeled (or failed to label), e.g., for each face in an image. We refer to this feedback mechanism as “visible gold”. Prior works (Le et al., 2010; Gadiraju et al., 2015; Doroudi et al., 2016) have investigated the use of visible gold for simple tasks with binary outcomes or multiple choices and focused on presenting visible gold questions upfront or with a static gold-to-task ratio.
Figure 3 shows our interface to provide feedback to workers using visible gold for the variable effort annotation task used in our study. Workers encounter the interface immediately after submitting their answer to a visible gold question. The feedback provided to workers included the number of objects missed (i.e., false negatives), the number of bounding boxes drawn by the worker that did not match any objects in the answer key (i.e., false positives), the accuracy for each bounding box correctly matching an object (i.e., for each true positive) and the average accuracy across all bounding boxes in the answer key (using 0% accuracy for false negatives). Accuracy for individual bounding boxes was calculated as “intersection over union” (IoU), i.e., the ratio between the area of overlap and the area of union of the worker annotation and the ground truth bounding box. IoU is a standard accuracy measure in object detection tasks. Finally, all ground truth bounding boxes were displayed to workers along with their own annotations as part of the feedback interface. While gold standard annotations were available for all images in our study, the interface was designed to dynamically decide whether a particular question should be a visible gold for a given worker based on the experimental condition.
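The feedback metrics described above can be sketched as follows. The box format `(x_min, y_min, x_max, y_max)`, the function names, and in particular the greedy rule for matching worker boxes to answer-key boxes are assumptions made for illustration; the text does not specify how boxes were matched.

```python
from typing import Dict, List, Tuple

# Box as (x_min, y_min, x_max, y_max); this format is an assumption.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union: area of overlap divided by area of union."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def gold_feedback(worker: List[Box], gold: List[Box]) -> Dict[str, object]:
    """Compute false negatives, false positives, per-box IoU, and mean accuracy.

    Assumed matching rule: greedily pair the highest-IoU worker/gold boxes,
    one-to-one, and treat pairs with zero overlap as unmatched.
    """
    pairs = sorted(
        ((iou(w, g), wi, gi) for wi, w in enumerate(worker) for gi, g in enumerate(gold)),
        reverse=True,
    )
    used_worker, matched = set(), {}
    for score, wi, gi in pairs:
        if score <= 0.0:
            break  # remaining pairs do not overlap at all
        if wi in used_worker or gi in matched:
            continue
        used_worker.add(wi)
        matched[gi] = score
    # Missed gold boxes count as 0% accuracy in the average, as described in the text.
    mean_accuracy = sum(matched.values()) / len(gold) if gold else 0.0
    return {
        "false_negatives": len(gold) - len(matched),
        "false_positives": len(worker) - len(matched),
        "per_box_iou": matched,
        "mean_accuracy": mean_accuracy,
    }


gold_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
worker_boxes = [(0, 0, 10, 5), (50, 50, 60, 60)]
print(gold_feedback(worker_boxes, gold_boxes))
```

In this example the worker finds one of two faces with an IoU of 0.5, misses the other, and draws one spurious box, so the mean accuracy over the answer key is 0.25.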
As workers get immediate performance metrics through visible golds, they are motivated to produce high-quality work even when the task involves more effort. In addition, through the detailed feedback mechanism, visible golds can provide increased task clarity and help workers understand and improve their task performance. We expect that these characteristics will help set accurate expectations and motivate workers to carefully attempt tasks that involve an increased effort.
We conducted our crowdsourcing experiments on the Amazon Mechanical Turk (MTurk) platform and used a consistent experimental setup as described in Section 3.4. On average, workers received a payment of $0.04 per bounding box (e.g., a worker receives a payment of $0.40 for a HIT that includes an image with 10 bounding boxes).
Table 1 presents a summary of results, including the mean and standard error for mIoU values and average task time for each condition. In each condition, mIoU values follow a non-normal distribution. A Mann-Whitney test with Bonferroni correction for multiple comparisons shows that, compared to the baseline, mIoU values are significantly lower in the iterative improvement, post-task bonus, task decomposition oracle, and variable pay conditions, and not significantly different in the task decomposition manual condition. However, mIoU values in visible gold regular are significantly higher compared to the baseline.
| Condition | Mean (mIoU) | SE (mIoU) | Time (sec) |
|---|---|---|---|
| Post-task Bonus on work load | 59.5 | 1.08 | 243.4 |
| Task Decomposition Manual | 71.7 | 0.74 | 613.8 |
| Task Decomposition Oracle | 67.5 | 0.90 | 557.7 |
| Variable Pay on work load | 69.1 | 0.79 | 227.2 |
| Visible Gold - Regular | 75.5 | 0.58 | 168.4 |
Figure 3(a) shows how task time per bounding box varies according to the number of ground truth target objects available in the image.
5.1.1. Impact of Task effort
Complex crowdsourcing tasks also include variable efforts within HITs. In object annotation, the number of bounding boxes that need to be annotated in each image can range from 1 to many. In Figure 4, we examine how the output quality varies when the task effort increases. We see that our improved visible gold method consistently outperforms other methods in terms of mIoU (Figure 3(c)) and recall at mIoU > 0.5 (Figure 3(b)).
Figure 5 shows how mIoU varies depending on the size of each ground truth bounding box. The worker output quality is relatively low for smaller objects. While this trend is visible across all the methods, the improved visible gold method performs well above the other methods regardless of the target object size.
5.1.2. Task Completion
Task completion patterns based on the task submit time for variable pay and baseline conditions are given in Figure 6.
5.2. Analysis of Findings
Based on the results presented in Table 1, the visible gold method results in the highest mIoU across all the attempted quality improvement methods. All the other quality improvement methods fail to surpass the baseline when considering the mIoU value.
To produce high-quality annotations, we also want crowd workers to carefully complete the task by spending ample time. As seen in Figure 3(a), time spent on a work unit or a single object annotation diminishes when the number of objects in the image increases. On the contrary, task decomposition conditions that appear as outliers in Figure 3(a) can ensure consistent work time on each unit when compared to other conditions. However, we did not obtain the desired quality improvement. As given in Table 1, mIoU values from task decomposition conditions are lower than the baseline condition.
As shown in Figure 4, output quality measured in mIoU and the percentage of annotated ground truth boxes drops as the number of target objects in images increases. While this general trend is present in all experiment conditions, Visible Gold, Baseline, and Task Decomposition (Manual and Oracle) perform relatively better compared to other methods. Output quality drastically drops with effort in the iterative improvement and post-task bonus conditions. For post-task bonus, this is mainly due to the low task base-price. For instance, when a worker accurately labels an image with 12 target objects, they receive a base pay of $0.08 and a bonus of $0.40. Although we use a reputed requester account with a 99% approval rate, we can argue that workers are still unwilling to commit to a task with low specified payment and do additional work without a payment guarantee. The reason for the drop in the iterative improvement condition is less clear. Prior work also notes that data quality can be reduced when subsequent workers are led down the wrong path in an iterative workflow of tasks with high difficulty (Little et al., 2010). Another possible cause is that workers failed to fully understand the task instructions, confused the task with standard verification jobs, and prematurely marked it as completed.
As a general trend, mIoU in our object detection task decreases when the target object is smaller. There are two possible causes for this observation. First, workers could completely miss smaller target objects during the annotation process when there are many target objects present in the image (as seen in Figure 4). Second, when the object is smaller, the impact on error is also higher (e.g., the error caused by missing the margin by a single pixel is higher for relatively smaller target objects).
In contrast to other conditions, workers could pick tasks with specific object counts under the variable pay condition. In Figure 6, we observe that workers prioritized tasks with higher effort and higher pay.
6. Improving Visible Gold
Our initial evaluation shows how the visible gold method can result in annotations with higher quality in the object detection task when compared to other quality improvement methods. Also, the visible gold method, which relies on gold standard questions, is potentially more generalizable across many task types as opposed to other methods like task decomposition and iterative improvement. The applicability of visible gold is also not limited to tasks that involve a variable effort. These factors led us to further investigate visible gold as a promising generic quality improvement method for crowdsourcing.
In this section, we detail how we refined our visible gold method. First, we explore different visible gold issuing patterns. Second, we evaluate how bonuses and warnings work as consequences when using visible gold. Finally, we present an improved visible gold method that incorporates tier-based consequences and dynamic visible gold issuing.
6.1. Visible Gold Issuing Pattern
We evaluated three ways of issuing visible gold questions and obtained five responses per image in each condition.
Upfront: Workers complete a fixed number of visible gold questions at the beginning. This condition is comparable to the explicit training in previous work by Gadiraju et al. (Gadiraju et al., 2015). Upfront condition is straightforward to implement and can be considered as a variant of a qualification test.
Regular: Workers encounter visible gold questions at regular intervals throughout the task, with a visible gold to task ratio of 1:4.
Fib+Regular: We propose a new strategy that combines the characteristics of the Upfront and Regular conditions. We seek to present more visible gold questions at the beginning so that we can test workers reliably while providing ample training examples. As workers continue, we want to test less frequently. We achieve this by following the Fibonacci sequence: for the first 50 questions, we follow the Fibonacci sequence (i.e., 1, 2, 3, 5, 8, 13, ...) to issue visible golds, and then fall back to a more infrequent regular visible gold to task ratio of 1:19.
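To make the issuing patterns concrete, the sketch below lists the HIT positions at which a visible gold would be shown. It assumes that the Fibonacci numbers are interpreted as 1-based HIT positions, that the 1:19 phase places a gold at the first HIT of each block of 20 after question 50, and that a regular 1:4 pattern places a gold at the last HIT of each block of 5. These placement details and the function names are our assumptions, not something the text specifies.

```python
from typing import List


def regular_gold_positions(n_hits: int, period: int = 5) -> List[int]:
    """1-based positions of visible golds for a regular pattern (1 gold per `period` HITs)."""
    return list(range(period, n_hits + 1, period))


def fib_regular_gold_positions(n_hits: int,
                               fib_limit: int = 50,
                               period: int = 20) -> List[int]:
    """1-based positions of visible golds for the Fib+Regular pattern.

    Phase 1: golds at Fibonacci positions 1, 2, 3, 5, 8, ... up to `fib_limit`.
    Phase 2 (assumed placement): one gold at the start of every block of `period` HITs.
    """
    positions = []
    a, b = 1, 2
    while a <= min(fib_limit, n_hits):
        positions.append(a)
        a, b = b, a + b
    positions.extend(range(fib_limit + 1, n_hits + 1, period))
    return positions


fib_positions = fib_regular_gold_positions(100)
print(fib_positions)
print(len(fib_positions), "golds vs", len(regular_gold_positions(100)), "for a regular 1:4 pattern")
```

Under these assumed placements, a worker who completes 100 HITs sees 11 visible golds with Fib+Regular and 20 with a regular 1:4 pattern, with 8 of the 11 falling within the first 34 HITs.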
6.2. Bonus vs. Warning as a consequence
For Upfront, Regular and Fib+Regular conditions, we used warning as the consequence where we warned workers that they would not be able to attempt future tasks if their outcome measured via quality checks (i.e., visible gold questions) does not meet the expected quality standard. However, we did not block any workers during the task or remove any contributions from our analysis. To examine if incentivizing workers to produce high-quality annotations works better than the warning, we deployed the following additional condition.
Regular Bonus: Similar to the ‘Regular’ condition, workers encounter a visible gold regularly with a visible gold to task ratio of 1:4. For the consequence, instead of the warning, workers receive a bonus payment of $0.08 (face count less than 8) or $0.22 (face count greater than 7) per image if they maintain an accuracy above a pre-specified threshold. Bonus amounts and thresholds were specified in the task description.
6.3. Dynamic visible gold and tier-based consequences
Based on the literature and results obtained in our first round of experiments detailed above, we further improved our visible gold mechanism by dynamically adjusting the visible gold issuing pattern and by adding a performance metric display element.
During our initial experiments, a handful of workers ignored the training and guidance provided by visible gold questions and continued to produce low-quality work. To counter this, we altered the Fib+Regular visible gold issuing pattern by adding bonusing and blocking conditions with pre-defined quality thresholds. Bonus thresholds determined whether a worker qualified for a bonus payment. The blocking threshold was the minimum mIoU that a worker needed to achieve to pass a given visible gold. When a worker completed a visible gold with their mIoU for the current image falling below the blocking threshold, we overrode the standard visible gold pattern and issued another visible gold as the next HIT. We continued to issue visible golds until either the worker passed a visible gold or they met the blocking condition by failing three consecutive visible golds with an overall average accuracy below the blocking threshold.
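The override and blocking logic described above can be read as a small state machine, sketched below. The class name, the action labels, and the threshold value in the example are hypothetical, and we assume that the "overall average accuracy" is the mean mIoU over all visible golds the worker has completed so far.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GoldTracker:
    """Tracks one worker's visible gold results and decides what to issue next."""
    block_threshold: float           # hypothetical value; not taken from the text
    max_consecutive_fails: int = 3   # three consecutive failures, as described in the text
    scores: List[float] = field(default_factory=list)
    consecutive_fails: int = 0
    blocked: bool = False

    def record_gold(self, miou: float) -> str:
        """Record the mIoU on a visible gold and return the next action."""
        self.scores.append(miou)
        if miou >= self.block_threshold:
            # Passed: go back to the standard issuing pattern.
            self.consecutive_fails = 0
            return "resume_schedule"
        self.consecutive_fails += 1
        # Assumption: the overall average covers every gold completed so far.
        average = sum(self.scores) / len(self.scores)
        if self.consecutive_fails >= self.max_consecutive_fails and average < self.block_threshold:
            self.blocked = True
            return "block"
        # Failed but not blocked: override the pattern and issue another gold next.
        return "issue_gold"


tracker = GoldTracker(block_threshold=0.5)
for score in (0.8, 0.3, 0.2, 0.1):
    print(score, "->", tracker.record_gold(score))
```

In this example the worker passes the first gold, then fails three in a row with a running average of 0.35, which triggers the blocking condition.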
To increase transparency, we added a dedicated performance metric banner at the top of the task interface as seen in Figure 7. A worker could see their current average accuracy and the relevant quality tier in a simplified manner. The content of the banner was updated as workers completed visible golds.
We ran additional experiments with the improved visible gold interface. We collected five responses per image and kept all other parameters regarding the experiment consistent with previous deployments detailed in Crowdsourcing Experiments (Section 3.4).
7. Evaluation II
A summary of results for visible gold conditions, including the mean and standard error for mIoU values and average task time is given in Table 2.
| Condition | Mean (mIoU) | SE (mIoU) | Time (sec) |
|---|---|---|---|
| Visible Gold - Regular | 75.5 | 0.58 | 168.4 |
| Visible Gold - Upfront | 75.5 | 0.53 | 165.6 |
| Visible Gold - Fib+Regular | 75.7 | 0.58 | 177.0 |
| Visible Gold - Regular Bonus | 74.7 | 0.59 | 193.9 |
| Visible Gold - Improved | 79.3 | 0.41 | 168.0 |
7.1.1. Visible Gold Execution
In our initial experiments to identify a suitable visible gold execution strategy, all strategies produced better outcomes than the baseline. The mIoU values obtained in each condition follow a non-normal distribution. A Mann-Whitney test with Bonferroni correction for multiple comparisons shows that visible-gold-upfront, visible-gold-regular, and visible-gold-fib+regular have significantly higher mIoU values compared to the baseline.
In order to identify the most suitable visible gold execution strategy, we separated answers into three bins based on the HIT completion order and plotted the mIoU metric in Figure 8(a). In Figure 8(b), we also show the variation in the distribution of the total number of tasks completed by each worker.
7.1.2. Visible Gold Improved
A Mann-Whitney test indicated that the mean mIoU in the improved visible gold condition is significantly higher than the mean mIoU in the baseline. Results in the improved condition were also significantly higher than in the visible-gold-fib+regular condition, which provided the best outcome in the first round of experiments.
7.1.3. Bonus vs. Warning
We compare warnings and bonuses as consequences tied to performance on visible gold questions. Our results show that there is no significant difference between the bonus and warning conditions.
7.1.4. Impact of Task effort
In Figure 10, we revisit how the output quality varies when the task effort increases. We see that our improved visible gold method consistently outperforms other visible gold methods and the baseline in terms of mIoU (Figure 9(b)) and recall at mIoU > 0.5 (Figure 9(a)).
Figure 11 shows how mIoU varies depending on the size of each ground truth bounding box. The worker output quality is relatively low for smaller objects. While this trend is visible across all the methods, the improved visible gold interface outperformed other methods regardless of the target object size.
7.2. Analysis of Findings
We first examined the optimum visible gold execution pattern. Our analysis revealed that visible-gold-fib+regular is superior to the standard upfront (Gadiraju et al., 2015) or regular (Le et al., 2010) methods. In addition to producing a marginally better overall mIoU score, the fib+regular method helps us preserve the data quality as workers continue to complete tasks. As seen in Figure 8(a), work quality declines as workers complete more tasks in both baseline and visible-gold-upfront conditions. However, in visible-gold-regular and visible-gold-fib+regular, work quality remains steady as work continues. The fib+regular method is also more advantageous for jobs with a large number of HITs. For example, when a worker completes 100 HITs, the regular pattern issues visible golds for 20% of HITs, whereas fib+regular does so for only 11%. In Figure 8(b), we also observe that a large portion of workers tend to leave the task after completing several HITs in visible-gold conditions. This positive observation suggests that certain workers left the task as they were confronted with quality checks.
Concerning bonus and warning, our results indicate no significant difference. We used these findings to inform the design of the improved method, where we used fib+regular as the base visible gold issuing pattern and incorporated both bonus and warning with a tier-based design.
Table 2 presents the impact of different visible gold presentation strategies on the quality of responses for our object detection task. Note that there is little variation in average time per task among the different visible gold presentation strategies when compared with the other quality-improvement strategies (Table 1). On the other hand, the overall quality of annotations increased markedly once we implemented the improved task interface with different “tiers” of performance, suggesting that dynamic feedback with clear and transparent communication about penalties and rewards incentivizes higher quality. Variation in the cadence of visible gold presentation appears to have had less impact than the improvement in the task interface.
Figure 10 demonstrates that the improvement from the improved interface can be primarily attributed to better performance on the tasks requiring the most effort as measured by the number of ground truth boxes in the image. The dynamic tracking and reporting of the worker’s running accuracy score on gold data may have made the impact of a small number of poor annotations on a worker’s overall performance more clear, incentivizing increased attention to detail on the more difficult tasks.
In this paper, we investigated how data quality improvement mechanisms perform for tasks that involve variable effort. We evaluated common quality enhancement methods and showed that the visible gold method produced annotations of significantly higher quality. We further refined and evaluated the visible gold method, demonstrating the effectiveness of combining upfront and regular testing patterns. Our results also suggest that workers produce better annotations when confronted with consequences via the visible gold feedback interface. However, there was no detectable difference in how bonuses (for high-quality work) and warnings (for low-quality work) affected output quality in the context of visible gold.
8.1. Data Quality and Variable Effort Tasks
Our systematic evaluation shows that both object count and object size can impact the annotation quality in the object detection task. Our results are consistent with prior work that shows data quality suffers when tasks involve increased effort (Das Sarma et al., 2016; Kazai, 2011). While there are numerous other crowdsourcing data quality improvement methods (e.g., work strategies (Han et al., 2020), task assignment (Hettiachchi et al., 2020)), in this work, we are primarily interested in methods that can potentially support variable effort crowdsourcing.
Initially, we hypothesized that data quality improvement methods that aim to either standardize the effort (e.g., task decomposition and iterative improvement) or match the pay according to the effort (e.g., variable pay and post-task bonus) should work better when compared to a baseline average pay scheme. However, as detailed in Table 1 and Figure 4, none of these methods were successful in surpassing the baseline. While literature (Bernstein et al., 2010; Little et al., 2010) highlights that these methods can improve data quality in specific tasks and scenarios, our work shows that such improvements may not hold when the task effort varies. Prior work also highlights that static workflows perform poorly for complex crowd tasks (Retelny et al., 2017). Therefore, it is important to consider the task effort when using crowdsourcing to obtain annotations as well as when evaluating methods related to crowd work on tasks like bounding box annotation.
However, we show that out of the detailed methods, visible gold is the most useful method in preserving data quality in variable effort tasks.
8.2. Visible Gold for Training
When creating the visible gold mechanism for object detection, we considered three dimensions highlighted in the crowdsourcing literature on worker feedback (Dow et al., 2012). In terms of timeliness, we designed our feedback to be synchronous. However, to prevent workers from guessing which tasks were visible gold questions, they were shown feedback only after completing the entire task, i.e., drawing all the bounding boxes for a particular image. Regarding specificity, we provided automated yet detailed feedback (Figure 3), including aggregate metrics on the image level and fine-grained metrics for each bounding box.
Our study extends prior work on visible gold (Gadiraju et al., 2015; Le et al., 2010) by integrating existing testing patterns into a hybrid pattern (Fib+Regular) with both upfront and regular testing. Our results demonstrate that this hybrid pattern is more effective at maintaining annotation quality over the course of large amounts of tasks. The capacity to sustain data quality is particularly important as crowd work typically follows a power-law distribution where only a few workers self-select to complete the majority of work, whereas the remainder of the workforce abandons a task early on (Rogstadius et al., 2011). Further, our results in Figures 11 and 10 show that the improved visible gold mechanism is robust when the task involves variable effort with respect to object count and object size. Our findings are in line with prior work that uses periodic bonus payments (Difallah et al., 2014) and achievement priming (Gadiraju and Dietze, 2017) to motivate high quality crowd work.
In addition to improving annotation quality, we argue that our visible gold method possesses a variety of positive attributes from a worker’s perspective. First, our method provides transparency around the expected annotation accuracy. Second, it provides task clarity (Gadiraju et al., 2017) to workers by demonstrating how a task should be done by means of concrete examples. Third, visible gold provides feedback to workers, helping them understand their individual task performance, correct specific mistakes, and improve their subsequent work. These three factors contribute to a better understanding of expected task outcomes and reduce the possibility of workers abandoning tasks (Han et al., 2019), leading to unpaid work. Finally, visible gold provides the opportunity for workers to give feedback to the requester if gold standard annotations are faulty or if evaluation results seem to be incorrect (e.g., if there is a bug in the evaluation metric code), which is not possible with hidden gold.
8.3. Implementing Visible Gold
Our visible gold method can be easily implemented by task requesters or by the crowdsourcing platform itself. We discuss several factors that should be carefully considered in a practical implementation.
Generalizability: We anticipate that it is possible to develop visible gold templates for many other tasks. Prior work demonstrates how this can be achieved for text-based classification tasks (Le et al., 2010; Gadiraju et al., 2015) and content creation tasks (Gadiraju et al., 2015). It would be straightforward to extend the current work for certain tasks like semantic segmentation and keypoint annotation (Kovashka et al., 2016) but would require additional effort if the task has no objective evaluation metric like accuracy or mIoU. While it is trivial to present feedback based on visible golds for object detection, additional explanations would be desired for certain complex tasks. Future work can explore how to augment visible golds with explanations or rationales.
In the experiment, we picked threshold values for bonuses and blocks based on percentiles in the baseline data. When implementing visible gold in a crowdsourcing platform, requesters can either set these values based on absolute quality expectations or calibrate thresholds based on initial insights from pilot jobs.
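One way to calibrate thresholds from pilot or baseline data is sketched below using percentiles of observed mIoU values. The specific percentiles (10th for blocking, 75th for the bonus), the function name, and the synthetic data are hypothetical choices for illustration; the text does not state which percentiles were used.

```python
import statistics
from typing import Dict, Sequence


def thresholds_from_baseline(baseline_miou: Sequence[float],
                             block_pct: int = 10,
                             bonus_pct: int = 75) -> Dict[str, float]:
    """Derive blocking and bonus thresholds as percentiles of baseline mIoU values.

    The default percentiles are hypothetical and should be tuned per task.
    """
    if not (1 <= block_pct <= 99 and 1 <= bonus_pct <= 99):
        raise ValueError("percentiles must be between 1 and 99")
    # quantiles(n=100) returns the 99 cut points between percentiles 1..99.
    cuts = statistics.quantiles(baseline_miou, n=100, method="inclusive")
    return {"block": cuts[block_pct - 1], "bonus": cuts[bonus_pct - 1]}


# Synthetic pilot data: mIoU scores in percentage points from 0 to 100.
pilot_scores = list(range(101))
print(thresholds_from_baseline(pilot_scores))
```

Requesters with absolute quality expectations could skip this step and set the two values directly.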
Generating Gold: As visible golds serve as training examples, it is also important to create a reliable set of visible gold standard questions. Like hidden gold questions, visible gold should be representative of the dataset and should sufficiently cover edge cases and ambiguous tasks. While our study assumes that there exists a single objective ground truth answer for any given task, in practice, many tasks are ambiguous (Schaekermann et al., 2018). Future work may test evaluation strategies that accept multiple valid ground truth answers (Chang et al., 2017; Schaekermann et al., 2020). The problem of generating high-quality gold standards is a non-trivial process and remains an open research challenge. An important challenge for future work is how visible gold mechanisms can work when available ground truth data are imperfect or noisy.
Presenting Feedback: The visible gold presentation in the current work includes detailed feedback for each visible gold answer, providing an overall accuracy score along with metrics and visual feedback for each work unit. We also added continuous feedback on quality checks during further refinements. However, we identify several future improvements for visible gold interfaces. First, future designs may introduce a ‘revise and resubmit’ mechanism. The design implemented in this study did not let workers adjust their original annotation after being presented with feedback on a visible gold question. Correcting their original annotation according to the gold answer could help workers understand how to achieve higher task accuracy through experiential learning. Second, the interface could be enhanced by adding interactive feedback. In the current implementation, workers only receive feedback on their work once they submit the answer. It is also possible to explore whether interactive feedback for partially completed tasks is helpful for the workers. This can be more meaningful for highly complex tasks such as bounding box or 3D point annotation tasks with a large number of target objects in the same image/task. Third, similar to worker-led instruction refinement (K. Chaithanya Manam et al., 2019) or workflow creation (Kulkarni et al., 2012), we could encourage workers to provide feedback on visible gold. If a worker encounters a flawed visible gold question, there should be a way to flag it or provide detailed feedback. This way, requesters can remove the reported visible gold questions from the task. As worker quality is measured through visible golds, this is an important enhancement when implementing visible gold at scale. In addition to the information provided regarding work quality, we could explore interventions such as micro-diversions (Dai et al., 2015), dedicated training sub-tasks (Doroudi et al., 2016), mandatory instruction documents, etc.
Testing Patterns: The adaptive gold execution strategy introduced in this work can be expanded by considering worker quality metrics outside the current job. For instance, if we already know a particular worker is doing well in object detection based on previous jobs, we can reduce the visible gold frequency. It is also possible to draw from prior work that investigates how to issue hidden gold questions optimally (Bragg et al., 2016). Finally, the positive impact of the tiering system in the improved task interface suggests two interesting directions for future study: (1) whether we can expand it to a platform-level system and (2) how we can estimate/dynamically vary threshold quality values and rewards to improve performance.
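To make the first direction concrete, the sketch below shows one possible way to lower the visible gold frequency for workers with a strong track record. It is an assumption-laden illustration rather than the adaptive strategy evaluated in this paper: the blending weight, the target accuracy, the base and minimum rates, and the linear schedule are all hypothetical.

```python
from typing import Optional


def visible_gold_rate(
    prior_accuracy: Optional[float],
    current_accuracy: float,
    base_rate: float = 0.2,    # assumed default share of tasks that are gold
    min_rate: float = 0.05,    # assumed floor so testing never stops entirely
    target: float = 0.8,       # assumed accuracy above which testing is relaxed
    prior_weight: float = 0.5, # assumed trust placed in previous jobs
) -> float:
    """Fraction of a worker's tasks that should be visible gold questions."""
    if prior_accuracy is None:
        # No history from earlier jobs: rely on the current job only.
        estimate = current_accuracy
    else:
        estimate = prior_weight * prior_accuracy + (1 - prior_weight) * current_accuracy

    if estimate <= target:
        return base_rate

    # Linearly reduce the rate from base_rate (at target) to min_rate (at 1.0).
    fraction = (estimate - target) / (1.0 - target)
    return base_rate - fraction * (base_rate - min_rate)
```

Keeping a non-zero floor reflects the concern that worker quality can drift within a job, so even trusted workers would continue to see occasional gold questions.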
The use of visible golds also has inherent limitations that practitioners should be aware of. Because visible golds reveal themselves to the worker, workers can easily flag those HITs as gold questions, and with the help of third-party plugins, other workers may be able to detect in advance whether a particular HIT is gold (Checco et al., 2020, 2018).
We acknowledge several limitations of our study. First, in our crowdsourcing experiments, workers were allowed to attempt an arbitrary number of questions instead of being assigned a fixed quota. We made this design decision to match the typical workflow in crowdsourcing platforms and therefore ensure the ecological validity of our work. As a result, the distribution of work between workers was uneven. Second, we did not specifically collect worker demographic information that may impact the output quality. However, we utilized a curated worker pool that excludes workers who would intentionally spam the task or produce low-quality data.
In this paper, we systematically evaluated the impact of existing quality improvement methods for tasks involving variable effort. Our results from a series of crowdsourced experiments in the context of object detection show that providing feedback to human annotators via visible gold can produce better quality outcomes than methods aiming to balance effort and pay at the item level through adjusting pay per item or decomposing the task into chunks of similar effort. We further designed and empirically evaluated variants of the visible gold method testing different issuance patterns and quality-related consequences. Our final design iteration of visible gold combined dynamic testing patterns with tier-based consequences and significantly improved bounding box accuracy by 5.7% compared to a basic visible gold variant and by 7.5% compared to a baseline without visible gold. Our work broadens the understanding of quality assurance processes for variable effort annotation tasks and emphasizes the value of visible gold-based feedback mechanisms in this process.
Acknowledgements. We thank the many talented Amazon Mechanical Turk workers who contributed to our study and made this work possible. We also thank our reviewers for their valuable feedback. We further acknowledge other members of the human-in-the-loop (HIL) science team for their valuable comments. Any opinions, findings, and conclusions or recommendations expressed by the authors are entirely their own.
- Iterative bounding box annotation for object detection. In International Conference on Pattern Recognition (ICPR), Cited by: §2.5, §2.5.
- Faster bounding box annotation for object detection in indoor scenes. In 2018 7th European Workshop on Visual Information Processing (EUVIP), pp. 1–6. External Links: Cited by: §2.5.
- SimilarHITs: Revealing the Role of Task Similarity in Microtask Crowdsourcing. In Proceedings of the 29th on Hypertext and Social Media, HT ’18, New York, NY, USA, pp. 115–122. External Links: Cited by: §3.1.
- Collaborative workflow for crowdsourcing translation. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work, CSCW ’12, New York, NY, USA, pp. 1191–1194. External Links: Cited by: §2.2, §4.4.
- Beat the machine: challenging workers to find the unknown unknowns. In Proceedings of the AAAI Conference on Human Computation, HCOMP, pp. 2–7. Cited by: §3.1.
- Soylent: A Word Processor with a Crowd Inside. In Proceedings of the 23nd Annual ACM Symposium on User Interface Software and Technology, UIST ’10, New York, NY, USA, pp. 313–322. External Links: Cited by: §2.2, §4.1, §8.1.
- Salient object detection: a survey. Computational Visual Media 5 (2), pp. 117–150. External Links: Cited by: §4.1, §4.3.
- Optimal testing for crowd workers. Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS, pp. 966–974. External Links: Cited by: §2.3, §8.3.
- Amazon’s mechanical turk: a new source of inexpensive, yet high-quality, data?. Perspectives on Psychological Science 6 (1), pp. 3–5. Cited by: §2.1.
- Chain Reactions: The Impact of Order on Microtask Chains. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI ’16, New York, NY, USA, pp. 3143–3154. External Links: Cited by: §3.1.
- Revolt: collaborative crowdsourcing for labeling machine learning datasets. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, CHI ’17, New York, NY, USA, pp. 2334–2346. External Links: Cited by: §8.3.
- All that glitters is gold – an attack scheme on gold questions in crowdsourcing. In Proceedings of the Sixth AAAI Conference on Human Computation and Crowdsourcing, HCOMP. Cited by: §2.3, §8.3.
- Adversarial attacks on crowdsourcing quality control. Journal of Artificial Intelligence Research 67, pp. 375–408. External Links: Cited by: §2.3, §8.3.
- Measuring crowdsourcing effort with error-time curves. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, CHI ’15, New York, NY, USA, pp. 1365–1374. External Links: Cited by: §3.5.
- POMDP-based control of workflows for crowdsourcing. Artificial Intelligence 202, pp. 52 – 85. External Links: Cited by: §2.2.
- And now for something completely different: improving crowdsourcing workflows with micro-diversions. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, CSCW ’15, New York, NY, USA, pp. 628–638. External Links: Cited by: §8.3.
- Quality Control in Crowdsourcing: A Survey of Quality Attributes, Assessment Techniques, and Assurance Actions. ACM Computing Surveys 51 (1), pp. 1–40. External Links: Cited by: §2.3.
- Towards Globally Optimal Crowdsourcing Quality Management. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD ’16, Vol. 26-June-20, New York, NY, USA, pp. 47–62. External Links: Cited by: §3.1, §8.1.
- Scaling-up the Crowd: Micro-Task Pricing Schemes for Worker Retention and Latency Improvement. In Second AAAI Conference on Human Computation and Crowdsourcing (HCOMP), pp. 50–58. Cited by: §8.2.
- Toward a learning science for complex crowdsourcing tasks. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI ’16, New York, NY, USA, pp. 2623–2634. External Links: Cited by: §2.4, §2.4, §3.2, §4.5, §8.3.
- Shepherding the Crowd Yields Better Work. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work, CSCW ’12, New York, NY, USA, pp. 1013–1022. External Links: Cited by: §2.4, §2.4, §8.2.
- iCrowd: An Adaptive Crowdsourcing Framework. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15, New York, NY, USA, pp. 1015–1030. External Links: Cited by: §2.3.
- Improving Learning through Achievement Priming in Crowdsourced Information Finding Microtasks. In Proceedings of the Seventh International Learning Analytics & Knowledge Conference, LAK ’17, New York, NY, USA, pp. 105–114. External Links: Cited by: §8.2.
- Training workers for improving performance in crowdsourcing microtasks. In Design for Teaching and Learning in a Networked World, pp. 100–114. External Links: Cited by: §1, §2.4, §2.4, §4.5, 1st item, §7.2, §8.2, §8.3.
- Clarity is a Worthwhile Quality: On the Role of Task Clarity in Microtask Crowdsourcing. In Proceedings of the 28th ACM Conference on Hypertext and Social Media, HT ’17, New York, NY, USA, pp. 5–14. External Links: Cited by: §8.2.
- Understanding crowdsourcing workflow: modeling and optimizing iterative and parallel processes. In Proceedings of the Fourth AAAI Conference on Human Computation and Crowdsourcing, HCOMP, Vol. 4. Cited by: §2.2, §2.2, §4.4.
- Crowdsourcing document relevance assessment with mechanical turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, Los Angeles, pp. 172–179. Cited by: §2.1.
- Crowd worker strategies in relevance judgment tasks. In WSDM 2020 - Proceedings of the 13th International Conference on Web Search and Data Mining, External Links: Cited by: §8.1.
- All those wasted hours: on task abandonment in crowdsourcing. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM ’19, New York, NY, USA, pp. 321–329. External Links: Cited by: §3.5, §8.2.
- Combining crowdsourcing and google street view to identify street-level accessibility problems. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’13, New York, NY, USA, pp. 631–640. External Links: Cited by: §2.3.
- CrowdCog: A Cognitive Skill based System for Heterogeneous Task Assignment and Recommendation in Crowdsourcing. Proceedings of the ACM on Human-Computer Interaction 4 (CSCW2), pp. 1–22. External Links: Cited by: §8.1.
- Incentivizing High Quality Crowdwork. In Proceedings of the 24th International Conference on World Wide Web, WWW ’15, Republic and Canton of Geneva, Switzerland, pp. 419–429. External Links: Cited by: §2.1, ToDo 4.
- The labor economics of paid crowdsourcing. In Proceedings of the 11th ACM Conference on Electronic Commerce, EC ’10, New York, NY, USA, pp. 209–218. External Links: Cited by: §2.1.
- Enhancing reliability using peer consistency evaluation in human computation. In Proceedings of the 2013 Conference on Computer Supported Cooperative Work, CSCW ’13, New York, NY, USA, pp. 639–648. External Links: Cited by: §2.3.
- Minimizing efforts in validating crowd answers. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15, New York, NY, USA, pp. 999–1014. External Links: Cited by: §2.3.
- Pay Enough or Don’t Pay at All. Note: May 13. https://www.behind-the-enemy-lines.com/2011/05/pay-enough-or-dont-pay-at-all.html Cited by: §2.1.
- TaskMate: a mechanism to improve the quality of instructions in crowdsourcing. In Companion Proceedings of The 2019 World Wide Web Conference, WWW ’19, New York, NY, USA, pp. 1121–1130. External Links: Cited by: §8.3.
- In search of quality in crowdsourcing for search engine evaluation. In Advances in Information Retrieval, Berlin, Heidelberg, pp. 165–176. External Links: Cited by: §3.1, §4, §8.1.
- A human/computer learning network to improve biodiversity conservation and research. AI Magazine 34 (1), pp. 10. External Links: Cited by: §3.1.
- CrowdDQS: Dynamic Question Selection in Crowdsourcing Systems. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD ’17, New York, NY, USA, pp. 1447–1462. External Links: Cited by: §2.3.
- How evaluator domain expertise affects search result relevance judgments. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM ’08, New York, NY, USA, pp. 591–598. External Links: Cited by: §3.1.
- CrowdWeaver: visually managing complex crowd work. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work, CSCW ’12, New York, NY, USA, pp. 1033–1036. External Links: Cited by: §2.2.
- CrowdForge: crowdsourcing complex work. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, UIST ’11, New York, NY, USA, pp. 43–52. External Links: Cited by: §2.2, §4.3.
- Crowdsourcing in computer vision. Now Publishers Inc., Hanover, MA, USA. External Links: Cited by: §8.3.
- Collaboratively crowdsourcing workflows with turkomatic. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work, CSCW ’12, New York, NY, USA, pp. 1003–1012. External Links: Cited by: §2.2, §8.3.
- Annotator Rationales for Labeling Tasks in Crowdsourcing. Journal of Artificial Intelligence Research (JAIR) 69, pp. 143–189. External Links: Cited by: §1.
- The open images dataset v4: unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision 128 (7), pp. 1956–1981. External Links: Cited by: §1, §3.2, §3.4.1.
- Ensuring quality in crowdsourced search relevance evaluation: the effects of training question distribution. In SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation, pp. 21–26. Cited by: §1, §2.4, §2.4, §2.4, §4.5, 2nd item, §7.2, §8.2, §8.3.
- Galaxy zoo: morphologies derived from visual inspection of galaxies from the sloan digital sky survey. Monthly Notices of the Royal Astronomical Society 389 (3), pp. 1179–1189. Cited by: §3.1.
- Exploring iterative and parallel human computation processes. In Proceedings of the ACM SIGKDD workshop on human computation, pp. 68–76. Cited by: §2.2, §4.4, §5.2, §8.1.
- Deep learning for generic object detection: a survey. International journal of computer vision 128 (2), pp. 261–318. External Links: Cited by: §2.5.
- Scoring workers in crowdsourcing: how many control questions are enough?. In Advances in Neural Information Processing Systems, Vol. 26, pp. 1914–1922. Cited by: §2.3.
- WingIt: efficient refinement of unclear task instructions. In Sixth AAAI Conference on Human Computation and Crowdsourcing, HCOMP, Vol. 6. Cited by: §2.4.
- Design Activism for Minimum Wage Crowd Work. In Fifth AAAI Conference on Human Computation and Crowdsourcing (HCOMP): Works-in-Progress Track, Cited by: §2.1.
- Financial incentives and the performance of crowds. In Proceedings of the ACM SIGKDD Workshop on Human Computation, pp. 77–85. Cited by: §2.1.
- How One Microtask Affects Another. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI ’16, New York, NY, USA, pp. 3155–3166. External Links: Cited by: §3.1.
- Programmatic gold: targeted and scalable quality assurance in crowdsourcing. In Proceedings of the AAAI Conference on Human Computation, HCOMP. Cited by: §2.3.
- We don’t need no bounding-boxes: training object class detectors using only human verification. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 854–863. External Links: Cited by: §2.5, §2.5.
- Extreme clicking for efficient object annotation. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 4940–4949. External Links: Cited by: §2.5, §2.5.
- Training object class detectors with click supervision. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 180–189. External Links: Cited by: §2.5.
- Toward crowdsourcing micro-level behavior annotations: the challenges of interface, training, and generalization. In Proceedings of the 19th International Conference on Intelligent User Interfaces, IUI ’14, New York, NY, USA, pp. 37–46. External Links: Cited by: §2.4.
- No workflow can ever be enough: how crowdsourcing workflows constrain complex work. Proc. ACM Hum.-Comput. Interact. 1 (CSCW). External Links: Cited by: §8.1.
- An assessment of intrinsic and extrinsic motivation on task performance in crowdsourcing markets. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 5. Cited by: §8.2.
- Crowds and camera traps: genres in online citizen science projects. In Proceedings of the 52nd Hawaii International Conference on System Sciences, Cited by: §3.1.
- Best of both worlds: human-machine collaboration for object annotation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2121–2131. External Links: Cited by: §2.5, §2.5.
- Surpassing humans and computers with jellybean: crowd-vision-hybrid counting algorithms. In Proceedings of the Third AAAI Conference on Human Computation and Crowdsourcing, HCOMP, Vol. 3. Cited by: §3.1, §4.3.
- Becoming the Super Turker: Increasing Wages via a Strategy from High Earning Workers. In Proceedings of The Web Conference 2020, WWW ’20, New York, NY, USA, pp. 1241–1252. External Links: Cited by: §2.4.
- Ambiguity-aware ai assistants for medical data analysis. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI ’20, New York, NY, USA, pp. 1–14. External Links: Cited by: §8.3.
- Resolvable vs. irresolvable disagreement: a study on worker deliberation in crowd work. Proc. ACM Hum.-Comput. Interact. 2 (CSCW). External Links: Cited by: §8.3.
- SetiHome. External Links: Cited by: §3.1.
- Objects365: a large-scale, high-quality dataset for object detection. In Proceedings of the IEEE international conference on computer vision, pp. 8430–8439. Cited by: §3.5.
- Crowdsourcing annotations for visual object detection. In Workshops at the Twenty-Sixth AAAI Conference on Artificial Intelligence, Cited by: §2.5, §2.5, §3.2.
- Reflecting on the darpa red balloon challenge. Communications of the ACM 54 (4), pp. 78–85. Cited by: §3.1.
- Supporting collaborative writing with microtasks. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI ’16, New York, NY, USA, pp. 2657–2668. External Links: Cited by: §4.3.
- SLADE: A Smart Large-Scale Task Decomposer in Crowdsourcing. IEEE Transactions on Knowledge and Data Engineering 30 (8), pp. 1588–1601. External Links: Cited by: §2.2, §4.3.
- Making better use of the crowd: how crowdsourcing can advance machine learning research. The Journal of Machine Learning Research 18 (1), pp. 7026–7071. Cited by: §2.1.
- Help find jim gray. Note: https://www.allthingsdistributed.com/2007/02/help_find_jim_gray.html Cited by: §3.1.
- Human flesh search model incorporating network expansion and gossip with feedback. In 2009 13th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications, pp. 82–88. External Links: Cited by: §3.1.
- Crowd guilds: worker-led reputation and feedback on crowdsourcing platforms. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, CSCW ’17, New York, NY, USA, pp. 1902–1913. External Links: Cited by: §2.4.
- Fair work: crowd work minimum wage with one line of code. In Proceedings of the Seventh AAAI Conference on Human Computation and Crowdsourcing, HCOMP, Vol. 7, pp. 197–206. Cited by: §2.1, §3.3.
- Where’s Wally?. Note: Cited by: §3.1.
- Modeling Task Complexity in Crowdsourcing. HCOMP’16, pp. 249–258. External Links: Cited by: §3.1.
- When does more money work? examining the role of perceived fairness in pay on the performance quality of crowdworkers. Vol. 11. Cited by: §2.1, ToDo 4.
- Bonus or not? learn to reward in crowdsourcing. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15, pp. 201–207. External Links: Cited by: §4.2.
- Predicting crowd work quality under monetary interventions. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, HCOMP, Vol. 4. Cited by: §2.1.
- Reviewing Versus Doing: Learning and Performance in Crowd Assessment. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing, CSCW ’14, New York, NY, USA, pp. 1445–1455. External Links: Cited by: §2.4.