Much of modern machine learning is built on the foundation of human-annotated data. As the application of these models has expanded into more socially embedded and contextually nuanced domains (Arora et al., 2020; Mitra and Gilbert, 2015; Nørregaard et al., 2019), collecting high quality, consistent, and robust data from human annotators has become an increasingly important yet challenging task (Bhuiyan et al., 2020). As one example, the ability to gather human evaluations of the toxicity of a piece of text is a necessary precursor to being able to build toxicity models to support online communities (Wulczyn et al., 2017) as well as capture and mitigate harmful outputs generated by large language models (Gehman et al., 2020).
However, traditional rating methods commonly used today, like absolute or comparative rating, can produce inconsistencies in ratings across annotators and even with a single annotator’s ratings (Aroyo et al., 2019; Salminen et al., 2018). This is due to issues such as lack of a common interpretation of the scale in the case of absolute rating, as well as lack of global context in the case of comparative rating (Weijters et al., 2016; Welty et al., 2019; Clark et al., 2018). Additionally, while current rating methods can capture uncertainty in the ratings, it is difficult to dissect whether the uncertainty is a result of inherent ambiguity in the item—where certain items cannot be confidently distinguished from each other (Dumitrache et al., 2018a)—or from disagreement between annotators on where the item should be placed. Distinguishing these sources of uncertainty offers the potential of better capturing biases between annotators. It also allows us to develop more calibrated models that only make high-confidence distinctions between items when a human would have as well (Guo et al., 2017).
In this paper, we propose a new design for collecting scalar annotations called Goldilocks (somewhat like Goldilocks in “Goldilocks and the Three Bears”, annotators must make use of multiple comparisons) that combines the ability to make direct comparisons between items with the simplicity of a continuous absolute rating scale (Figure 1). To accomplish this, Goldilocks introduces two main ideas. (1) Calibration using Prior Annotations: we provide previously annotated items as anchors to ground interpretations of the scale both within and across annotators. (2) Item-level Resolution Elicitation using Ranges: we use a two-step process to collect lower and upper bounds for each item instead of a single placement. Goldilocks combines strengths from both absolute and comparative ratings as annotators make multiple comparative judgments while placing an item on an absolute scale. In addition, by directly eliciting an annotator’s own judgment of an item’s inherent ambiguity instead of relying on aggregating inter-annotator agreement, Goldilocks can separate agreement from perceived ambiguity.
To understand the effectiveness of these designs, we conducted three studies comparing aspects of the Goldilocks annotation process against traditional methods. In the first experiment, we evaluated whether anchoring scales with a shared set of previously annotated items can improve consistency of item placement across annotators. In the second experiment, we examined whether including an annotator’s own prior annotations as anchors improves self-consistency. Our final experiment evaluated how well ranges captured using Goldilocks can recover the distribution of pairwise relationships as measured by traditional absolute and comparative rating. Each of our experiments was conducted in three domains representative of the subjective or ambiguous rating tasks that can be challenging for traditional methods: judging toxicity of online comments (short text), estimating satiety of food depicted in images (visual), and estimating age from portrait photos (visual).
From the experiments examining anchors, we found that the addition of shared example anchors to ground rating scales improves rating consistency between annotators in domains where shared understanding of the scale is low. We also found indications that showing one’s prior annotations in a session as additional anchors may improve self-consistency on examples where there is high initial uncertainty. From the experiment examining ranges, we found that our two-step range annotation process allows us to infer pairwise relationship distributions that are more robust—simultaneously reflecting both uncertainty of single annotators and disagreement between annotators—compared to alternatives with a single value. Finally, we found that the size of range annotations provides an interpretation of uncertainty that is distinct from the uncertainty modeled via inter-annotator disagreement.
We conclude with a discussion of the limitations and opportunities for Goldilocks. Regarding efficiency, while our approach is more costly than performing just one of absolute or comparative rating, our method is cheaper than performing both, which would be necessary to recover the richer data that Goldilocks generates. We discuss cases where a deeper understanding of uncertainty can be important for generating more trustworthy model predictions. We also discuss what we envision as a scaled-up Goldilocks workflow: utilizing iterative improvement through multiple annotation sessions with designs for bootstrapping the initial set of anchors along with interesting problems to be explored in each of these aspects.
2. Related Work
In this section we review prior work on: (1) growing demand for consistent and robust human rating, (2) prior work building on absolute and comparative rating designs, (3) uncertainty and disagreement in crowd annotation, and (4) making use of uncertainty from human annotators in downstream machine learning tasks.
2.1. Demand for Improving Human Rating
There is a growing demand for human annotation in domains involving ambiguous or subjective examples, largely due to rapid progress in machine learning. Human rating annotation has been used to create or validate a variety of training data, for example, in the domains investigating toxicity (Wulczyn et al., 2017), misinformation and credibility (Bhuiyan et al., 2020; Mitra and Gilbert, 2015), and emotionally manipulative text (Huffaker et al., 2020). However, there is also increasing concern for the robustness of datasets collected (Welty et al., 2019) and whether nuances like uncertainty are being represented (Aroyo and Welty, 2015).
Direct human rating of model output has also become prevalent in the evaluation of high performance models where automated metrics (e.g., BLEU, METEOR) start to fail (Callison-Burch et al., 2006; Agarwal and Lavie, 2008; Denkowski and Lavie, 2010). For example, human rating has been used to evaluate aspects of generative tasks (e.g., summarization, translation) in natural language processing by capturing characteristics like fluency, relevance, and conciseness which cannot be easily and reliably assessed with automated metrics (Graham et al., 2013). Human rating has also commonly been used to evaluate the output of chatbots (Sedoc et al., 2019) or to judge search results (Huffman, 2008) or cluster quality (Zhang et al., 2018). Increasingly, human ratings (both comparative and absolute) are becoming an integral aspect in facilitating comparisons between models through evaluation leaderboards and shared tasks (Specia et al., 2020; Khashabi et al., 2021), where consistency and robustness of comparative results are crucial.
2.2. Absolute and Comparative Rating Designs
One of the most common designs for collecting human ratings today is through absolute rating scales, often in the form of Likert or semantic differential scales (Likert, 1932; Osgood et al., 1957). When a consistent interpretation of the scale can be established across annotators, designs based on absolute rating can offer many benefits such as being very efficient (only requiring a single annotation per item) and providing easily interpretable ratings that are globally contextualized (rather than depending on other items). However, many annotation domains do not have commonly accepted scales, meaning that divergent interpretations of a scale based on abstract text descriptions can become a source of disagreement and inconsistency across annotators (Weijters et al., 2016). Even within an annotator’s own annotations, the lack of a well defined scale means that to maintain consistent ratings, they must refer to their own memory of their past decisions which can be unreliable (Brown et al., 2007). Accounting for these inconsistencies requires additional effort—either through additional calibration (Gardner and Martin, 2007) or just identifying and reporting them (Geva et al., 2019). Absolute scales can also be locally unreliable (Welty et al., 2019)—because items are only ever compared against the scale’s anchors, pairwise comparisons between two items with similar values can only be rigorously done if the measurement resolution (uncertainty around the values) is also accounted for.
As many consistency problems in absolute rating systems result from the lack of direct comparisons between actual items, a natural solution is to look towards the other major alternative: comparative ratings (Thurstone, 1927). In comparative rating systems, items are compared against one another directly, circumventing the need for a scale as a proxy and providing highly reliable measurements of local relationships. This kind of comparison can also be more intuitive for annotators, leading to comparative systems sometimes being suggested as a more accurate alternative for ranking items (Kiritchenko and Mohammad, 2016; Liang et al., 2020). However, collecting comparative ratings can be considerably more costly (on the order of comparisons per item) unless sampling and ranking aggregation methods or partial comparisons, which trade off additional uncertainty, are used (Jin et al., 2020; Kiritchenko and Mohammad, 2016). The focus on local comparisons makes it easy for an annotator to inadvertently produce annotations that are not globally self-consistent, requiring post-hoc corrective action that may not reflect an annotator’s actual judgment. Abandoning global context also means that if a rating score (rather than ranking) is desired, a numeric mapping like Elo rating needs to be done (Clark et al., 2018), which often comes with assumptions about uniform spacing between items.
Past work has explored hybrid approaches that combine aspects of comparative and absolute annotation. For example, Sakaguchi et al. (Sakaguchi and Durme, 2018) present EASL, a hybrid approach where items are rated using continuous absolute scales but similar items are grouped together for annotation allowing for some degree of comparison and contextualization. While similar in motivation, our work differs in that we make comparison an integral part of the annotation process rather than an optional source of context, allowing us to provide more consistency by grounding comparison against global anchors and capture uncertainty intuitively by using comparisons to establish bounds.
Beyond the individual drawbacks mentioned above, neither of the two traditional annotation methods supports effective separation of the sources of uncertainty as a part of the annotation process (Hullermeier and Waegeman, 2019). These sources include both aleatoric uncertainty, or irreducible ambiguity inherent to the item being rated, and epistemic uncertainty, or disagreement on the placement of the item. Absolute rating forces annotators to resolve inherent ambiguity into a precise placement, causing both sources of uncertainty to be mixed. Meanwhile, comparative rating only provides an indirect view into inherent ambiguity through the size of equivalence sets. Separating the two sources of uncertainty is especially desirable as it can be an important tool for understanding properties of the items being annotated separate from biases or divergent interpretations among annotators.
2.3. Addressing Uncertainty and Disagreement
Uncertainty and disagreement have been a long-recognized challenge when collecting crowdsourced annotations of all kinds. Early work in crowdsourcing focused on measuring perception-based objective aspects of items, taking the view that the uncertainty observed as disagreement between workers is the result of random noise from unreliable perception. To address this kind of uncertainty, methods such as majority voting, expectation maximization (Dawid and Skene, 1979), Max-Margin Majority Voting (Tian et al., 2019), and even active learning based approaches (Lin et al., 2016) have been proposed, which attempt to improve the quality of the true measurement signal by aggregating across more annotators and accounting for the varying degree of noise introduced by each annotator. More recent lines of work recognize the deficiencies in single value answers, proposing instead to use answer distributions in the form of allowing multiple labels (Jurgens, 2013; Dumitrache, 2015; Dumitrache et al., 2018a) to capture the sources of uncertainty more comprehensively rather than attempt to remove it. Generally, these aggregation methods rely on large amounts of redundancy; thus, a major focus of prior work in this area has been in improving the efficiency of annotation work through collecting more information for each item or information about more items in each annotation task (Chung et al., 2019; Kiritchenko and Mohammad, 2016).
Another view focuses on the idea that disagreements can arise from divergent interpretations of the data and task specification among workers (Kairam and Heer, 2016; Gordon et al., 2021; Aroyo and Welty, 2015) or even within an annotator as they are exposed to more data. One prior line of work, structured labeling (Kulesza et al., 2014; Chang et al., 2017), proposes that tools and techniques can be designed to assist people in reconciling the evolving interpretations of data both individually and collectively when labeling or generating taxonomies. Kulesza et al. (Kulesza et al., 2014) note that maintaining consistency in annotation can be challenging even for experts. This motivated our exploration of improving self-consistency by incorporating past annotations.
Rubrics (Yuan et al., 2016) and training (such as via gated instructions (Liu et al., 2016)) have also been proposed as an effective way to unify understanding of the task requirements across workers. In fact, in practice, task designers oftentimes expend significant effort building detailed rubrics with complex training and gating processes. Prior work in this area has explored how to reduce such burdens on task designers through collaboratively creating rubrics with workers (Bragg et al., 2018). However, strict rubrics are often undesirable when the goal is to elicit human judgments on properties that are difficult to rigorously define such as those involving subjective interpretation.
A softer form of rubrics can be made by using in-domain ground truth (or “gold”) examples to anchor the interpretation of tasks including those involving scales. Gold solutions created by experts are often used during training (Doroudi et al., 2016) in lieu of or in addition to instructions and rubrics. Reference examples can also be provided during the task such as in the MUlti Stimulus test with Hidden Reference and Anchor (MUSHRA test) (Völker et al., 2018). However, existing methods depend on curated or synthesized fixed gold anchors ahead of labeling, which, in the case of scale anchors for subjective domains, still requires a concrete definition of the scale ahead of time. Fixed anchors are also limited in the support they can provide for self-consistency.
Finally, on challenging or high-stakes domains where correctness is important, deliberation has been proposed as a way of addressing and resolving disagreement directly. Deliberation processes can range from simple one-shot reconsideration prompts (Drapeau et al., 2016) to more complex multi-turn discussions (Chen et al., 2019; Schaekermann et al., 2018). Alternatively, lighter methods have been proposed that model behavior of humans to identify when disagreement is likely (Gurari and Grauman, 2017). However, these methods can still require significant worker effort.
2.4. Recognizing and Leveraging Uncertainty in ML
As it is impossible to eliminate uncertainty (Hullermeier and Waegeman, 2019), downstream tasks like machine learning have started to explore paths of utilizing uncertainty information (when it is available) during training and evaluation. Many machine learning models have been built to do tasks like classification (Bouveyron and Girard, 2009), ranking (Yan et al., 2007) or regression (Yan et al., 2008) using uncertain labels.
Recently, there has been growing interest in understanding and mitigating adversarial attacks on machine learning (Goodfellow et al., 2015). These attacks often trick models into making high confidence predictions that are incorrect, exploiting the limited ability of many models to accurately model their own confidence. One potential mitigation strategy has been to improve a model’s robustness by improving its ability to model uncertainty (Qin et al., 2020). Additionally, there is a push for models to more accurately understand when they should be unsure (Rajpurkar et al., 2018). These all motivate an increasing need to understand what humans find uncertain and to find calibrated ways for humans to convey their degree of uncertainty.
Absolute rating can suffer from inconsistent scale interpretations while comparative rating lacks global context. Our design for the Goldilocks annotation system takes a hybrid approach, with the specific goals of: (1) improving consistency (between annotators and over time within annotator), and (2) enabling intuitive indication of uncertainty with respect to the scale for each example being labeled.
In this section, we will describe the designs that address each of the goals above followed by additional aspects of operating the complete annotation workflow. At the end, we will discuss specific details of the design decisions we made for our implementation separate from the overall design of the Goldilocks annotation process.
3.1. Grounding with Prior Examples
We base the main interactions in Goldilocks around an absolute rating design. To mitigate the aforementioned drawbacks of absolute rating, Goldilocks uses prior examples in addition to abstract descriptions to ground the scale, making it possible to make pairwise comparisons while still using absolute rating interactions. Prior work has shown that human judgments measured explicitly with comparisons can be easier than direct labels for some tasks (Simpson and Gurevych, 2018; Zhang et al., 2017; Wah et al., 2014), and fixed reference anchors have been used in other procedures to provide a more concrete grounding of scales (Völker et al., 2018). Similar ideas that use comparisons against samples to contextualize abstract scales also exist in other fields like cognitive psychology (Stewart et al., 2006).
Goldilocks uses a set of previously-annotated examples to add two additional pieces of information to the absolute rating scale: global grounding and local comparisons, as shown in Figure 2. With global grounding, a small set of representative examples is selected and placed as anchors along the rating scale, similar to existing text-based anchors for levels in traditional absolute rating. Using concrete examples allows annotators to quickly understand and estimate where each item could fit on the scale. Since there can be many previously-annotated examples, we make sure to only visualize a smaller subset of examples (around 5 to 7, similar to typical numbers of Likert levels) that are maximally spread out along the scale. In practice, there are many ways to select these examples. The specific selection process we used is outlined in Section 3.4.
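As an illustration of choosing a small, well-spread subset of anchors, the following is a minimal sketch using greedy farthest-point selection. This heuristic is an assumption made for illustration and is not necessarily the selection process used in our implementation; the function name `select_spread_anchors` and the `(item_id, position)` representation are hypothetical.

```python
from typing import List, Tuple

Anchor = Tuple[str, float]  # (item_id, position on the scale); hypothetical representation


def select_spread_anchors(items: List[Anchor], k: int = 5) -> List[Anchor]:
    """Pick up to k annotated items whose positions are spread out along the scale.

    Greedy farthest-point heuristic (an assumption, not the paper's procedure):
    start from the two extremes, then repeatedly add the item that is farthest
    from its nearest already-chosen anchor.
    """
    if k <= 0:
        return []
    ordered = sorted(items, key=lambda it: it[1])
    if len(ordered) <= k:
        return ordered
    chosen = [ordered[0]]
    if k > 1:
        chosen.append(ordered[-1])
    while len(chosen) < k:
        remaining = [it for it in ordered if it not in chosen]
        # Distance of a candidate to its nearest chosen anchor.
        best = max(remaining, key=lambda it: min(abs(it[1] - c[1]) for c in chosen))
        chosen.append(best)
    return sorted(chosen, key=lambda it: it[1])
```

With `k` between 5 and 7 this mirrors the number of visible anchors described above; other strategies (for example, picking the items nearest to evenly spaced quantiles) would serve the same purpose.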
While global grounding is useful for making coarse placements, it alone is insufficient for narrowing down specific placement of items. To help the annotators find specific placements, Goldilocks also surfaces local comparisons by showing the immediate neighborhood above and below a position on the scale. As annotators scrub along a continuous scale, we show side-by-side comparisons between the current indicated position and the closest items above and below this position. Placements of these neighbors are also indicated on the scale itself, allowing for annotators to adjust proportional distance to each neighbor based on their evaluation of the item being placed. These designs together allow for a more consistent and concrete instantiation of the scale across multiple annotators.
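The neighbor lookup behind local comparisons can be sketched with a binary search over the sorted anchor positions. This is a minimal sketch; the function name `local_neighbors` and the choice to return only strictly lower and strictly higher neighbors are assumptions for illustration.

```python
import bisect
from typing import List, Optional, Tuple


def local_neighbors(
    sorted_positions: List[float], cursor: float
) -> Tuple[Optional[float], Optional[float]]:
    """Return the closest anchor strictly below and strictly above the cursor.

    sorted_positions must be in ascending order. Either neighbor is None when
    the cursor is beyond the lowest or highest anchor.
    """
    lo = bisect.bisect_left(sorted_positions, cursor)   # first index with value >= cursor
    hi = bisect.bisect_right(sorted_positions, cursor)  # first index with value > cursor
    below = sorted_positions[lo - 1] if lo > 0 else None
    above = sorted_positions[hi] if hi < len(sorted_positions) else None
    return below, above
```

An interface could call this each time the slider moves and display the two returned neighbors side by side with the item being placed.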
Finally, Goldilocks addresses local self-consistency by supporting dynamic augmentation of the anchor examples used to ground the rating scale: as annotators progress in an annotation session, their own annotations for earlier items are also incorporated into the set of references alongside any pre-seeded ones (Step 3 of Figure 1). These personal annotations will then also take part in both global grounding and local comparisons, making it possible to directly compare new items against past annotations produced in the same session.
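One way to picture this dynamic augmentation is a sorted anchor collection that starts with the pre-seeded examples and absorbs the annotator's own placements as the session proceeds. The class below is a hypothetical sketch; its name, methods, and the `"seed"`/`"own"` source tags are assumptions rather than part of the described system.

```python
import bisect
from typing import List, Tuple


class AnchorScale:
    """Keeps anchors sorted by position; hypothetical helper for illustration."""

    def __init__(self, seeded: List[Tuple[str, float]]) -> None:
        # Stored as (position, item_id, source) so tuples sort by position.
        self._anchors = sorted((pos, item_id, "seed") for item_id, pos in seeded)

    def add_own_annotation(self, item_id: str, position: float) -> None:
        """Insert the annotator's own placement so later items can be compared to it."""
        bisect.insort(self._anchors, (position, item_id, "own"))

    def anchors(self) -> List[Tuple[str, float, str]]:
        """Return (item_id, position, source) triples in ascending scale order."""
        return [(item_id, pos, source) for pos, item_id, source in self._anchors]
```

Because personal annotations live in the same sorted structure as the seeded ones, both global grounding and local comparisons can draw from them without special handling.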
One potential limitation for any annotation process involving examples is how to start the annotation when no past examples are available. Goldilocks accounts for this with a separate procedure to curate an initial seed set that is deployed when past examples do not exist. We will dive into more detail about the selection of this initial seed set of items to jumpstart annotation in Section 3.3. In the discussion section, we will also discuss avenues of addressing other challenges in example-based grounding such as scaling up annotation with iterative improvement and addressing density as the scale becomes populated with more annotated examples.
3.2. Two-Step Range Annotation
Not all items can be meaningfully distinguished from all other items by an annotator. Instead of forcing the breaking of ties, most designs for side-by-side comparisons allow annotators to indicate “indistinguishable” or “tied” pairs (Läubli et al., 2018)—however, there is no such elicitation process for traditional absolute rating designs. With Goldilocks, we propose a new process that allows annotators to indicate “indistinguishable” pairwise relationships on an absolute rating scale. To achieve this, we take inspiration from prior work (Dumitrache et al., 2018b), where annotators were asked to select all potentially relevant labels for an item instead of a single best label option. We extend this into the continuous scale domain by introducing the concept of eliciting “range” labels—where upper and lower bounds establish a subsection of the scale representing where an item could be placed. Our range-based approach is also reminiscent of methods like best-worst scaling (Kiritchenko and Mohammad, 2016) in comparative rating, which can efficiently capture pairwise relationships across many items.
Prior designs have explored alternatives to eliciting uncertainty for scalar annotations, such as in the form of a weighted distribution across surrounding anchor labels (Chung et al., 2019). However, estimating distributions in this way can be challenging for humans, as an annotator has little guidance on how to allocate weight to the anchoring labels they find reasonable. In Goldilocks, we can take advantage of the comparisons afforded by grounding examples to contextualize distributions intuitively. Specifically, we break down the process of eliciting ranges into two steps: finding the lower bound and then finding the upper bound (Steps 1 and 2 in Figure 1). In the first step, an annotator can utilize the past example anchors to quickly search for where to place the lower bound of an item, using comparisons to work up the scale and finding the position where they can no longer confidently decide that the closest reference should be lower on the scale than the annotated item. Similarly, in the second step, an annotator establishes the upper bound by working down the scale and stopping when they can no longer identify a reference item as higher than the annotated item.
Positions of anchor items on the scale are themselves internally represented by ranges. During each step, the anchors are visualized using the corresponding opposing bound: when finding the lower bound for an item, anchor items are placed on the scale according to their upper bound values and vice versa for the upper bound (shown in Figure 1). This two-step process allows an annotator to easily establish a range that is intuitive and meaningful—it represents the range where the annotator is no longer able to confidently distinguish items.
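The range representation and the opposing-bound display rule described above can be sketched as follows. The names `RangeLabel` and `anchor_display_position`, and the `"lower"`/`"upper"` step labels, are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeLabel:
    """A range annotation: where on the scale an item could plausibly be placed."""

    lower: float
    upper: float

    def __post_init__(self) -> None:
        if self.lower > self.upper:
            raise ValueError("lower bound must not exceed upper bound")

    @property
    def width(self) -> float:
        """Size of the range, one possible indicator of perceived ambiguity."""
        return self.upper - self.lower


def anchor_display_position(anchor: RangeLabel, step: str) -> float:
    """Position at which an anchor is drawn during a given annotation step.

    Anchors are shown at the opposing bound: their upper bound while the
    annotator searches for a lower bound, and their lower bound while the
    annotator searches for an upper bound.
    """
    if step == "lower":
        return anchor.upper
    if step == "upper":
        return anchor.lower
    raise ValueError("step must be 'lower' or 'upper'")
```

Drawing an anchor at its upper bound during the lower-bound step means that placing a new item's lower bound above that marker clears the anchor's entire range, which is what makes the resulting comparison a confident one.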
3.3. Cold Start Process
Annotation of any item in the Goldilocks process requires there to be previously-annotated items using the same scale in order to populate the grounding examples and comparisons. However, if prior annotations do not exist yet, they must be created in what we call the cold start process.
The cold start process (shown in Figure 3) consists of two steps: representative example selection and placement on a scale. In the example selection step, Goldilocks draws a certain number of un-annotated examples randomly from the set of data to be annotated. An annotator can then adjust this set by requesting to draw additional random examples or dropping existing examples. The goal is to adjust this set to be more representative such that there are at least a certain number of examples in the set (defined based on task) and that the examples are maximally different from each other. A similar sample and replace approach was used in Alloy (Chang et al., 2016) to bootstrap good seed sets for clustering. In the placement step, the annotator successively places all the examples onto an absolute rating scale by comparing them against each other, with the ability to adjust the position of any item on the scale. The scale can be blank at the outset or be initialized with text anchors as shown in Figure 3.
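The sample-and-replace loop of the example selection step can be sketched as below. This is a minimal sketch under assumptions: the function name `draw_examples`, the explicit `rejected` set (so that dropped examples are not drawn again), and the string item identifiers are all hypothetical.

```python
import random
from typing import List, Sequence, Set


def draw_examples(
    pool: Sequence[str],
    current: List[str],
    rejected: Set[str],
    n_new: int,
    rng: random.Random,
) -> List[str]:
    """Add up to n_new random un-annotated items to the current seed set.

    Items already in the set, or previously dropped by the annotator, are
    excluded from the draw (the latter is an assumption of this sketch).
    """
    candidates = [x for x in pool if x not in current and x not in rejected]
    picked = rng.sample(candidates, min(n_new, len(candidates)))
    return list(current) + picked


# Example session: draw an initial set, drop one example, then draw a replacement.
rng = random.Random(0)
pool = [f"item{i}" for i in range(10)]
seed_set = draw_examples(pool, [], set(), 4, rng)
dropped = seed_set[0]
seed_set = [x for x in seed_set if x != dropped]
seed_set = draw_examples(pool, seed_set, {dropped}, 1, rng)
```

The annotator's judgment of which examples to drop is what steers the set toward being representative; the code only handles the random draws.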
The cold start process can be completed with recruited annotators, where the resulting placements are aggregated across them to create the set of seed examples that become the first set of Goldilocks example anchors. Alternatively, the cold start process can be completed by the task designer or by domain experts, making it a way for requesters to specify a scale without having to design a set of training instances. In this case, the steps in the cold start process are used to assist the exploration of the dataset. Once additional items have been annotated using Goldilocks, the set of anchor examples can be augmented with this newly annotated data. If desired, the initial seed examples can be re-annotated by removing them from the scale and re-introducing them as new items to be placed in an iterative improvement fashion.
3.4. Implementation Details
As our experiments were conducted on the Amazon Mechanical Turk crowdsourcing platform, we also implemented a gated training (Liu et al., 2016) phase for each of the annotation experiments. This phase focuses on training the crowd workers to use the annotation interface rather than annotating a specific task domain, so we used a common training example based on age estimation across all domains. Workers are presented with a prompt describing the task and interface, including specific actions that can be performed using the interface. As workers complete each annotation step for the training task, we check their partial answers against the reference and provide just-in-time feedback if they make a mistake. Once the worker accurately completes the training task, they progress into the actual annotation task and are given the specific instructions for the domain they are annotating. We implemented some basic quality control measures to prevent gaming of the task, such as requiring workers to have interacted with the slider before they are allowed to proceed onto the next item.
In order to answer the research questions behind our Goldilocks designs, we conducted annotation experiments using data from 3 domains on the Amazon Mechanical Turk (AMT) platform and using interfaces that isolate specific aspects of Goldilocks for experimentation. Specifically, we tested the following hypotheses:
RQ1: Does grounding with examples improve consistency?
H1-a: Using example-based anchors reduces the amount of disagreement between annotators on ratings of items compared with using semantic text descriptions as anchors.
H1-b: Including an annotator’s own annotations from the session as additional anchors results in improved self-consistency reflected by less disagreement with their past placement when placing items again.
RQ2: Does the range-based process create robust output for understanding relationships between items?
H2-a: Range annotation captures item resolution and thus can more accurately model distributions of pairwise relationships (more than, less than, indistinguishable) compared to distributions produced by comparing single value annotation output.
H2-b: Resolution of items captured using range annotation is better for modeling pairwise relationships than resolution captured through inter-annotator (dis)agreement.
RQ3: Does the uncertainty about items captured through the size of the ranges correlate with uncertainty captured in the form of inter-annotator disagreement in traditional semantic scale absolute ratings?
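To make the notion of a pairwise relationship distribution in H2-a concrete, the sketch below derives a relationship for a pair of items from each annotator's ranges and tallies the results across annotators. The overlap rule (overlapping ranges count as indistinguishable) and the exact-equality rule for single values are assumptions made for illustration, as are all function names.

```python
from collections import Counter
from typing import Dict, List, Tuple

Range = Tuple[float, float]  # (lower bound, upper bound)
RELATIONS = ("less", "indistinguishable", "more")


def relation_from_ranges(a: Range, b: Range) -> str:
    """Relationship of item a to item b; overlapping ranges are indistinguishable."""
    if a[1] < b[0]:
        return "less"
    if a[0] > b[1]:
        return "more"
    return "indistinguishable"


def relation_from_values(a: float, b: float) -> str:
    """Single-value counterpart: only exact ties are indistinguishable."""
    if a < b:
        return "less"
    if a > b:
        return "more"
    return "indistinguishable"


def relation_distribution(relations: List[str]) -> Dict[str, float]:
    """Fraction of annotators reporting each relationship for one item pair."""
    counts = Counter(relations)
    total = len(relations)
    return {r: counts[r] / total for r in RELATIONS}


# Four annotators' ranges for items A and B (illustrative values only).
annotations = [
    ((0.1, 0.3), (0.5, 0.7)),
    ((0.2, 0.6), (0.5, 0.8)),
    ((0.1, 0.2), (0.4, 0.9)),
    ((0.3, 0.6), (0.2, 0.5)),
]
dist = relation_distribution([relation_from_ranges(a, b) for a, b in annotations])
```

Under the single-value rule a continuous slider almost never yields an exact tie, so the "indistinguishable" mass can only come from ranges; this contrast is what H2-a tests against directly collected pairwise judgments.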
4.1. Annotation Task Design
We describe in more detail the task design we used in our annotation experiments, including interfaces derived from Goldilocks and ones from traditional annotation. Unique crowd workers were recruited to use one of the following interfaces to provide annotations for a group of examples:
Single Value with Semantic Anchors (SV-SA): In each step, annotators are asked to find a slider position that represents the placement of one item in the annotation sequence using a semantic scale as reference (Figure 4 top).
Single Value with Example Anchors (SV-EA): In each step, annotators are asked to find a slider position that represents the placement of one item in the annotation sequence using a scale anchored by other example item instances (Figure 4 bottom). Depending on the experiment and condition, the annotator’s past placements in earlier steps may become additional anchors for steps in the future.
Pairwise: Annotators were asked to compare all pairs of items. For each step in the annotation sequence, an annotator was presented with 1 reference item and a list of items it has not been compared to yet. For each item, the annotator was asked to judge the relationship of that item compared with the reference item (more than, less than, or indistinguishable).
Range with Hybrid Anchors (R-HA): This represents the full proposed Goldilocks design. Annotators are given both semantic labels and example instances as reference anchors. For each item, an annotator is first asked to place a lower bound marker for the item followed by placing an upper bound marker. Ranges annotated in earlier steps are incorporated as additional anchors.
Our first study (4.5) examines whether example anchors (SV-EA) improve agreement between annotators compared to semantic anchors (SV-SA). Following that, our second study (4.6) examines whether including an annotator’s past placements improves within-annotator consistency when using the SV-EA annotation design. Finally, in our last study (4.7), we collect ground truth pairwise relationships directly using the Pairwise interface, and compare how well we can recover the distribution of these relationships using data from the traditional single-value semantic anchor approach (SV-SA) with that of the full Goldilocks range annotation design (R-HA).
In all cases, annotators were first given a brief gated “interface training” instructional stage where they are guided to annotate a single item (based on an age estimation domain) using the annotation interface they were assigned. Instructions are provided during the process to guide them through using the interface and feedback is given if the annotator makes a mistake in the annotation. Once an annotator completes the annotation process without mistake, they are given details about the actual task domain they are annotating. Each annotator is then prompted to annotate a sequence of items using the assigned condition’s interface.
4.2. Annotation Domains and Datasets
We selected the following 3 annotation domains in which to conduct annotation tasks: toxicity, satiety and age. These domains were selected to represent common types of rating tasks that have subjective aspects where a Goldilocks-style approach to annotation could be desirable. These tasks also span two different modalities, short text and image, which closely align with rating tasks commonly conducted.
For this task domain, annotators judge the degree of toxicity in a short online comment, estimating how strongly the author of the comment intended to offend. Research has demonstrated that human judgments of online toxicity vary considerably from rater to rater due to subjectivity of the task (Salminen et al., 2018). The toxicity domain represents a short text annotation task where annotators compare pieces of text that only consist of a couple of sentences. Similar tasks include judging fluency of text generation or judging text sentiment. To produce the annotation dataset for this domain, we sampled a 50:50 label-balanced subset of 100 comments from the Jigsaw comment toxicity classification challenge dataset (Wulczyn et al., 2017) behind the Perspective API (https://www.perspectiveapi.com), which contains Wikipedia comments and binary labels of toxicity. Only comments that had between 4 and 280 characters (after markup removal) were sampled. When presenting the task to crowd workers, we borrow Perspective API’s definition of a toxic comment: ‘a rude, disrespectful, or unreasonable comment that is likely to make you leave a discussion’. We also contrastively define healthy comments as those ‘relevant to the discussion’ and further note that comments ‘can express disagreement’.
For this task domain, annotators judge how filling (satiating) the food depicted in an image is, taking into account the type of food and the portion size. The satiety domain represents an annotation task that contains uncertainty in the visual modality. Prior research has shown that while pairwise comparisons of food for expected satiety can result in robust ratings, personal familiarity also resulted in biases (Brunstrom et al., 2008). We produced the annotation dataset by selecting a subset of food types from the Food-101 dataset (Bossard et al., 2014) and then sampling images for each selected food type up to a total of 80. One round of manual inspection was also done to verify that food was clearly discernible in all images.
For this task domain, annotators estimate the age of the subject depicted in a photo. The age domain is another annotation task in the visual modality that contains uncertainty; however, age is grounded to a concrete scale that we expect most people to be already familiar with. We produced the annotation dataset by sampling a subset of 100 portrait images from the FG-NET face dataset (Fu et al., 2014).
4.3. Anchors for each Domain
To maintain consistency across experiments, we defined a set of text-based semantic differential scale anchors and a set of example anchors for each domain that was held constant across experiments. For the semantic scale anchors, we used text descriptions similar to 7-point Likert or semantic differential scales. Example anchors consisted of 7 roughly evenly spaced in-domain items each associated with a position on the scale.
For the toxicity domain, we used the following text descriptions for semantic scale levels: “1 - Not Toxic at All”, “4 - Somewhat Toxic” and “7 - Extremely Toxic”. Other levels (2, 3, 5, 6) on the scale were presented as a number without any associated description. The 7 example anchors were manually picked from a set of annotated examples produced from a pilot run of the cold start process with crowd annotators.
For the satiety domain, we used the following text descriptions for semantic scale levels: “1 - Very Hungry”, “2 - Somewhat Hungry”, “3 - Almost Satisfied”, “4 - Satisfied”, “5 - Full”, “6 - Very Full”, and “7 - Can’t Finish”. The 7 example anchors were produced by the authors producing gold annotations directly using the cold start process interface.
For the age domain, we used text scale levels based on numeric age values ranging from “0” to “60+” incrementing in steps of 10. The 7 example anchors were picked by finding all images corresponding to each semantic age level and then drawing a random one at each level and assigning its value to be the ground truth age.
4.4. Crowd Annotator Recruitment and Compensation
We recruited annotators for our experiments from the Amazon Mechanical Turk crowdsourcing platform from the United States with the qualification of approval rate no lower than 90% and over 1000 approved HITs completed in the past. Across all studies, annotators were only allowed to participate in annotation if they had both not used the corresponding interface and not annotated the domain before. Overall, we recruited 655 unique workers across all 3 studies with an additional 44 unique workers who only participated in the pairwise annotation used to establish the ground truth for Study 3. For all annotation tasks, we set a base pay of $0.10 which was given if the worker completed the training phase. Remaining compensation was distributed in the form of a bonus based on the interface being used and the number of items annotated.
Participants assigned to the Single Value tasks (both with Semantic and Example anchors) were given a per-item bonus of $0.03 (for annotating a group of 10 or 20 items). Participants assigned to the Range tasks were given a per-item bonus of $0.05 (a total of 10 items). Participants assigned to the Pairwise annotation tasks were given a per-comparison bonus of $0.01 (a total of 45 comparisons). We set pay based on our estimate of time needed taken from pilot studies and used completion bonuses to correct for any discrepancies. Based on condition, a final completion bonus of $1.00, $0.50, or $1.00 for each of the previously mentioned interfaces respectively was provided. We distributed the final bonus in 2 batches as the initial completion bonus values we set for the tasks resulted in a measured hourly pay that was lower than desired. The final hourly rate measured between $9.70 and $10.90 across the various domains and interfaces when assuming the median work time for each interface.
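The pay structure above reduces to simple arithmetic. As an illustrative sketch (the function names and the 504-second work time used in the note below are hypothetical, not values from the study), the total payout per worker for each interface can be tallied in cents:

```python
# Amounts are in US cents to avoid floating-point rounding.
BASE_PAY = 10  # paid once the training phase is completed

# interface -> (per-unit bonus, final completion bonus), as stated in the text
BONUSES = {
    "single_value": (3, 100),  # bonus per item
    "range": (5, 50),          # bonus per item
    "pairwise": (1, 100),      # bonus per comparison
}


def total_pay_cents(interface: str, units: int) -> int:
    """Base pay plus per-unit bonuses plus the completion bonus."""
    per_unit, completion = BONUSES[interface]
    return BASE_PAY + per_unit * units + completion


def hourly_rate_dollars(pay_cents: int, work_seconds: float) -> float:
    """Convert a payout and a work time into an hourly rate."""
    return (pay_cents / 100) * 3600 / work_seconds


print(total_pay_cents("single_value", 10))  # 140
print(total_pay_cents("range", 10))         # 110
print(total_pay_cents("pairwise", 45))      # 155
```

Under these figures, a Single Value worker annotating 10 items would receive $1.40 in total; if such a task took a hypothetical 504 seconds, `hourly_rate_dollars(140, 504)` would come to about $10 per hour.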
Manual quality checks were conducted on cases with a large number of similarly annotated values across different items (e.g., consistently placing at 0 or 1) as well as abnormally short work time, resulting in removal of 5 workers (and re-collection of corresponding annotations) across all experiments. Removed workers were included in the counts of recruited workers above. Within the removed workers, those intentionally spamming across their entire sequence of annotations (choosing the exact same placement for all items) only received the base pay for the task.
4.5. Study 1: Evaluating Consistency Between Annotators
We first explore whether example-based grounding presented in Goldilocks can improve consistency between different annotators (H1-a). For this experiment, we assigned each annotator to one of two conditions: semantic, where they were given 7-point text-based semantic anchors and presented with the SV-SA interface; or example, where they are given 7 example instances placed onto the scale using the SV-EA interface. For each domain, the anchors used are detailed in 4.3. We drew example anchor instances for the toxicity and satiety domains from past pilots of semantic differential scale annotation on a disjoint set of items, using average rating to establish their initial placement. For the age domain, example instances were selected from a separate set of images drawn from the same dataset using the included ground truth age labels for initial placement.
After the training, each annotator was tasked with annotating a sequence of 10 items using the interface of the condition they were assigned. To create sequences, each domain’s dataset was shuffled once and then partitioned into equal-sized disjoint sets. Each sequence for each domain was annotated by 10 workers in each of the two conditions. Annotators’ placements of items on the scale were mapped to continuous numeric values within the range [0, 1]. For the toxicity domain, the first and last items in each sequence were set to the same item to pilot measurement of within-annotator consistency, so only the 8 remaining annotations were used for analysis in this experiment.
|Domain||Condition||Avg. Disagreement||Significance||Scale Util. (Min, Max)|
|toxicity||semantic||0.07348||P < 0.001 (Very Significant)||0.773 (0.103, 0.876)|
|toxicity||example||0.06379||P < 0.001 (Very Significant)||0.794 (0.104, 0.899)|
|satiety||semantic||0.06373||P < 0.005 (Significant)||0.603 (0.230, 0.833)|
|satiety||example||0.05548||P < 0.005 (Significant)||0.635 (0.166, 0.801)|
|age||semantic||0.02765||P < 0.001 (Very Significant)||0.696 (0.054, 0.751)|
|age||example||0.04443||P < 0.001 (Very Significant)||0.593 (0.072, 0.665)|
Table 1. Average disagreement is calculated as the standard error (over 10 annotators) for each instance, averaged across all annotated instances. Significance testing was done as a paired t-test on disagreement across conditions. We also examine how much of the 0-1 scale is being used by annotators on average in each condition by averaging each annotator’s minimum and maximum rating values.
To evaluate the amount of consistency between annotators for each annotated data point, we computed the standard error across annotators as a proxy for the amount of disagreement. We note that the standard error values are comparable across conditions as the range of values on the scale and the number of annotators were fixed between all conditions. We also evaluated the significance of any difference by conducting a two-tailed paired t-test on the standard error of each annotated item across each pair of conditions (semantic versus example) in each domain. A summary of the results is shown in Table 1.
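The disagreement and scale-utilization measures described here can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the analysis code used in the study: the function names are hypothetical, the sample standard deviation (n - 1 denominator) is assumed, and only the paired t statistic is computed (obtaining the two-tailed p-value additionally requires the t distribution with n - 1 degrees of freedom, for instance via a routine such as `scipy.stats.ttest_rel`).

```python
import math
import statistics


def standard_error(values):
    """Standard error of the mean for one item's ratings across annotators."""
    return statistics.stdev(values) / math.sqrt(len(values))


def average_disagreement(ratings_by_item):
    """Mean of per-item standard errors; each entry holds one item's ratings."""
    return statistics.mean(standard_error(r) for r in ratings_by_item)


def paired_t_statistic(errors_a, errors_b):
    """t statistic for per-item standard errors paired across two conditions."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


def scale_utilization(ratings_by_annotator):
    """Average each annotator's minimum and maximum rating on the [0, 1] scale."""
    avg_min = statistics.mean(min(r) for r in ratings_by_annotator)
    avg_max = statistics.mean(max(r) for r in ratings_by_annotator)
    return avg_max - avg_min, avg_min, avg_max
```

Because every item is rated by the same number of annotators on the same [0, 1] scale in both conditions, the per-item standard errors can be paired item by item before testing.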
We observed a statistically significant decrease in value disagreement across annotators for the toxicity and satiety domains, providing support for hypothesis H1-a. However, we observed a statistically significant increase in disagreement across annotators for the age domain, which contradicts H1-a. We then plotted the disagreement (standard error) in both conditions for each item against the mean value across both conditions in each domain to understand the behavioral differences we see with the age domain as shown in Figure 5.
We find that the pattern for disagreement in the semantic condition is consistent with behavior observed in prior work (Welty et al., 2019) for similar domains with subjectivity and uncertainty. However, we note that overall disagreement between annotators was lower in the age domain compared to the other two domains. We also noted that scale utilization was similar in both conditions for the toxicity and satiety domains, exhibiting a slight increase in utilization of the full scale in the example condition. Prior work in psychology has shown that increased spacing of items has relatively minimal effect on accurate placement when items are discriminable (Stewart et al., 2005) so we don’t expect this slight increase in scale utilization to affect disagreement levels. However, opposite to the other domains, the utilization of the scale in the age domain was 10% lower for the example condition. We hypothesize that unlike the toxicity and satiety domains, estimating age from appearance is a domain where a numeric age scale is actually more consistently understood by human annotators, thus example anchors provide no further benefit to annotators in understanding the scale. The scatter plots in Figure 5(c) indicate that uncertainty for older subjects was much higher in the example condition. Combined with the lower scale utilization we observed for example, we hypothesize that uncertainty about judging exact age is higher for older subjects. As we only show example-based anchors in the example condition, this increased uncertainty about the reference images depicting older subjects may have resulted in more hesitation to use the higher values on the scale.
This suggests that: (1) comparisons with anchor examples mostly benefit cases where shared understanding of the scale is low, and (2) example-based anchoring should be used in addition to semantic anchors as only using example anchors can be detrimental to consistency if the domain is one where the semantic scale has a high degree of shared understanding already. Drawing from this experiment, our full Goldilocks annotation process uses both example-based anchors and semantic anchors to frame the scale.
4.6. Study 2: Evaluating Consistency Over Time Within Annotator
For our second experiment, we explored the effect on self-consistency resulting from including an annotator’s own past annotations as additional reference examples augmenting an initial seed set (H1-b). The example-based SV-EA interface was used for this experiment, with each annotator assigned one of two conditions: control, where only the seed set examples were used as reference anchors; or augment, where an annotator’s own past annotations in the same session were included alongside the seed examples as references. Since we are interested mainly in the effect on self-consistency, we reduced the initial set of seed examples to just 3 examples for each domain, drawn as a subset of the 7 example instances used in the example condition of the previous experiment. We took the items corresponding to the lowest, highest, and median ratings.
The items in each domain were shuffled and then partitioned into sequences of size 20, resulting in 5 sequences for the toxicity and age domains and 4 sequences for the satiety domain. Each annotator was given interface training and then subsequently tasked with annotating one of the sequences (of 20 items). To probe for changes in the rating of an item, we replaced the 10th and 20th items in each sequence above with repeats of the first item, which we will refer to as the probe item. When the probe item is annotated in the augment condition, the annotator’s own past annotation for the probe item is withheld from the set of reference items. We measure two quantities: the size of the value change between the first and second annotation attempts of the probe item, and the size of the value change between the second and third annotation attempts of the probe item.
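The sequence construction described above can be sketched as follows. This is a minimal illustration, assuming the dataset size is an exact multiple of the sequence length (as with 100 or 80 items and sequences of 20); the function name and the fixed random seed are hypothetical.

```python
import random


def make_probe_sequences(items, seq_len=20, seed=0):
    """Shuffle items, partition into sequences, and insert probe repeats.

    The 10th and last positions of each sequence are overwritten with the
    sequence's first item (the probe), so the two displaced items are not
    annotated in that sequence.
    """
    items = list(items)
    if len(items) % seq_len != 0:
        raise ValueError("dataset size must be a multiple of seq_len")
    rng = random.Random(seed)
    rng.shuffle(items)
    sequences = [items[i:i + seq_len] for i in range(0, len(items), seq_len)]
    for seq in sequences:
        seq[9] = seq[0]            # 10th item becomes the probe
        seq[seq_len - 1] = seq[0]  # final item becomes the probe
    return sequences
```

With 100 items this yields 5 sequences and with 80 items it yields 4, matching the counts reported for the three domains.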
From Table 2 we can see that for most domain-condition pairs, the absolute amount of an annotator’s disagreement with their past rating tends to exhibit a natural decrease as they get familiarized with the scale. Since the magnitude of initial self-disagreement for the probe item varies for each annotator, comparing absolute change in self-disagreement can be misleading, as the same proportional change in self-disagreement will show up as a larger absolute change when the initial self-disagreement is larger. To account for these factors, we instead look to the self-disagreement ratio (the second value change divided by the first) as a measurement for the proportional decrease (or increase) in self-disagreement. Ratios below 1 indicate that self-disagreement has decreased while those above 1 indicate an increase. In Figure 6, we show a histogram of this ratio on a log-scale for each condition in this study.
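As a hedged illustration of why a ratio is preferred over an absolute difference, the sketch below assumes the ratio is the second value change divided by the first, which is consistent with ratios below 1 indicating a decrease. The names `delta_1`, `delta_2`, and the function itself are hypothetical.

```python
def self_disagreement_ratio(first, second, third):
    """Ratio of the later self-disagreement to the earlier one.

    first, second, third are the three placements of the probe item on the
    [0, 1] scale. Returns None when the first change is zero, since the
    ratio is then undefined.
    """
    delta_1 = abs(second - first)  # change between attempts 1 and 2
    delta_2 = abs(third - second)  # change between attempts 2 and 3
    if delta_1 == 0:
        return None
    return delta_2 / delta_1


# Two hypothetical annotators with the same proportional improvement
# but very different absolute improvements (0.3 versus 0.075).
print(self_disagreement_ratio(0.2, 0.6, 0.5))
print(self_disagreement_ratio(0.40, 0.50, 0.475))
```

Both calls give a ratio of roughly 0.25, even though the first annotator's absolute self-disagreement fell four times as much as the second's.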
Our first step is understanding whether self-consistency improves over time simply from doing the task and being exposed to more examples. We conducted a sign test for each of the task domains and found that in the toxicity domain, self-consistency does improve over time (P < 0.005) for both control and augment conditions. Self-consistency was not found to have a significant across-the-board improvement in any of the other domains. Comparing across the two conditions, we did not measure a significant effect on the self-disagreement ratio in any of the 3 domains.
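A sign test of this kind can be sketched with an exact binomial computation. This is an illustrative version under stated assumptions: the text does not say whether the test was one- or two-sided or how ties were handled, so the sketch below assumes a two-sided test that discards sessions whose ratio is exactly 1. The function names are hypothetical.

```python
from math import comb


def sign_test_p_value(n_decrease, n_increase):
    """Exact two-sided sign test under the null of equally likely signs."""
    n = n_decrease + n_increase
    if n == 0:
        return 1.0
    k = min(n_decrease, n_increase)
    # Probability of observing k or fewer of the rarer sign out of n.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def sign_test_from_ratios(ratios):
    """Count sessions whose self-disagreement ratio fell below or rose above 1."""
    decreases = sum(1 for r in ratios if r < 1)
    increases = sum(1 for r in ratios if r > 1)
    return sign_test_p_value(decreases, increases)
```

For example, 8 decreases against 2 increases gives p = 0.109375 under these assumptions, which would not count as significant at the usual thresholds.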
|Domain||Condition||Avg. change (1st to 2nd attempt)||Avg. change (2nd to 3rd attempt)||Top Avg. change (1st to 2nd attempt)||Top Avg. change (2nd to 3rd attempt)|
|toxicity||No Self (control)||0.105||0.062||0.244||0.077|
|toxicity||With Self (augment)||0.133||0.090||0.347||0.095|
|satiety||No Self (control)||0.140||0.086||0.308||0.171|
|satiety||With Self (augment)||0.126||0.052||0.286||0.027|
|age||No Self (control)||0.110||0.063||0.280||0.133|
|age||With Self (augment)||0.063||0.066||0.157||0.067|
We then hypothesized that effect on self-consistency may not be uniform across all probe items—if an annotator already has low self-disagreement in the first re-annotation round (), it likely implies there is little uncertainty about the placement of the item and thus we shouldn’t expect further improvements. Considering this, we now look at only the top 30% ‘most uncertain’ annotation sessions for each domain and condition combination, as sorted by decreasing . This set consists of 15 sessions for the toxicity and age domains and 12 for the satiety domain. In this high-disagreement subset of sessions, we find that augmenting reference examples (augment) with past annotations in the session does result in a larger proportional reduction in self-disagreement (reflected through self-disagreement ratios) when compared to control for both the toxicity and satiety domains. For the satiety domain, median proportional decrease in self-disagreement was ( reduction in self disagreement) for the augment condition compared to ( reduction) for the control. The median ratios were ( reduction) and ( reduction) respectively for the toxicity domain. However, the limited amount of data points in these groups means we do not have statistical power to claim significance. Overall, we don’t find sufficient support for H1-b, but we note a pattern of improvement in self-consistency for items with high initial self-disagreement when including an annotator’s own prior annotations as additional references. Similar to the previous section, we were unable to observe benefit of augmenting reference examples on the age domain, likely due to the already limited utility of reference examples in this domain.
4.7. Study 3: Evaluating Range Annotation
For the final experiment, we explored how robustly ranges produced by the two-step annotation process in Goldilocks reflect properties of relationships between items. In this experiment, annotators were asked to annotate a sequence of items using the full Goldilocks two-step annotation process (using the R-HA design shown in Figure 1). The annotation experiments were conducted on the toxicity and satiety domains with sequences generated by shuffling each dataset and partitioning the dataset into groups of size 10, resulting in 10 and 8 groups respectively for the two domains. We then recruited 5 annotators to annotate each sequence in each of the domains.
At the start of the task, each annotator was first trained on how to use the two-step annotation system described earlier in Section 3.2 by annotating a sample task with guidance given during each step. After the annotator completes the training example item, they then proceed to annotate the assigned sequence of 10 task items. To seed the initial reference examples, we used the same reference anchors as used in the first experiment. We also included each annotator’s own annotations as anchors during their annotation in a similar way as the augment condition in the second experiment.
4.7.1. Establishing Pairwise Relationship Distributions
In order to measure ground truth distributions over the pairwise relationships, we recruited separate workers and used the Pairwise design to directly collect pairwise judgments on relationships (less than, indistinguishable, or greater than) between all pairs of items in each group. Distributions across the 3 relationship types were then created by counting the proportion of annotators indicating each type of relationship for each pair of items. These distributions reflect the degree of disagreement among annotators for the pairwise relationship.
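The counting step can be sketched as below. The relationship labels `"less"`, `"indistinguishable"`, and `"greater"` are hypothetical stand-ins for the three relationship types, and the function names are illustrative only.

```python
from collections import Counter
from itertools import combinations

RELATIONS = ("less", "indistinguishable", "greater")  # hypothetical labels


def all_pairs(items):
    """Every unordered pair of items in a group."""
    return list(combinations(items, 2))


def relationship_distribution(judgments):
    """Proportion of annotators choosing each relationship for one pair."""
    unknown = set(judgments) - set(RELATIONS)
    if unknown:
        raise ValueError(f"unknown relationship labels: {sorted(unknown)}")
    counts = Counter(judgments)
    return {r: counts[r] / len(judgments) for r in RELATIONS}


print(len(all_pairs(range(10))))  # 45
print(relationship_distribution(["less", "less", "indistinguishable", "greater"]))
```

A group of 10 items produces 45 pairs, which is the number of comparisons each Pairwise annotator was paid for.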
We then considered how one would recover similar distributions across relationships for pairs of items using the traditional approach of single-value absolute rating scales based on semantic anchors, creating two alternative baselines. Since the traditional approach cannot simultaneously elicit item ambiguity and agreement, producing a similar distribution would involve a tradeoff.
For the Direct baseline, we assume that there is no item-level ambiguity, meaning that even local pairwise comparisons can be made by directly comparing the raw values from the absolute rating. For example, we count an annotator as indicating that one item is less than another if their single rating score for the first item is lower than their score for the second. One can generally expect this to be reliable when the two items are far apart on the scale, but it can be much less reliable for close neighbors.
For the Infer baseline, we assume that all disagreement observed between annotators reflects the ambiguity of the item. In this case, we aggregate the individual ratings into a single 95% confidence interval for each item by measuring the mean and standard error between these samples. We then infer the relationship between a pair of items by comparing the confidence intervals, treating overlapping intervals as indicating a relationship of ‘indistinguishable’. In this case, the distribution across relationships for a pair would see all the probability mass allocated to the single relationship measurement produced by the comparison.
Finally, with Goldilocks annotation, we have range evaluations on a per-annotator granularity. For each annotator, we can use their range labels to find the relationship between two items, treating overlapping ranges as indicating an indistinguishable relationship. We can then produce a distribution by counting the proportion of annotators indicating each relationship. With Goldilocks we don’t need to make tradeoffs between measuring item ambiguity and agreement.
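The three ways of producing a relationship distribution for a pair of items can be compared side by side in a short sketch. Several details here are assumptions rather than the study's stated procedure: the label strings are hypothetical, the 95% interval uses a normal-approximation multiplier of 1.96, exactly equal single values and intervals that merely touch are both treated as indistinguishable, and ratings are paired by annotator position in the input lists.

```python
import math
import statistics
from collections import Counter

RELATIONS = ("less", "indistinguishable", "greater")  # hypothetical labels


def _distribution(labels):
    counts = Counter(labels)
    return {r: counts[r] / len(labels) for r in RELATIONS}


def relation_from_values(a, b):
    """Direct comparison of two single-value ratings."""
    if a < b:
        return "less"
    if a > b:
        return "greater"
    return "indistinguishable"


def relation_from_intervals(a, b):
    """Compare two (low, high) intervals; any overlap is indistinguishable."""
    if a[1] < b[0]:
        return "less"
    if a[0] > b[1]:
        return "greater"
    return "indistinguishable"


def confidence_interval(ratings, z=1.96):
    """Mean plus or minus z standard errors (normal approximation assumed)."""
    center = statistics.mean(ratings)
    half = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return (center - half, center + half)


def direct_distribution(scores_a, scores_b):
    """Direct baseline: one vote per annotator from raw single values."""
    return _distribution([relation_from_values(a, b) for a, b in zip(scores_a, scores_b)])


def infer_distribution(scores_a, scores_b):
    """Infer baseline: all mass on the relation implied by the two intervals."""
    relation = relation_from_intervals(confidence_interval(scores_a), confidence_interval(scores_b))
    return _distribution([relation])


def range_distribution(ranges_a, ranges_b):
    """Range method: one vote per annotator from their own (low, high) ranges."""
    return _distribution([relation_from_intervals(a, b) for a, b in zip(ranges_a, ranges_b)])
```

With hypothetical ratings `[0.1, 0.5, 0.3]` and `[0.2, 0.6, 0.4]`, the Direct baseline puts all mass on "less" while the Infer baseline puts all mass on "indistinguishable", which illustrates the tradeoff the two baselines make.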
4.7.2. Results: Recovering relationships between items
|Domain||Avg. WD (Range)||Avg. WD (Direct)||Avg. WD (Infer)|
To compare and quantify how robustly each of these methods recovers relationships, we measured the Wasserstein distance between relationship distributions for each of the 3 approaches in 4.7.1 and the ground truth relationship distributions collected through pairwise comparative rating. Table 3 shows that among the 3 methods to produce distributions over pairwise relationships, recovering distributions using range labels most accurately agrees with the ground truth distribution, supporting H2-a.
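The text does not spell out how the three relationship categories were embedded for the Wasserstein distance, so the following is only a sketch under an explicit assumption: the categories are ordered less, indistinguishable, greater and placed one unit apart. Under that assumption the one-dimensional distance is the sum of absolute differences between the cumulative distributions. The label strings and function names are hypothetical.

```python
from itertools import accumulate

ORDER = ("less", "indistinguishable", "greater")  # assumed ordering, unit spacing


def wasserstein_distance(p, q):
    """1-D Wasserstein distance between two distributions over ORDER."""
    cdf_p = list(accumulate(p[r] for r in ORDER))
    cdf_q = list(accumulate(q[r] for r in ORDER))
    # The final CDF entries are both 1, so they contribute nothing.
    return sum(abs(a - b) for a, b in zip(cdf_p[:-1], cdf_q[:-1]))


def average_distance(recovered, ground_truth):
    """Mean distance over pairs; both arguments map a pair key to a distribution."""
    distances = [wasserstein_distance(recovered[k], ground_truth[k]) for k in ground_truth]
    return sum(distances) / len(distances)
```

Under this embedding, moving all probability mass from "less" to "greater" costs 2, while moving it to "indistinguishable" costs 1. A library routine such as `scipy.stats.wasserstein_distance`, given support positions and weights, should compute the same quantity.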
We found that using inter-annotator agreement to infer the inherent ambiguity (referred to by prior works as “resolution”) of items results in an over-estimate of the amount of ambiguity. In the toxicity domain, 43.5% of the relationships that were distinguished in the ground truth distribution collected directly through pairwise comparisons were inferred to be “indistinguishable”, with this ratio as high as 68.1% in the satiety domain. In contrast, ranges over-estimate ambiguity (under-estimating resolution) only about half as often, with 22.1% and 30.9% respectively. This supports the idea that ranges are a better model of resolution (H2-b).
4.7.3. Results: Comparing aggregation uncertainty with range sizes
Finally, we explored how the type of uncertainty measured through the sizes of Goldilocks annotation ranges differs from the uncertainty measured by confidence intervals of annotations using semantic scales. We hypothesize that since ranges focus on capturing resolution (distinguishability against peers) of items, the resulting uncertainty represented by the size of ranges will be different from the uncertainty represented by inter-annotator disagreement metrics, though the two may still be related.
First we look at the behavior of the two kinds of uncertainty measurements across the range of values on the scale. Figure 7 plots the two kinds of uncertainty: average size of ranges and 95% confidence intervals for semantic scale annotation values. We find that overall, range sizes represent uncertainty lower than that measured by 95% confidence intervals from aggregating semantic scale annotation (P < 0.001). This makes intuitive sense as we would expect item-level resolution to be a tighter uncertainty. We also find that in the toxicity domain, both types of uncertainty behave similarly, with extreme values on the scale corresponding to lower values of uncertainty under both definitions. In the satiety domain, however, we found that lower values (corresponding to food depictions that are less satiating) corresponded to larger uncertainty in the form of disagreement but not in range sizes. We think this may result from higher disagreement among different annotators about what foods are not satiating, with each annotator nonetheless confident about their own determination of satiety (high resolution/distinguishability of items).
Looking at correlation between the values produced by the two types of uncertainty, we observe only very weak correlation between range sizes and confidence intervals (scaled standard error) for both toxicity and satiety domains with in both domains. This indicates that the uncertainty we measure with ranges does not have significant correlation with inter-annotator disagreement measures like standard error (RQ3). We note that with range annotations, inter-annotator disagreement measures can be further computed for the range bounds themselves to evaluate disagreement separately from item uncertainty (resolution) captured by ranges. However, as single-value semantic scalar annotations can’t facilitate separation of the two uncertainty types, we are unable to make direct comparisons.
In the prior sections, we demonstrated that the ideas of grounding absolute rating scales with examples and explicitly capturing item-level measurement resolution can be beneficial for more consistent and robust annotation of subjective domains lacking shared understanding of absolute rating scales. In this section, we will discuss some of the other considerations in adapting Goldilocks as a full annotation technique, including examining the annotation efficiency (in terms of work time) of Goldilocks compared to hybrid application of traditional methods and envisioning how Goldilocks may be scaled up to multiple annotation sessions using iterative-improvement processes. We will also discuss limitations of the Goldilocks process and potential avenues for future work.
5.1. Annotation Efficiency and Cost of Range Annotations
One of the main advantages of the Goldilocks annotation process is the ability to capture item-level ambiguity and disagreement between annotators simultaneously through the use of range annotations. However, separating these sources of uncertainty comes at an extra cost for the data collection process: even though range bounds in Goldilocks can be collected with low overhead compared to traditional absolute rating, the tasks can be more work for the requester to set up. This presents a tradeoff for practitioners when deciding whether the higher quality of data is worth the cost. Prior work simulating data annotation tasks inspired by measuring objective properties has shown that, given a fixed budget, some learning algorithms actually benefit more from a larger amount of lower-quality annotations on novel examples rather than higher-quality annotations on fewer items (Lin et al., 2014). Indeed, for these tasks where disagreement is likely caused by noisy perception, it’s likely that a practitioner will see relatively little benefit by separating item-level ambiguity from annotator disagreement. However, with the rising demand for training data in domains that involve subjectivity or nuance, understanding and accounting for sources of uncertainty and limitations within the data itself has become increasingly important for building models that are trustworthy rather than just more performant (Bhatt et al., 2020). Separating disagreement from inherent ambiguity using range-based annotation can also offer better transparency about the annotation process and data produced, allowing for the potential to diagnose model limitations and human biases even into the future. In these cases, the higher cost of setting up Goldilocks annotations can be justified by the richer information that can be derived from range-based rating data.
Of course, Goldilocks is not the only approach to capture both item-level ambiguity and disagreement. It is possible to use traditional absolute and comparative rating to separately collect scalar annotations and pairwise comparisons to recreate absolute rating estimates and pairwise relationship distributions. We also wanted to understand whether Goldilocks can provide efficiency benefits when compared to hypothetical hybrid approaches using only a combination of traditional annotation interfaces. We look at the work time taken by crowd workers in our various experiments to extrapolate the effort necessary for such an approach. Assuming a task group size of 10 items, we find that the Goldilocks two-step workflow results in a median work time (including both training and annotation) of s per worker per task group on the satiety domain and s per worker per task group on the toxicity domain. Collecting only single value rating annotations with Likert-style anchors takes a median work time of s per worker per task group on the satiety domain and s per worker per task group in the toxicity domain. Finally, comparative rating on a group of size 10 implies 45 pairwise comparisons to capture full pairwise relationships, which takes a median time of s per worker per group and s per worker per group for the two domains respectively. Thus we expect that at the same level of redundancy for annotations, Goldilocks can be 20-48% more efficient through the use of our two-step range-based annotation that collects ratings and relationship distributions together. Consistency improvements of Goldilocks may be able to push efficiency further in practice by requiring a lower amount of redundancy to achieve the same level of agreement.
5.2. Goldilocks and Iterative Improvement
So far in this paper we have examined the ideas presented in Goldilocks only for single annotation sessions where we didn’t need to update the anchor examples beyond incorporating an annotator’s own ratings. In order to scale up to larger datasets, it becomes necessary to perform annotations over multiple sessions which involve using aggregation approaches to iteratively construct an updated set of anchors. To achieve this we envision a process based on the idea of iterative improvement (Little et al., 2010).
In each round of iteration, a group of annotators individually annotate a subset of the dataset, sharing a ‘seed’ set of anchor examples used to ground the interpretation of the scale, with their own annotations also incorporated as they progress along the annotation session. Once all annotators have completed the session, the annotations collected will be aggregated into a new set of seed examples used to ground the next round of iteration. In addition to progressively annotating new examples, this iterative process may also be used to revise past annotations, such as those created during the cold start process. This can be accomplished by first removing the items to be revised from the set of grounding examples and then re-annotating them as new items in an iteration. This process of periodically aggregating annotations and then re-seeding anchor examples can serve as a method to scale up annotations while ensuring a stable scale as annotators place items.
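One round of this envisioned process might be organized as in the sketch below. It is purely illustrative: the aggregation rule is not specified in the text, so averaging the lower and upper bounds is an assumption, as are the function names, the choice to carry earlier seeds forward, and the representation of each anchor as a (low, high) pair.

```python
from statistics import mean


def seeds_for_round(seeds, revise=()):
    """Anchors shown to annotators: current seeds minus items being revised."""
    return {item: bounds for item, bounds in seeds.items() if item not in revise}


def aggregate(annotations):
    """Average each item's lower and upper bounds across annotators (assumed rule)."""
    return {
        item: (mean(low for low, _ in ranges), mean(high for _, high in ranges))
        for item, ranges in annotations.items()
    }


def finish_round(shown_seeds, annotations):
    """Seed set for the next round: shown seeds plus newly aggregated items."""
    updated = dict(shown_seeds)
    updated.update(aggregate(annotations))
    return updated
```

To revise an earlier placement, the item is first withheld via `seeds_for_round(seeds, revise={...})`, re-annotated as if it were new, and then merged back in by `finish_round`.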
We believe that this represents a feasible design for scaling up annotation, and we envision further work can be done to explore options for aggregation and re-annotation strategies as well as evaluate their effectiveness. We also see potential for using iterative improvement as a way to dynamically re-calibrate the definition of scales to account for distributional shifts over time. For example, a scale that can dynamically adapt to the improving quality of machine summarization systems can serve as a living benchmark. We think the ideas presented in Goldilocks for single annotation sessions provide a first step toward building an effective iterative workflow.
5.3. Limitations and Future Work
While Goldilocks provides a path to more consistent scalar annotation that also captures uncertainty, the current design is still subject to some limitations, which we believe are good avenues for future work.
5.3.1. Creating High Quality Seeds in Cold Start
The cold start process in Goldilocks provides a way to generate the initial seed set of grounding examples that enable the comparisons and consistency benefits of Goldilocks. However, the quality of this initial set of seed examples can also influence whether consistency benefits can be realized. We observed some of these limitations when experimenting with example-based anchors in the age domain. A good seed set should consist of examples that both achieve good coverage of the scale and have low ambiguity themselves. When the seed set achieves good coverage over the scale, the comparative process allows seed examples that are clearly distinguishable from the annotated item to be quickly excluded from its range, resulting in a measurement resolution that mainly depends on the number of examples in the seed set. However, a set of examples that is not representative of the full range of items to be ranked can lead to scale drift when items that annotators may wish to rate higher or lower than the current implied bounds of the scale are encountered in the future. The current cold start process partially mitigates the issue of representativeness by incorporating a ‘resampling and replace’ phase to increase the diversity of items in the seed set. However, for sufficiently large datasets this may not be enough to capture rare items that are also outliers for the scale. For future work, we envision enabling annotators to rescale the visible scale itself through an interaction similar to zooming in or out, allowing the annotation of items that lie outside the current extremes of the scale when they are encountered.
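To make the link between seed coverage and measurement resolution concrete, the following minimal Python sketch treats each seed as a point on a normalized scale. It is an idealization, not the paper's procedure: it assumes the annotator can cleanly judge every listed seed as lower or higher than the item, and the function names `bracket` and `coarsest_resolution` are hypothetical.

```python
def bracket(below, above, scale=(0.0, 1.0)):
    """Range left for an item once distinguishable seeds are excluded.

    below: positions of seeds judged clearly lower than the item
    above: positions of seeds judged clearly higher than the item
    """
    lower = max(below, default=scale[0])
    upper = min(above, default=scale[1])
    if lower > upper:
        raise ValueError("judgments are inconsistent with a single range")
    return lower, upper


def coarsest_resolution(seed_positions, scale=(0.0, 1.0)):
    """Widest gap between neighboring seeds (or a seed and a scale end).

    Under the idealization above, no item can be bracketed more tightly
    than the gap it falls into, so this is the worst-case range width.
    """
    points = sorted({scale[0], *seed_positions, scale[1]})
    return max(b - a for a, b in zip(points, points[1:]))
```

With seeds at 0.25, 0.5, and 0.75 the widest gap is 0.25, while dropping the seed at 0.75 leaves a gap of 0.5 at the top of the scale, which illustrates why both the number of seeds and their coverage matter.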
Another limitation of the cold start process is that it cannot effectively capture item ambiguity, as we only elicit a single label for each reference item. In pilot studies we found it infeasible to introduce ranges into the cold start process, as there are no anchors to compare against when determining these ranges. A suboptimal seed set may therefore contain seed items that are highly ambiguous themselves, and this ambiguity acts as a lower bound on range sizes. We hypothesize that the iterative improvement process in 5.2 may offer a way to limit the impact of the cold start seeds if subsequent annotation rounds are instead seeded with regular annotated range data, though we leave exploration of this to a future study.
5.3.2. Addressing Long-form Tasks and Context
Some common tasks where crowd scalar ratings are desirable, such as evaluating relevance, conciseness, fluency, or faithfulness of summaries produced by text summarization models, can depend on understanding long-form context (e.g., a news article) or even multiple documents (Fabbri et al., 2020). While we have shown that Goldilocks can support annotation domains based on small amounts of text (1-2 sentences) using an interface similar to the one used for images, long-form text will require a different design for conducting comparisons with both the global scale and the local neighborhood.
Additionally, interactions in Goldilocks assume that items can be compared against other items in the same dataset. However, when rating items with context, such as summarization or translation, it is likely that reasonable comparisons can only be made with other items sharing the same context (i.e., alternate summaries or translations of the same source). A potential avenue for future work is to introduce virtual views of the Goldilocks scale that enable contextual comparisons by exposing only items that share the same context. Future work on an algorithm for determining optimal global example anchors could also take into account aspects that make comparison easier, such as similarity to the item being annotated.
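One way to picture such a virtual view is as a filter over the anchor set. The Python sketch below is purely illustrative: the `Anchor` record, its `context_id` field (for instance, the id of the source article a summary was written for), and the `virtual_view` function are hypothetical names, and ordering anchors by range midpoint is an assumption made for display purposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    item_id: str
    context_id: str  # hypothetical: id of the shared source document
    lower: float     # lower bound of the anchor's annotated range
    upper: float     # upper bound of the anchor's annotated range


def virtual_view(anchors, context_id):
    """Return only the anchors sharing `context_id`, ordered along the scale."""
    same_context = [a for a in anchors if a.context_id == context_id]
    # Sort by range midpoint so the view can be laid out left to right.
    return sorted(same_context, key=lambda a: (a.lower + a.upper) / 2)
```

All anchors still live on one global scale; the view only restricts which of them are exposed while a given item is being placed.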
5.3.3. Working with Density
One of the strengths of Goldilocks is the ability to use past annotations from any source, including data from existing datasets, to establish grounding for a scale. By providing past annotations from a dataset as reference examples, it is possible to augment the dataset in a way that is consistent with past examples without building complex rubrics. However, as the set of past annotations grows, it poses potential problems for the local comparison aspect of the Goldilocks annotation process. There are practical limits on how fine adjustments can be on a slider-based scale, so as regions of the scale become densely populated with examples, it becomes harder to use local comparisons to find precise upper and lower bounds in those regions. Even small adjustments in a dense region can mean moving across many reference points.
One potential solution to the density problem is to allow the scale itself to be rescaled, similar to the proposal in 5.3.1. Initially, the full view of the scale is presented along with global anchors for coarse navigation. As an annotator narrows down on a dense region, they can increase the zoom level of the annotation scale so that just the dense region spans the entire width of the scale, increasing the amount of space between reference points and in turn reducing the interaction issues caused by density. New global anchors can be selected to allow for quick navigation at the new zoom level.
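The effect of zooming on density can be illustrated with a small Python sketch of the mapping between scale values and slider pixels. This is a sketch under assumptions rather than a description of an implemented feature: the `Viewport` class, its fields, and its methods are hypothetical, and the scale is assumed to be normalized to the interval from 0 to 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Viewport:
    """Portion of the rating scale currently stretched across the slider."""

    lo: float      # scale value at the left edge of the slider
    hi: float      # scale value at the right edge of the slider
    width_px: int  # slider width in pixels

    def to_px(self, value):
        """Pixel offset at which a scale value is drawn."""
        return (value - self.lo) / (self.hi - self.lo) * self.width_px

    def to_value(self, x_px):
        """Scale value under a given pixel offset."""
        return self.lo + (self.hi - self.lo) * x_px / self.width_px

    def zoom(self, center, factor):
        """New viewport around `center`; factor > 1 zooms in, < 1 zooms out."""
        half = (self.hi - self.lo) / (2 * factor)
        return Viewport(center - half, center + half, self.width_px)

    def visible(self, positions):
        """Reference points that fall inside the current viewport."""
        return [p for p in positions if self.lo <= p <= self.hi]
```

On a 400-pixel slider showing the full scale, two reference points at 0.5 and 0.5625 sit 25 pixels apart; after `zoom(0.5, 4)` the same points are 100 pixels apart, so a small drag no longer crosses many anchors. A factor below 1 widens the viewport past the current extremes, which corresponds to the zoom-out interaction proposed in 5.3.1.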
In this paper, we present and evaluate Goldilocks, a novel technique for eliciting scalar annotations from the crowd that improves consistency and captures pairwise relationships more robustly. We show that prior examples can be used as anchors to ground otherwise abstract absolute rating scales (such as semantic or Likert scales), leading to more consistent interpretation between workers. We find that including an annotator’s past annotations in a session can lead to more self-consistency on items that have high initial uncertainty. Finally, we show that introducing range annotation into absolute rating enables the simultaneous elicitation of perceived ambiguity on a per-annotator basis and of inter-annotator disagreement. This simultaneous measurement enables a better recovery of pairwise relationship distributions.
- Meteor, m-bleu and m-ter: evaluation metrics for high-correlation with human rankings of machine translation output. In WMT@ACL, Cited by: §2.1.
- A novel methodology for developing automatic harassment classifiers for twitter. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pp. 7–15. Cited by: §1.
- Crowdsourcing subjective tasks: the case study of understanding toxicity in online discussions. In Companion Proceedings of The 2019 World Wide Web Conference, pp. 1100–1105. Cited by: §1.
- Truth is a lie: crowd truth and the seven myths of human annotation. AI Magazine 36 (1), pp. 15–24. Cited by: §2.1, §2.3.
- Uncertainty as a form of transparency: measuring, communicating, and using uncertainty. ArXiv abs/2011.07586. Cited by: §5.1.
- Investigating differences in crowdsourced news credibility assessment: raters, tasks, and expert criteria. Proceedings of the ACM on Human-Computer Interaction (CSCW). Cited by: §1, §2.1.
- Food-101 - mining discriminative components with random forests. In ECCV, Cited by: §4.2.2.
- Robust supervised classification with mixture models: learning from data with uncertain labels. Pattern Recognit. 42, pp. 2649–2658. Cited by: §2.4.
- Sprout: crowd-powered task design for crowdsourcing. Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology. Cited by: §2.3.
- A temporal ratio model of memory. Psychological review 114 (3), pp. 539–576. Cited by: §2.2.
- Measuring ‘expected satiety’ in a range of common foods using a method of constant stimuli. Appetite 51 (3), pp. 604–614. Cited by: §4.2.2.
- Re-evaluating the role of bleu in machine translation research. In EACL, Cited by: §2.1.
- Revolt: collaborative crowdsourcing for labeling machine learning datasets. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pp. 2334–2346. Cited by: §2.3.
- Alloy: clustering with crowds and computation. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 3180–3191. Cited by: §3.3.
- Cicero: multi-turn, contextual argumentation for accurate crowdsourcing. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–14. Cited by: §2.3.
- Efficient elicitation approaches to estimate collective crowd answers. Proceedings of the ACM on Human-Computer Interaction 3 (CSCW), pp. 1–25. Cited by: §2.3, §3.2.
- Why rate when you could compare? using the “elochoice” package to assess pairwise comparisons of perceived physical strength. PLoS ONE 13. Cited by: §1, §2.2.
- Maximum likelihood estimation of observer error‐rates using the em algorithm. Journal of The Royal Statistical Society Series C-applied Statistics 28, pp. 20–28. Cited by: §2.3.
- Choosing the right evaluation for machine translation: an examination of annotator and automatic metric performance on human judgment tasks. Cited by: §2.1.
- Toward a learning science for complex crowdsourcing tasks. Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. Cited by: §2.3.
- MicroTalk: using argumentation to improve crowdsourcing accuracy. In HCOMP, pp. 32–41. Cited by: §2.3.
- Capturing ambiguity in crowdsourcing frame disambiguation. In HCOMP, Cited by: §1, §2.3.
- Crowdsourcing ground truth for medical relation extraction. ACM Transactions on Interactive Intelligent Systems (TiiS) 8, pp. 1 – 20. Cited by: §3.2.
- Crowdsourcing disagreement for collecting semantic annotation. In ESWC, Cited by: §2.3.
- SummEval: re-evaluating summarization evaluation. ArXiv abs/2007.12626. Cited by: §5.3.2.
- Interestingness prediction by robust learning to rank. In ECCV, Cited by: §4.2.3.
- Analyzing ordinal scales in studies of virtual environments: likert or lump it!. PRESENCE: Teleoperators and Virtual Environments 16, pp. 439–446. Cited by: §2.2.
- RealToxicityPrompts: evaluating neural toxic degeneration in language models. In EMNLP, Cited by: §1.
- Are we modeling the task or the annotator? an investigation of annotator bias in natural language understanding datasets. ArXiv abs/1908.07898. Cited by: §2.2.
- Explaining and harnessing adversarial examples. CoRR abs/1412.6572. Cited by: §2.4.
- The disagreement deconvolution: bringing machine learning performance metrics in line with reality. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, Cited by: §2.3.
- Continuous measurement scales in human evaluation of machine translation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, Sofia, Bulgaria, pp. 33–41. Cited by: §2.1.
- On calibration of modern neural networks. ArXiv abs/1706.04599. Cited by: §1.
- CrowdVerge: predicting if people will agree on the answer to a visual question. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pp. 3511–3522. Cited by: §2.3.
- Crowdsourced detection of emotionally manipulative language. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. Cited by: §2.1.
- Search evaluation at google. Note: https://googleblog.blogspot.com/2008/09/search-evaluation-at-google.html Cited by: §2.1.
- Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. arXiv: Learning. Cited by: §2.2, §2.4.
- Rank aggregation via heterogeneous thurstone preference models. In AAAI, Cited by: §2.2.
- Embracing ambiguity: a comparison of annotation methodologies for crowdsourcing word sense labels. In HLT-NAACL, Cited by: §2.3.
- Parting crowds: characterizing divergent interpretations in crowdsourced annotation tasks. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, pp. 1637–1648. Cited by: §2.3.
- Genie: a leaderboard for human-in-the-loop evaluation of text generation. arXiv preprint arXiv:2101.06561. Cited by: §2.1.
- Capturing reliable fine-grained sentiment associations by crowdsourcing and best–worst scaling. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 811–817. Cited by: §2.2, §2.3, §3.2.
- Structured labeling for facilitating concept evolution in machine learning. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 3075–3084. Cited by: §2.3.
- Has machine translation achieved human parity? a case for document-level evaluation. ArXiv abs/1808.07048. Cited by: §3.2.
- Beyond user self-reported likert scale ratings: a comparison model for automatic dialog evaluation. In ACL, Cited by: §2.2.
- A technique for the measurement of attitudes. Archives of psychology. Cited by: §2.2.
- To re(label), or not to re(label). In HCOMP, Cited by: §5.1.
- Re-active learning: active learning with relabeling. Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pp. 1845–1852. Cited by: §2.3.
- Exploring iterative and parallel human computation processes. In Proceedings of the ACM SIGKDD Workshop on Human Computation, HCOMP ’10, New York, NY, USA, pp. 68–76. Cited by: §5.2.
- Effective crowd annotation for relation extraction. In Proceedings of NAACL and HLT 2016, Cited by: §2.3, §3.4.
- Credbank: a large-scale social media corpus with associated credibility annotations. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 9. Cited by: §1, §2.1.
- NELA-gt-2018: a large multi-labelled news dataset for the study of misinformation in news articles. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 13, pp. 630–638. Cited by: §1.
- The measurement of meaning. University of Illinois press. Cited by: §2.2.
- Improving uncertainty estimates through the relationship with adversarial robustness. ArXiv abs/2006.16375. Cited by: §2.4.
- Know what you don’t know: unanswerable questions for squad. ArXiv abs/1806.03822. Cited by: §2.4.
- Efficient online scalar annotation with bounded support. ArXiv abs/1806.01170. Cited by: §2.2.
- Online hate interpretation varies by country, but more by individual: a statistical analysis using crowdsourced ratings. In 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 88–94. Cited by: §1, §4.2.1.
- Resolvable vs. irresolvable disagreement: a study on worker deliberation in crowd work. Proceedings of the ACM on Human-Computer Interaction 2 (CSCW), pp. 1–19. Cited by: §2.3.
- ChatEval: a tool for chatbot evaluation. In NAACL-HLT, Cited by: §2.1.
- Finding convincing arguments using scalable bayesian preference learning. Transactions of the Association for Computational Linguistics 6, pp. 357–371. Cited by: §3.1.
- Findings of the WMT 2020 shared task on machine translation robustness. In Proceedings of the Fifth Conference on Machine Translation, Online, pp. 76–91. Cited by: §2.1.
- Absolute identification by relative judgment. Psychological review 112 (4), pp. 881–911. Cited by: §4.5.1.
- Decision by sampling. Cited by: §3.1.
- A law of comparative judgment. Psychological review 34 (4), pp. 273. Cited by: §2.2.
- Max-margin majority voting for learning from crowds. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (10), pp. 2480–2494. Cited by: §2.3.
- Modifications of the multi stimulus test with hidden reference and anchor (mushra) for use in audiology. International Journal of Audiology 57, pp. S104 – S92. Cited by: §2.3, §3.1.
- Similarity comparisons for interactive fine-grained categorization. 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 859–866. Cited by: §3.1.
- The calibrated sigma method: an efficient remedy for between-group differences in response category use on likert scales. International Journal of Research in Marketing 33, pp. 944–960. Cited by: §1, §2.2.
- Metrology for ai: from benchmarks to instruments. arXiv preprint arXiv:1911.01875. Cited by: §1, §2.1, §2.2, §4.5.1.
- Ex machina: personal attacks seen at scale. In Proceedings of the 26th International Conference on World Wide Web, pp. 1391–1399. Cited by: §1, §2.1, §4.2.1.
- Ranking with uncertain labels. 2007 IEEE International Conference on Multimedia and Expo, pp. 96–99. Cited by: §2.4.
- Regression from uncertain labels and its applications to soft biometrics. IEEE Transactions on Information Forensics and Security 3, pp. 698–708. Cited by: §2.4.
- Almost an expert: the effects of rubrics and expertise on perceived value of crowdsourced design critiques. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, CSCW ’16, New York, NY, USA, pp. 1005–1017. Cited by: §2.3.
- Evaluation and refinement of clustered search results with the crowd. ACM Transactions on Interactive Intelligent Systems (TiiS) 8 (2), pp. 1–28. Cited by: §2.1.
- Quantifying facial age by posterior of age comparisons. ArXiv abs/1708.09687. Cited by: §3.1.