
DEXA: Supporting Non-Expert Annotators with Dynamic Examples from Experts

by   Markus Zlabinger, et al.
TU Wien

The success of crowdsourcing-based annotation of text corpora depends on ensuring that crowdworkers are sufficiently well-trained to perform the annotation task accurately. To that end, a frequent approach to train annotators is to provide instructions and a few example cases that demonstrate how the task should be performed (referred to as the CONTROL approach). These globally defined "task-level examples", however, (i) often only cover the common cases that are encountered during an annotation task; and (ii) require effort from crowdworkers during the annotation process to find the most relevant example for the currently annotated sample. To overcome these limitations, we propose to support workers not only with task-level examples but also with "task-instance level" examples that are semantically similar to the currently annotated data sample (referred to as Dynamic Examples for Annotation, DEXA). Such dynamic examples can be retrieved from collections previously labeled by experts, which are usually available as a gold standard dataset. We evaluate DEXA on a complex task of annotating participants, interventions, and outcomes (known as PIO) in sentences of medical studies. The dynamic examples are retrieved using BioSent2Vec, an unsupervised semantic sentence similarity method specific to the biomedical domain. Results show that (i) workers of the DEXA approach reach on average much higher agreement (Cohen's Kappa) with experts than workers of the CONTROL approach (avg. of 0.68 to experts in DEXA vs. 0.40 in CONTROL); (ii) aggregating just three annotations per sentence by majority voting in the DEXA approach already reaches substantial agreement with experts of 0.78/0.75/0.69 for P/I/O (in CONTROL 0.73/0.58/0.46). Finally, (iii) we acquire explicit feedback from workers and show that in the majority of cases (avg. 72



1. Introduction

The success of crowdsourcing based annotation of text corpora depends on ensuring that crowdworkers are sufficiently well-trained to perform the annotation task accurately. Reaching a certain quality threshold is challenging, especially in tasks that require specific expertise to be performed (e.g. in the medical domain (Nye et al., 2018)).

The common approach to compensate for the missing knowledge of individual non-expert workers is to train them via task instructions and a few example cases that demonstrate how the task should be performed (Nye et al., 2018; Snow et al., 2008) (referred to as the Control approach). These globally defined task-level examples, however, often (i) only cover the common cases that are encountered during an annotation task and (ii) require effort from crowdworkers during the annotation process to find the most relevant example for the currently annotated sample.

In this paper, we address these limitations with a new annotation approach called Dynamic Examples for Annotation (Dexa). In addition to task-level examples, annotators are supported with task-instance level examples that are semantically similar to the currently annotated sample. The task-instance examples are retrieved from data samples previously annotated by experts. Such expert samples are usually available since they are crucial to measure the quality of non-expert annotators (Snow et al., 2008; Daniel et al., 2018; Doroudi et al., 2016). We propose to split the expert samples into training samples from which dynamic examples are retrieved and test samples which are injected into the annotation process to measure worker performance.

We apply the Dexa approach to a task from the medical domain known as the PIO task, in which annotators label the Participants (P), Interventions (I), and Outcomes (O) in clinical trial reports. (The difference to the PICO task is that Intervention and Control are not differentiated (Nye et al., 2018).) Specifically, we ask non-expert annotators to highlight the exact text phrases that describe either P, I, or O within the sentences of clinical trial reports; to reduce overhead for workers, we split the PIO task into three individual sub-tasks. The trial reports used in our experiments stem from the EBM-NLP corpus (Nye et al., 2018), for which gold standard PIO labels are available. For the retrieval of dynamic examples, we use BioSent2Vec (Chen et al., 2019), an unsupervised semantic short-text similarity method specific to the biomedical domain.

We compare Dexa to the Control approach with respect to the annotation quality of individual workers and the annotation quality of aggregated (e.g. majority vote) redundant annotations from multiple workers. To measure the annotation quality of non-expert workers, we compute the inter-annotator agreement with the gold standard labels using Cohen’s Kappa. Our results show that (i) workers using the Dexa approach reach on average higher agreement with experts than workers using the Control approach (avg. of 0.68 in Dexa vs. 0.40 in Control); (ii) aggregating just three annotations per sentence by majority voting in the Dexa approach already leads to substantial agreement with experts of 0.78/0.75/0.69 for P/I/O (in Control 0.73/0.58/0.46). Finally, (iii) we acquire explicit feedback from workers on the usefulness of the dynamic examples and show that in the majority of cases (avg. 72%) workers find the dynamic examples useful. For these useful examples, they reach a higher agreement with experts (averaged over all PIO sub-tasks) than for other examples.

The contributions of this paper are:

  • We propose Dexa, a new annotation approach for the collection of high-quality annotations from non-experts.

  • We apply the approach to the complex PIO annotation task and show high agreements between non-experts and experts.

  • We make the collected crowdsourcing annotations and the code used for experiments available online.

After discussing related work (Sec. 2), we describe the DEXA approach (Sec. 3) and its evaluation on the PIO task (Sec. 4 and 5).

2. Related Work

A common strategy to obtain higher quality labels from non-expert annotators is to redundantly collect annotations for each data sample, and then apply an aggregation method to create a final label that is of a higher quality than the individual labels (Sheng et al., 2008; Snow et al., 2008; Sabou et al., 2012). A simple aggregation method is majority voting. More sophisticated methods aim to identify reliable annotators and weight their annotations more strongly than the annotations of less reliable annotators (Dawid and Skene, 1979). Note that the Dexa approach can be combined with aggregation strategies, as we do in our experiments.

Several techniques to improve the quality of individual annotators are summarized by Daniel et al. (2018), for example, improving worker motivation (e.g. higher payment), task simplification, providing constant feedback to workers, or filtering of unreliable workers. Besides these techniques, various annotation approaches have been proposed. For example, Kobayashi et al. (2018) allow workers to change their annotation of a data sample after showing how other workers have annotated the sample. By examining the samples of other crowdworkers, a learning effect is induced in the crowdworkers, increasing their accuracy for the annotation of future samples. Suzuki et al. (2016) propose a system where inexperienced annotators can seek advice from experts, so-called mentors. Through mentoring, inexperienced annotators should obtain the skills that are required for a task and produce labels of high quality.

While the presented literature studies various aspects of improving the quality of individual non-expert annotators, little is known about how to effectively present demonstration examples (Doroudi et al., 2016) and whether such samples are effective in increasing the annotation quality. We give new insights into this topic in this study.

3. Dynamic Examples for Annotation

In this section, we describe our novel annotation approach, called Dynamic Examples for Annotation (Dexa). We show examples to annotators on a task-instance level, i.e., selected dynamically for the sample instance that is currently annotated. Given a set $E$ of labeled expert samples and a set $U$ of samples to be labeled by non-experts, the Dexa annotation approach consists of the following steps:

  1. The samples of $E$ are divided into a test set $E_{test}$ and a training set $E_{train}$, where $E_{test} \cap E_{train} = \emptyset$. From the training set, the dynamic examples are drawn. The samples from the test set are injected into $U$ to measure the quality of the non-expert annotators, resulting in the annotation set $A = U \cup E_{test}$.

  2. An unsupervised similarity method $sim$ is selected to compute the semantic similarity between a sample of the training set and a sample of the annotation set. The similarity method should be selected based on the task at hand. For example, in our experiments, samples are sentences, and therefore, we use a semantic sentence-to-sentence similarity method, as described in Section 4.3.

  3. The annotation set $A$ is labeled by non-experts. For each unlabeled sample $a \in A$, the similarity method is used to compute the similarity to each sample in the training set, i.e., $sim(a, e)$ for all $e \in E_{train}$. Then, the top $k$ most similar samples are shown as dynamic demonstration examples to the annotators.

  4. Finally, the accuracy of non-expert annotators is compared to that of expert annotators based on the test samples that were injected into the annotation set in step (1).
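The steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `test_fraction` parameter, and the token-overlap (Jaccard) similarity are assumptions chosen to keep the example self-contained, and samples are represented as plain strings without their labels. In practice, the similarity function would be replaced by a method suited to the task.

```python
import random
from typing import Callable, List, Sequence, Tuple


def split_expert_samples(
    expert: Sequence[str], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Step 1: split expert samples into disjoint test and training sets."""
    shuffled = list(expert)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[:n_test], shuffled[n_test:]


def jaccard(a: str, b: str) -> float:
    """Toy stand-in similarity (token overlap); an assumption, not the paper's method."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def top_k_examples(
    sample: str, train: Sequence[str], sim: Callable[[str, str], float], k: int = 3
) -> List[str]:
    """Step 3: return the k training samples most similar to the given sample."""
    return sorted(train, key=lambda e: sim(sample, e), reverse=True)[:k]


# Usage with three hypothetical training sentences.
train = [
    "There were no serious adverse events.",
    "Referral occurred at any stage of admission.",
    "Patients received doxycycline for 14 days.",
]
query = "Adverse events did not significantly differ in the 2 groups."
examples = top_k_examples(query, train, jaccard, k=1)
```

Passing the similarity function as an argument keeps the retrieval step independent of the chosen method, which mirrors the remark in step (2) that the method should be selected per task.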

4. Evaluating DEXA on PIO tasks

In the PIO annotation task (Huang et al., 2006; Nye et al., 2018), annotations are collected for the Participants (e.g., “patients with headache”), Interventions (e.g., “ibuprofen”), and Outcomes (e.g., “pain reduction”) of medical studies. Due to the complexity of this task, PIO elements were initially only annotated on a binary sentence level (Kim et al., 2011), where a sentence was labeled according to whether it contained a P, I, or O. Recently, fine-grained text span annotations were collected (Nye et al., 2018), with annotators highlighting the exact text phrases within a sentence that describe P, I, or O. However, using standard task-level training for this task resulted in non-expert workers reaching only weak agreement with experts (Nye et al., 2018). To evaluate Dexa, we apply it in the setting of Nye et al. (2018), where we augment task-level examples with dynamic task-instance level examples.

4.1. Dataset

We consider the 191 clinical trial reports of the EBM-NLP corpus (Nye et al., 2018), where gold standard labels are available for each trial and PIO element. The reports originate from PubMed and consist of a title and an abstract. As preprocessing steps, we use Stanford CoreNLP (Manning et al., 2014) to segment the reports into sentences and NLTK (Bird et al., 2009) to tokenize the sentences. Next, we split the 191 reports into a test set for evaluation (41 reports with 426 sentences) and a training set (150 reports with 1,636 sentences), from which dynamic examples are retrieved for the Dexa approach. Note that the test sentences are usually injected into a much larger set for which no gold labels are available (see Step (1) in Section 3); however, in this study, we aim to evaluate our annotation approach and therefore only annotate sentences for which gold standard labels exist.

4.2. Annotation Setup

We follow the annotation setup described in (Nye et al., 2018) with crowdworkers hired from the Amazon Mechanical Turk (AMT). Annotations for P, I and O are divided into three individual sub-tasks to reduce the cognitive overhead for workers. For each sub-task, annotation instructions and a few task-level examples are provided to workers, available as an appendix in (Nye et al., 2018). Workers are allowed to participate in one of the sub-tasks if their work approval rate for previous tasks is at least 90%. A small-scale test run is performed to filter out spammers and workers who do not follow the task instructions. Workers who pass the test run qualify for the full-scale run.

4.3. Dexa Approach

Within the annotation setup described above, we apply the Dexa approach to collect non-expert labels for the 426 test sentences. We develop an annotation interface that can be embedded as a design layout in the AMT platform. In each HIT, we ask workers to annotate, depending on the sub-task, either P, I, or O within a sentence. For each sentence, we present three dynamic examples ($k = 3$), and we acquire feedback from workers on whether they found at least one of these examples useful to support their annotation work.

The dynamic examples that we show to support annotators are retrieved from the training set using the sentence embedding model BioSent2Vec (Chen et al., 2019). Specifically, for a sentence $a$ of the annotation set and a sentence $e$ of the training set, we compute the cosine similarity

$$sim(a, e) = \frac{v_a \cdot v_e}{\lVert v_a \rVert \, \lVert v_e \rVert},$$

where $v_a$ and $v_e$ refer to the BioSent2Vec embeddings of the sentences $a$ and $e$. We use BioSent2Vec since (i) it is the state-of-the-art for various short-text similarity tasks in the biomedical domain, and (ii) a pre-trained model is available that was trained on PubMed (Chen et al., 2019), which is the same underlying data source as the clinical trial reports used in our study.
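As an illustration of the similarity computation, the following sketch implements cosine similarity between two embedding vectors in plain Python. The three-dimensional vectors are hypothetical toy values; in the setting described here, they would be sentence embeddings produced by the pre-trained BioSent2Vec model.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical embeddings of a test sentence and two training sentences.
v_test = [0.2, 0.7, 0.1]
v_train_1 = [0.1, 0.8, 0.1]
v_train_2 = [0.9, 0.0, 0.3]

closer_to_first = cosine_similarity(v_test, v_train_1) > cosine_similarity(v_test, v_train_2)
```

Because cosine similarity depends only on the direction of the vectors, sentences of different lengths can be compared without normalizing the embeddings beforehand.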

S: We performed a randomized, controlled study comparing the prophylactic effects of capsule forms of fluconazole (n=110) and itraconazole (n=108) in patients with acute myeloid leukemia (AML) or myelodysplastic syndromes (MDS) during and after chemotherapy.
D: A randomized, double-blind, placebo-controlled study on the immediate clinical and microbiological efficacy of doxycycline (100mg for 14 days) was carried out to determine the benefit of adjunctive medication in 16 patients with localized juvenile periodontitis.
S: Adverse events did not significantly differ in the 2 groups.
D: There were no serious adverse events.
S: The majority (63%) of the project group had no admission during the 10 month study period.
D: Referral occurred at any stage of the patients’ EECU admission.
Table 1. For three samples of the test set (S), we show the most similar dynamic example (D). Gold labels are highlighted for Participants, Interventions, and Outcomes in all sentences. Note that to workers only the labels for either P, I, or O (depending on the sub-task) within the dynamic examples are visible.

In Table 1, we illustrate three sample sentences of the test set and the corresponding most similar dynamic example of the training set. The first case shows that the dynamic example provides strong support in annotating P, I, and O, even though the sentence is rather complex and long. The middle case shows a dynamic example that provides support in annotating the O element. Finally, the last case shows that no appropriate dynamic example is found for the sample. In such cases, workers need to decide independently.

4.4. Control Approach

We compare annotations obtained via the Dexa approach to the non-expert annotations that were previously obtained in the scope of (Nye et al., 2018) using the Control approach, for which we downloaded the published annotations of individual crowdworkers. Note that we decided to re-use the available annotations rather than re-collecting them, since the same annotation setup is followed in both approaches (Section 4.2). Although a different annotation interface was used by Nye et al. (2018), the interaction component of clicking a start and end word to annotate a P, I, or O text phrase is identical in both approaches.

5. Results

We compare the Dexa approach to the Control approach based on the 426 sentences of the test set. For the Control approach, at least 3 redundant non-expert annotations are available per sentence, although the average is higher. For the Dexa approach, we collect exactly 3 redundant annotations per sentence, resulting in a total of 1,278 sentence annotations per PIO sub-task. In total, 26 workers contributed to annotating the test set using the Dexa approach. In contrast, 403 workers contributed to the Control approach (Nye et al., 2018), because of (i) the goal of collecting more redundant samples and (ii) an additional goal of labelling clinical trials only by non-experts.

Agreement of Individuals

We compare annotations obtained by Dexa and Control based on the inter-annotator agreement with the gold standard annotations using Cohen’s Kappa. To eliminate random noise from workers who labeled only a few sentences, we do not analyze workers of Dexa and Control who labeled fewer than 5% of the total of 426 test sentences (about 21 sentences).
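For reference, Cohen's Kappa corrects the observed agreement between two annotators for the agreement expected by chance. The sketch below computes it for two label sequences; treating each token as one binary decision (1 if the token is inside a highlighted phrase, 0 otherwise) is an assumption made for this example, and the label values are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's Kappa between two equally long label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of positions with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical token-level labels for one sentence (1 = token belongs to a P phrase).
gold = [1, 1, 0, 0, 0, 0, 1, 0]
worker = [1, 0, 0, 0, 0, 0, 1, 0]
kappa = cohen_kappa(gold, worker)  # observed 7/8, expected 36/64, kappa = 5/7
```

In this toy case the worker misses one of three gold tokens, which yields a raw agreement of 0.875 but a Kappa of roughly 0.71, since agreement on the frequent label 0 is largely expected by chance.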

The results in Figure 1 show that the median agreement with experts is substantially higher in the Dexa approach than in the Control approach, especially for I and O. Notice that the kappa scores of workers of the Control approach range from 0.0 to nearly 1.0, probably affected by the higher number of redundantly collected labels. Notable is also the one worker of the Dexa approach who underperformed in the I sub-task compared to the other workers, illustrated as a dot.

Figure 1. Cohen’s $\kappa$ between annotations of individual non-expert annotators compared to the gold standard.

Agreement of Aggregations

We analyze the quality of aggregated annotations obtained with two aggregation strategies: majority voting (MV), which weights individual workers equally, and the Dawid-Skene (DS) model (Dawid and Skene, 1979), which recognizes reliable annotators and weights them more strongly in the aggregation procedure. For Dexa, we aggregate the 3 available labels, while for Control, we create aggregations from different numbers of labels $n$ (3, 6, 9, or all available labels), as follows: (i) for each task-instance, we randomly pick $n$ annotations from the available redundant ones; (ii) we repeat step (i) 20 times and compute the agreement $\kappa$ with the gold standard label at each iteration; (iii) the 20 $\kappa$-values are averaged to compute the final $\kappa$.
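The sampling procedure for the Control annotations can be sketched as follows. This is an illustrative reading under stated assumptions, not the code used in the experiments: annotations are assumed to be token-level binary label sequences, ties in the vote are resolved as 0, Kappa is computed over the tokens of all instances together, and all function and parameter names are hypothetical. Only majority voting is shown; the Dawid-Skene model is not implemented here.

```python
import random
from collections import Counter
from statistics import mean
from typing import List, Sequence, Tuple

Labels = List[int]


def majority_vote(annotations: Sequence[Labels]) -> Labels:
    """Per-token majority over binary label sequences (ties resolved as 0)."""
    n = len(annotations)
    return [1 if 2 * sum(token_votes) > n else 0 for token_votes in zip(*annotations)]


def cohen_kappa(a: Labels, b: Labels) -> float:
    """Cohen's Kappa between two equally long label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    fa, fb = Counter(a), Counter(b)
    expected = sum(fa[c] * fb[c] for c in set(fa) | set(fb)) / (n * n)
    return 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)


def averaged_kappa(
    instances: Sequence[Tuple[Labels, Sequence[Labels]]],
    n: int,
    repeats: int = 20,
    seed: int = 0,
) -> float:
    """Average Kappa over repeated random draws of n annotations per instance.

    Each instance is a pair (gold_labels, worker_annotations).
    """
    rng = random.Random(seed)
    kappas = []
    for _ in range(repeats):
        predicted, gold_all = [], []
        for gold, annotations in instances:
            # (i) randomly pick n of the available redundant annotations
            picked = rng.sample(list(annotations), min(n, len(annotations)))
            predicted.extend(majority_vote(picked))
            gold_all.extend(gold)
        # (ii) agreement with the gold standard for this draw
        kappas.append(cohen_kappa(predicted, gold_all))
    # (iii) average over all draws
    return mean(kappas)


# Usage: one hypothetical sentence with four redundant annotations.
gold = [1, 0, 0, 1]
workers = [[1, 0, 0, 1], [1, 0, 0, 0], [1, 0, 1, 1], [0, 0, 0, 1]]
score = averaged_kappa([(gold, workers)], n=3)
```

Fixing the random seed makes the averaged score reproducible, which is useful when comparing different values of `n` on the same annotations.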

The results in Table 2 show that the $\kappa$ score between non-experts and experts is in almost all cases higher for aggregated annotations obtained with Dexa compared to Control. Even when 6, 9 or all labels of the Control approach are aggregated, the 3 aggregated annotations of the Dexa approach reach a higher agreement with the gold standard for the I and O sub-tasks. Only for P, we observed that the aggregation of all redundant Control annotations surpasses the score of our Dexa approach.

Notable is the effectiveness of the DS aggregation for the redundant labels of the Control approach. Especially for P, the high agreements of individual workers using the Control approach (Figure 1) lead to a strong aggregated result via DS (Table 2). No improvements are observed when using DS over MV to aggregate the 3 redundant Dexa labels, which is expected since the noise of individual annotators is low (as shown in Figure 1).

                                 Cohen’s Kappa ($\kappa$)
Aggregation  Approach  Labels    P      I      O
MV           Dexa      3         0.780  0.757  0.694
MV           Control   3         0.702  0.455  0.352
MV           Control   6         0.729  0.465  0.342
MV           Control   9         0.749  0.454  0.307
MV           Control   all       0.746  0.457  0.311
DS           Dexa      3         0.776  0.756  0.694
DS           Control   3         0.729  0.579  0.458
DS           Control   6         0.809  0.644  0.614
DS           Control   9         0.841  0.629  0.659
DS           Control   all       0.867  0.633  0.677
Table 2. Cohen’s $\kappa$ between the gold standard and non-expert annotations aggregated for different numbers of annotations via Majority Vote (MV) and Dawid-Skene (DS).

Worker Feedback

We analyze the feedback from workers on the usefulness of the dynamic examples in Table 3. The results show a high percentage of positive answers for all annotation tasks, especially for the more difficult tasks I (78%) and O (76%). Additionally, the (perceived) usefulness of the examples has an effect on the quality of the annotations. Indeed, the agreement of individual annotators (excluding workers who annotated fewer than 5% of the 426 test sentences) with the gold standard labels is on average much higher when the dynamic examples were found useful than otherwise.

            Percentage          Cohen’s Kappa ($\kappa$)
Feedback    P     I     O       P     I     O
Useful      64%   78%   76%     0.73  0.67  0.60
Not useful  36%   22%   24%     0.42  0.41  0.44
Table 3. Percentage of workers finding dynamic examples useful; average $\kappa$ scores (std. deviation) to the gold standard.

6. Conclusion

We presented the Dexa annotation approach in which non-expert annotators are supported not only by task-level annotation examples (as in Control) but also by dynamic, task-instance level examples that are semantically similar to the currently annotated sample. Evaluating Dexa on the PIO task led to: (i) improved quality of individual annotations: individual annotator agreement with expert annotations was on average higher for Dexa than Control; (ii) improved aggregated label quality: three annotations of the Dexa approach aggregated by majority voting reached on average higher agreement with experts than in the Control approach; (iii) explicit validation of dynamic example usefulness: workers found the proposed examples useful in the majority of cases (avg. 73% over PIO tasks) and label quality was consistently higher for cases when the examples were judged useful than otherwise.

As future work we will (i) optimize the parameter $k$, i.e., the number of dynamic examples shown per sample; (ii) investigate the effectiveness of different similarity methods for selecting examples through A/B testing; and (iii) evaluate Dexa on different domains and annotation tasks.


References

  • S. Bird, E. Klein, and E. Loper (2009) Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media.
  • Q. Chen, Y. Peng, and Z. Lu (2019) BioSentVec: creating sentence embeddings for biomedical texts. In Proc. of ICHI.
  • F. Daniel, P. Kucherbaev, C. Cappiello, B. Benatallah, and M. Allahbakhsh (2018) Quality Control in Crowdsourcing: A Survey of Quality Attributes, Assessment Techniques, and Assurance Actions. ACM Computing Surveys (CSUR) 51 (1), pp. 7:1–7:40.
  • A. P. Dawid and A. M. Skene (1979) Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics) 28 (1), pp. 20–28.
  • S. Doroudi, E. Kamar, E. Brunskill, and E. Horvitz (2016) Toward a Learning Science for Complex Crowdsourcing Tasks. In Proc. of CHI.
  • X. Huang, J. Lin, and D. Demner-Fushman (2006) Evaluation of PICO as a Knowledge Representation for Clinical Questions. Proc. of AMIA.
  • S. N. Kim, D. Martinez, L. Cavedon, and L. Yencken (2011) Automatic classification of sentences to support Evidence Based Medicine. BMC Bioinformatics 12 (2), pp. S5.
  • M. Kobayashi, H. Morita, M. Matsubara, N. Shimizu, and A. Morishima (2018) An empirical study on short-and long-term effects of self-correction in crowdsourced microtasks. In Proc. of AAAI.
  • C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky (2014) The Stanford CoreNLP Natural Language Processing Toolkit. In Proc. of ACL.
  • B. Nye, J. J. Li, R. Patel, Y. Yang, I. Marshall, A. Nenkova, and B. Wallace (2018) A Corpus with Multi-Level Annotations of Patients, Interventions and Outcomes to Support Language Processing for Medical Literature. In Proc. of ACL.
  • M. Sabou, K. Bontcheva, and A. Scharl (2012) Crowdsourcing research opportunities: lessons from natural language processing. In Proc. of i-KNOW.
  • V. S. Sheng, F. Provost, and P. G. Ipeirotis (2008) Get another label? improving data quality and data mining using multiple, noisy labelers. In Proc. of SIGKDD.
  • R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng (2008) Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks. In Proc. of EMNLP.
  • R. Suzuki, N. Salehi, M. S. Lam, J. C. Marroquin, and M. S. Bernstein (2016) Atelier: Repurposing expert crowdsourcing tasks as micro-internships. In Proc. of CHI.