The success of crowdsourcing-based annotation of text corpora depends on ensuring that crowdworkers are sufficiently well trained to perform the annotation task accurately. Reaching a certain quality threshold is challenging, especially for tasks that require specific expertise, e.g., in the medical domain (Nye et al., 2018).
The common approach to compensate for the missing knowledge of individual non-expert workers is to train them via task instructions and a few example cases that demonstrate how the task should be performed (Nye et al., 2018; Snow et al., 2008); we refer to this as the Control approach. These globally defined task-level examples, however, often (i) cover only the common cases that are encountered during an annotation task and (ii) require effort from crowdworkers during the annotation process to find the most relevant example for the currently annotated sample.
In this paper, we address these limitations with a new annotation approach called Dynamic Examples for Annotation (Dexa). In addition to task-level examples, annotators are supported with task-instance level examples that are semantically similar to the currently annotated sample. The task-instance examples are retrieved from data samples previously annotated by experts. Such expert samples are usually available since they are crucial to measure the quality of non-expert annotators (Snow et al., 2008; Daniel et al., 2018; Doroudi et al., 2016). We propose to split the expert samples into training samples from which dynamic examples are retrieved and test samples which are injected into the annotation process to measure worker performance.
We apply the Dexa approach to a task from the medical domain known as the PIO task, in which annotators label the Participants (P), Interventions (I), and Outcomes (O) in clinical trial reports. The PIO task differs from the PICO task in that Intervention and Control are not differentiated (Nye et al., 2018). Specifically, we ask non-expert annotators to highlight the exact text phrases that describe either P, I, or O within the sentences of clinical trial reports; to reduce overhead for workers, we split the PIO task into three individual sub-tasks. The trial reports used in our experiments stem from the EBM-NLP corpus (Nye et al., 2018), for which gold standard PIO labels are available. For the retrieval of dynamic examples, we use BioSent2Vec (Chen et al., 2019), an unsupervised semantic short-text similarity method specific to the biomedical domain.
We compare Dexa to the Control approach with respect to the annotation quality of individual workers and the annotation quality of aggregated (e.g., majority vote) redundant annotations from multiple workers. To measure the annotation quality of non-expert workers, we compute the inter-annotator agreement with the gold standard labels using Cohen’s Kappa. Our results show that (i) workers using the Dexa approach reach on average higher agreement with experts than workers using the Control approach (avg. of 0.68 in Dexa vs. 0.40 in Control); (ii) aggregating only three Dexa annotations by majority voting already leads to substantial agreement with experts of 0.78/0.75/0.69 for P/I/O (in Control 0.73/0.58/0.46). Finally, (iii) we acquire explicit feedback from workers on the usefulness of the dynamic examples and show that in the majority of cases (avg. 72%) workers find the dynamic examples useful. For these useful examples, they reach a higher agreement with experts (averaged over all PIO sub-tasks) than for other examples.
The contributions of this paper are:
- We propose Dexa, a new annotation approach for the collection of high-quality annotations from non-experts.
- We apply the approach to the complex PIO annotation task and show high agreements between non-experts and experts.
- We make the collected crowdsourcing annotations and the code used for experiments available at https://github.com/Markus-Zlabinger/pico-annotation
2. Related Work
A common strategy to obtain higher quality labels from non-expert annotators is to redundantly collect annotations for each data sample, and then apply an aggregation method to create a final label that is of a higher quality than the individual labels (Sheng et al., 2008; Snow et al., 2008; Sabou et al., 2012). A simple aggregation method is majority voting. More sophisticated methods aim to identify reliable annotators and weight their annotations as more important than the annotations of less reliable annotators (Dawid and Skene, 1979). Note that the Dexa approach can be combined with aggregation strategies, as we do in our experiments.
Several techniques to improve the quality of individual annotators are summarized by Daniel et al. (2018), for example, improving worker motivation (e.g., higher payment), task simplification, providing constant feedback to workers, or filtering of unreliable workers. Besides these techniques, various annotation approaches have been proposed. For example, Kobayashi et al. (2018) allow workers to change their annotation of a data sample after showing how other workers have annotated the sample. By examining the samples of other crowdworkers, a learning effect is induced in the crowdworkers, increasing their accuracy for the annotation of future samples. Suzuki et al. (2016) propose a system where inexperienced annotators can seek advice from experts, so-called mentors. Through mentoring, inexperienced annotators should obtain the skills that are required for a task and produce labels of high quality.
While the presented literature studies various aspects of improving the quality of individual non-expert annotators, little is known about how to effectively present demonstration examples (Doroudi et al., 2016) and whether such samples are effective in increasing the annotation quality. We give new insights into this topic in this study.
3. Dynamic Examples for Annotation
In this section, we describe our novel annotation approach, called Dynamic Examples for Annotation (Dexa). We show examples to annotators on a task-instance level, i.e., dynamic to the current sample instance that is annotated. Given a set of labeled expert samples and a set of samples to be labeled by non-experts, the Dexa annotation approach consists of the following steps:
(1) The expert samples are divided into a test set and a training set. From the training set, the dynamic examples are drawn. The samples from the test set are injected into the set of samples to be labeled by non-experts to measure the quality of the non-expert annotators, resulting in the annotation set.
(2) An unsupervised similarity method is selected to compute the semantic similarity between a sample of the training set and a sample of the annotation set. The similarity method should be selected based on the task at hand. For example, in our experiments, samples are sentences, and therefore, we use a semantic sentence-to-sentence similarity method, as described in Section 4.3.
(3) The annotation set is labeled by non-experts. For each unlabeled sample, the similarity method is used to compute the similarity to each sample in the training set. Then, the most similar training samples are shown as dynamic demonstration examples to the annotators.
(4) Finally, the accuracy of non-expert annotators is compared to that of expert annotators based on the test samples that were injected into the annotation set in step (1).
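The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names, the dictionary layout of a sample (`text`, `label`), and the `test_fraction` and `k` parameters are hypothetical, and the token-overlap `jaccard` function is only a toy stand-in for the similarity method chosen in step (2).

```python
import random


def split_expert_samples(expert_samples, test_fraction, seed=0):
    """Step 1: split expert-labeled samples into disjoint train and test sets."""
    shuffled = list(expert_samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def build_annotation_set(unlabeled_texts, test_samples, seed=0):
    """Step 1 (cont.): inject the expert test samples among the unlabeled ones."""
    annotation_set = list(unlabeled_texts) + [s["text"] for s in test_samples]
    random.Random(seed).shuffle(annotation_set)
    return annotation_set


def jaccard(a, b):
    """Step 2 (toy stand-in): token-overlap similarity between two sentences."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def dynamic_examples(sample_text, train_samples, similarity, k):
    """Step 3: return the k training samples most similar to the given sample."""
    ranked = sorted(
        train_samples,
        key=lambda t: similarity(sample_text, t["text"]),
        reverse=True,
    )
    return ranked[:k]
```

Step (4) then reduces to comparing worker labels with expert labels on the injected test samples. Passing the similarity function as an argument keeps the retrieval step independent of the concrete method, which matches the recommendation to select the method per task.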
4. Evaluating DEXA on PIO tasks
In the PIO annotation task (Huang et al., 2006; Nye et al., 2018), annotations are collected for the Participants (e.g., "patients with headache"), Interventions (e.g., "ibuprofen"), and Outcomes (e.g., "pain reduction") of medical studies. Due to the complexity of this task, PIO annotations were initially only collected on a binary sentence level (Kim et al., 2011), where each sentence was labeled according to whether it contained a P, I, or O. Recently, fine-grained text span annotations were collected (Nye et al., 2018), with annotators highlighting the exact text phrases within a sentence that describe P, I, or O. However, using standard task-level training for this task resulted in non-expert workers reaching only weak agreement with experts (Nye et al., 2018). To evaluate Dexa, we apply it in the setting of Nye et al. (2018), where we augment task-level examples with dynamic task-instance level examples.
We consider the 191 clinical trial reports of the EBM-NLP corpus (Nye et al., 2018), where for each trial and PIO element gold standard labels are available. The reports originate from PubMed and consist of a title and an abstract. As preprocessing steps, we use Stanford CoreNLP (Manning et al., 2014) to segment the reports into sentences and NLTK (Bird et al., 2009) to tokenize the sentences. Next, we split the 191 reports into a test set for evaluation (41 reports with 426 sentences) and a training set (150 reports with 1,636 sentences), from which dynamic examples are retrieved for the Dexa approach. Note that the test sentences are usually injected into a much larger set for which no gold labels are available (see Step (1) in Section 3); however, in this study, we aim to evaluate our annotation approach, and therefore only sentences for which gold standard labels exist are annotated.
4.2. Annotation Setup
We follow the annotation setup described in (Nye et al., 2018) with crowdworkers hired from Amazon Mechanical Turk (AMT). Annotations for P, I, and O are divided into three individual sub-tasks to reduce the cognitive overhead for workers. For each sub-task, annotation instructions and a few task-level examples are provided to workers; these are available as an appendix in (Nye et al., 2018). Workers are allowed to participate in one of the sub-tasks if their work approval rate for previous tasks is at least 90%. A small-scale test run is performed to filter out spammers and workers who do not follow the task instructions. Workers who pass the test run qualify for the full-scale run.
4.3. Dexa Approach
Within the annotation setup described above, we apply the Dexa approach to collect non-expert labels for the 426 test sentences. We develop an annotation interface that can be embedded as a design layout in the AMT platform. In each Human Intelligence Task (HIT), we ask workers to annotate, depending on the sub-task, either P, I, or O within a sentence. For each sentence, we present three dynamic examples, and we acquire feedback from workers on whether they found at least one of these examples useful to support their annotation work.
The dynamic examples that we show to support annotators are retrieved from the training set using the sentence embedding model BioSent2Vec (Chen et al., 2019). Specifically, we compute the cosine similarity between the BioSent2Vec embeddings of the sentence to be annotated and of each training sentence. We use BioSent2Vec since (i) it is the state of the art for various short-text similarity tasks in the biomedical domain, and (ii) a pre-trained model is available (https://github.com/ncbi-nlp/BioSentVec) that was trained on PubMed (Chen et al., 2019), which is the same underlying data source as the clinical trial reports used in our study.
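The retrieval step can be illustrated with a short, dependency-free Python sketch. It assumes that sentence embeddings have already been computed and are given as plain lists of floats; loading the pre-trained model and embedding the sentences is deliberately left out, and the names `cosine_similarity` and `top_k_similar` are hypothetical helpers rather than part of any library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def top_k_similar(query_vec, train_vecs, k=3):
    """Indices of the k training embeddings most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), idx)
        for idx, vec in enumerate(train_vecs)
    ]
    # Sort by descending similarity; ties are broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]
```

With 1,636 training sentences, an exhaustive comparison per annotated sentence as above is cheap; for much larger training sets, precomputing normalized embeddings or using an approximate nearest-neighbour index would be a natural refinement.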
Table 1: Sample sentences of the test set (S) and the corresponding most similar dynamic example of the training set (D).
| S | We performed a randomized, controlled study comparing the prophylactic effects of capsule forms of fluconazole (n=110) and itraconazole (n=108) in patients with acute myeloid leukemia (AML) or myelodysplastic syndromes (MDS) during and after chemotherapy. |
| D | A randomized, double-blind, placebo-controlled study on the immediate clinical and microbiological efficacy of doxycycline (100mg for 14 days) was carried out to determine the benefit of adjunctive medication in 16 patients with localized juvenile periodontitis. |
| S | Adverse events did not significantly differ in the 2 groups. |
| D | There were no serious adverse events. |
| S | The majority (63%) of the project group had no admission during the 10 month study period. |
| D | Referral occurred at any stage of the patients’ EECU admission. |
In Table 1, we illustrate three sample sentences of the test set and the corresponding most similar dynamic example of the training set. The first case shows that the dynamic example provides strong support in annotating P, I, and O, even though the sentence is rather complex and long. The middle case shows a dynamic example that provides support in annotating the O element. Finally, the last case shows that no appropriate dynamic example is found for the sample. In such cases, workers need to decide independently.
4.4. Control Approach
We compare annotations obtained via the Dexa approach to the non-expert annotations that were previously obtained in the scope of (Nye et al., 2018) using the Control approach (annotations of individual crowdworkers were downloaded from https://github.com/bepnye/EBM-NLP). Note that we decided to re-use the available annotations rather than re-collecting them, since the same annotation setup is followed in both approaches (Section 4.2). Although a different annotation interface was used by Nye et al. (2018), the interaction component of clicking a start and end word to annotate a P, I, or O text phrase is identical in both approaches.
We compare the Dexa approach to the Control approach based on the 426 sentences of the test set. For the Control approach, at least 3 redundant non-expert annotations are available per sentence, although the average is higher. For the Dexa approach, we collect exactly 3 redundant annotations per sentence, resulting in a total of 1,278 (3 × 426) sentence annotations per PIO sub-task. In total, 26 workers contributed to annotating the test set using the Dexa approach. In contrast, 403 workers contributed to the Control approach (Nye et al., 2018), because of (i) the goal of collecting more redundant samples and (ii) an additional goal of labelling clinical trials only by non-experts.
Agreement of Individuals
We compare annotations obtained by Dexa and Control based on the inter-annotator agreement with the gold standard annotations using Cohen’s Kappa. To eliminate random noise from workers who labeled only a few sentences, we do not analyze workers of Dexa and Control who labeled less than 5% of the total of 426 test sentences (about 21 sentences).
The results in Figure 1 show that the median agreement with experts is substantially higher in the Dexa approach than in the Control approach, especially for I and O. Notice that the kappa scores of workers of the Control approach range from 0.0 to nearly 1.0, probably affected by the higher number of redundantly collected labels. Notable is also the one worker of the Dexa approach who underperformed in the I sub-task compared to the other workers, illustrated as a dot.
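As an illustration of this measurement, the following sketch computes Cohen’s Kappa between one worker and the gold standard and applies the 5% participation filter. It assumes, for illustration only, that annotations are compared as parallel sequences of per-token labels (e.g., 1 for a token inside a highlighted phrase, 0 otherwise); the function names and the data layout are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two equally long label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of positions with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from the two marginal label distributions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # convention chosen here: both used one identical label
    return (observed - expected) / (1.0 - expected)


def eligible_workers(sentences_per_worker, total_sentences, min_fraction=0.05):
    """Workers who labeled at least `min_fraction` of all test sentences."""
    threshold = min_fraction * total_sentences
    return {w for w, n in sentences_per_worker.items() if n >= threshold}
```

For 426 test sentences the threshold is 0.05 × 426 = 21.3, so a worker needs at least 22 labeled sentences to be included in the analysis under this reading of the filter.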
Agreement of Aggregations
We analyze the quality of aggregated annotations obtained with two aggregation strategies: majority voting (MV), which weights individual workers equally, and the Dawid-Skene (DS) model (Dawid and Skene, 1979), which recognizes reliable annotators and weights them more strongly in the aggregation procedure. For Dexa, we aggregate the 3 available labels, while for Control, we create aggregations from different numbers of labels, as follows: (i) for each task-instance, we randomly pick the given number of annotations from the available redundant ones; (ii) we repeat step (i) 20 times and compute the agreement κ with the gold standard label at each iteration; (iii) the 20 κ values are averaged to compute the final κ.
The results in Table 2 show that the κ score between non-experts and experts is in almost all cases higher for aggregated annotations obtained with Dexa compared to Control. Even when 6, 9, or all labels of the Control approach are aggregated, the 3 aggregated annotations of the Dexa approach reach a higher agreement with the gold standard for the I and O sub-tasks. Only for P did we observe that the aggregation of all redundant Control annotations surpasses the κ score of our Dexa approach.
Notable is the effectiveness of the DS aggregation for the redundant labels of the Control approach. Especially for P, the high agreements of individual workers using the Control approach (Figure 1) lead to a strong aggregated result via DS (Table 2). No improvements are observed when using DS over MV to aggregate the 3 redundant Dexa labels, which is expected since the noise of individual annotators is low (as shown in Figure 1).
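The resampling procedure for the Control annotations can be sketched as follows. This is an illustrative sketch under assumptions: each item is treated as a single unit with a list of redundant labels, ties in the majority vote are broken towards the smallest label, and all function names are hypothetical; how ties were actually resolved is not stated in the text.

```python
import random
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two equally long, non-empty label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)


def majority_vote(labels):
    """Most frequent label; ties go to the smallest label (an assumption)."""
    counts = Counter(labels)
    best = max(counts.values())
    return min(label for label, count in counts.items() if count == best)


def resampled_mv_kappa(annotations, gold, n_labels, repeats=20, seed=0):
    """Average Kappa of majority votes over random subsets of n_labels."""
    rng = random.Random(seed)
    kappas = []
    for _ in range(repeats):
        aggregated = [
            # (i) pick n_labels of the redundant annotations at random
            majority_vote(rng.sample(labels, min(n_labels, len(labels))))
            for labels in annotations
        ]
        # (ii) agreement with the gold standard for this draw
        kappas.append(cohens_kappa(aggregated, gold))
    # (iii) average over all repetitions
    return sum(kappas) / len(kappas)
```

Averaging over repeated draws reduces the dependence of the reported agreement on which particular annotations happen to be selected.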
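To make the contrast with majority voting concrete, here is a compact sketch of Dawid-Skene-style aggregation via expectation maximization. It is a simplified illustration, not the configuration used in the experiments: the input layout (one list of `(worker, label)` pairs per item, each with at least one vote), the fixed iteration count, and the additive `smoothing` constant are all assumptions made for this example.

```python
def dawid_skene(votes, n_classes, n_iter=50, smoothing=0.01):
    """Aggregate labels with a basic Dawid-Skene EM; returns one label per item."""
    workers = sorted({w for item in votes for w, _ in item})
    # Initialise class posteriors per item from the vote proportions.
    post = []
    for item in votes:
        counts = [0.0] * n_classes
        for _, label in item:
            counts[label] += 1.0
        total = sum(counts)
        post.append([c / total for c in counts])

    for _ in range(n_iter):
        # M-step: class priors and per-worker confusion matrices.
        prior = [sum(p[c] for p in post) / len(post) for c in range(n_classes)]
        conf = {
            w: [[smoothing] * n_classes for _ in range(n_classes)]
            for w in workers
        }
        for p, item in zip(post, votes):
            for w, label in item:
                for c in range(n_classes):
                    conf[w][c][label] += p[c]
        for w in workers:
            for c in range(n_classes):
                row_sum = sum(conf[w][c])
                conf[w][c] = [v / row_sum for v in conf[w][c]]

        # E-step: recompute posteriors given priors and confusion matrices.
        new_post = []
        for item in votes:
            scores = []
            for c in range(n_classes):
                score = prior[c]
                for w, label in item:
                    score *= conf[w][c][label]
                scores.append(score)
            z = sum(scores)
            new_post.append([s / z for s in scores])
        post = new_post

    return [max(range(n_classes), key=lambda c: p[c]) for p in post]
```

Because each worker's confusion matrix is estimated from all items that worker labeled, the model benefits from many labels per worker, which is consistent with DS helping more on the larger Control label pool than on three Dexa labels per sentence.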
Table 2: Cohen’s Kappa (κ) between aggregated non-expert annotations and the gold standard.
We analyze the feedback from workers on the usefulness of the dynamic examples in Table 3. The results show a high percentage of positive answers for all annotation tasks, especially for the more difficult tasks I (78%) and O (76%). Additionally, the (perceived) usefulness of the examples has an effect on the quality of the annotations. Indeed, the agreement of individual annotators (excluding workers who annotated less than 5% of the 426 test sentences) with the gold standard labels is on average much higher when the dynamic examples were found useful than otherwise.
Table 3: Worker feedback on the usefulness of the dynamic examples (columns: Feedback, Percentage, Cohen’s Kappa (κ)).
We presented the Dexa annotation approach in which non-expert annotators are supported not only by task-level annotation examples (as in Control) but also by dynamic, task-instance level examples that are semantically similar to the currently annotated sample. Evaluating Dexa on the PIO task led to: (i) improved quality of individual annotations: individual annotator agreement with expert annotations was on average higher for Dexa than for Control; (ii) improved aggregated label quality: three Dexa annotations aggregated by majority voting reached on average higher agreement with experts than in the Control approach; (iii) explicit validation of dynamic example usefulness: workers found the proposed examples useful in the majority of cases (avg. 73% over PIO tasks), and label quality was consistently higher for cases when the examples were judged useful than otherwise.
As future work, we will (i) optimize the number of dynamic examples shown per sample; (ii) investigate the effectiveness of different similarity methods for selecting examples through A/B testing; and (iii) evaluate Dexa on different domains and annotation tasks.
- Bird et al. (2009). Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media.
- Chen et al. (2019). BioSentVec: creating sentence embeddings for biomedical texts. In Proc. of ICHI.
- Daniel et al. (2018). Quality Control in Crowdsourcing: A Survey of Quality Attributes, Assessment Techniques, and Assurance Actions. ACM Computing Surveys (CSUR) 51 (1), pp. 7:1–7:40.
- Dawid and Skene (1979). Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics) 28 (1), pp. 20–28.
- Doroudi et al. (2016). Toward a Learning Science for Complex Crowdsourcing Tasks. In Proc. of CHI.
- Huang et al. (2006). Evaluation of PICO as a Knowledge Representation for Clinical Questions. Proc. of AMIA.
- Kim et al. (2011). Automatic classification of sentences to support Evidence Based Medicine. BMC Bioinformatics 12 (2), pp. S5.
- Kobayashi et al. (2018). An empirical study on short- and long-term effects of self-correction in crowdsourced microtasks. In Proc. of AAAI.
- Manning et al. (2014). The Stanford CoreNLP Natural Language Processing Toolkit. In Proc. of ACL.
- Nye et al. (2018). A Corpus with Multi-Level Annotations of Patients, Interventions and Outcomes to Support Language Processing for Medical Literature. In Proc. of ACL.
- Sabou et al. (2012). Crowdsourcing research opportunities: lessons from natural language processing. In Proc. of i-KNOW.
- Sheng et al. (2008). Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proc. of SIGKDD.
- Snow et al. (2008). Cheap and fast – but is it good? Evaluating non-expert annotations for natural language tasks. In Proc. of EMNLP.
- Suzuki et al. (2016). Atelier: Repurposing expert crowdsourcing tasks as micro-internships. In Proc. of CHI.