Crowdsourcing a High-Quality Gold Standard for QA-SRL

11/08/2019 ∙ by Paul Roit, et al.

Question-answer driven Semantic Role Labeling (QA-SRL) has been proposed as an attractive open and natural form of SRL, easily crowdsourceable for new corpora. Recently, a large-scale QA-SRL corpus and a trained parser were released, accompanied by a densely annotated dataset for evaluation. Trying to replicate the QA-SRL annotation and evaluation scheme for new texts, we observed that the resulting annotations were lacking in quality and coverage, particularly insufficient for creating gold standards for evaluation. In this paper, we present an improved QA-SRL annotation protocol, involving crowd-worker selection and training, followed by data consolidation. Applying this process, we release a new gold evaluation dataset for QA-SRL, yielding more consistent annotations and greater coverage. We believe that our new annotation protocol and gold standard will facilitate future replicable research of natural semantic annotations.




1 Introduction

Semantic Role Labeling (SRL) provides explicit annotation of predicate-argument relations, which have been found useful in various downstream tasks Shen and Lapata (2007); Chen et al. (2013); Wang et al. (2015); Marcheggiani et al. (2018). Question-Answer driven Semantic Role Labeling (QA-SRL) He et al. (2015) is an SRL scheme in which roles are captured by natural language questions, while arguments represent their answers, making the annotations intuitive, semantically rich, and easily attainable by laymen. For example, in Table 1, the question Who cut something captures the traditional “agent” role.

Previous attempts to annotate QA-SRL initially involved trained annotators He et al. (2015) but later resorted to crowdsourcing Fitzgerald et al. (2018) to achieve scalability. Naturally, employing crowd workers raises challenges when annotating semantic structures like SRL. As Fitzgerald et al. (2018) acknowledged, the main shortcoming of the large-scale 2018 dataset is its lack of recall, estimated by experts to be in the lower 70s.

In light of this and other annotation inconsistencies, we propose an improved QA-SRL crowdsourcing protocol for high-quality annotation, allowing for substantially more reliable performance evaluation of QA-SRL parsers. To address worker quality, we systematically screen workers, provide concise yet effective guidelines, and perform a short training procedure, all within a crowdsourcing platform. To address coverage, we employ two independent workers plus an additional one for consolidation, similar to conventional expert-annotation practices. In addition to yielding 25% more roles, our coverage gain is demonstrated by evaluating against expertly annotated data and by comparison with PropBank (Section 4). To foster future research, we release an assessed high-quality gold dataset along with our reproducible protocol and evaluation scheme, and report the performance of the existing parser Fitzgerald et al. (2018) as a baseline.

2 Background — QA-SRL


In QA-SRL, a role question adheres to a 7-slot template, with slots corresponding to a WH-word, the verb, auxiliaries, argument placeholders (SUBJ, OBJ), and prepositions, where some slots are optional He et al. (2015) (see appendix for examples). Such a question captures the corresponding semantic role with a natural, easily understood expression. The set of all non-overlapping answers for the question is then considered as the set of arguments associated with that role. This broad question-based definition of roles captures traditional cases of syntactically-linked arguments, but also additional semantic arguments clearly implied by the sentence meaning (see example (2) in Table 1).
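To make the template concrete, the following minimal sketch represents a question as seven optional slots and renders it back to text. The slot names (`wh`, `aux`, `subj`, `verb`, `obj`, `prep`, `obj2`) and the `QuestionTemplate` class are illustrative assumptions, not the official schema of the QA-SRL tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuestionTemplate:
    """Hypothetical 7-slot container; slot names are assumptions."""
    wh: str
    aux: str = ""
    subj: str = ""
    verb: str = ""
    obj: str = ""
    prep: str = ""
    obj2: str = ""

    def render(self) -> str:
        # Join the non-empty slots in template order and close the question.
        slots = [self.wh, self.aux, self.subj, self.verb,
                 self.obj, self.prep, self.obj2]
        return " ".join(s for s in slots if s) + "?"


# Questions (2) and (3) from Table 1, split into slots by hand.
q2 = QuestionTemplate(wh="Why", aux="was", subj="something",
                      verb="cut", prep="by", obj2="someone")
q3 = QuestionTemplate(wh="Who", verb="cut", obj="something")

print(q2.render())
print(q3.render())
```

Leaving unused slots empty mirrors the statement that some slots are optional.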


The original 2015 QA-SRL dataset He et al. (2015) was annotated by non-expert workers after completing a brief training procedure. They annotated 7.8K verbs, reporting an average of 2.4 QA pairs per predicate. Even though multiple annotators were shown to produce greater coverage, their released dataset was produced using only a single annotator per verb. In subsequent work, Fitzgerald et al. (2018) constructed a large-scale corpus and used it to train a parser. They crowdsourced 133K verbs with 2.0 QA pairs per verb on average. Since crowd-workers had no prior training, quality was established using an additional validation step, where workers had to ascertain the validity of the question, but not of its answers. Instead, the validator provided additional answers, independent of the other annotators. Each verb in the corpus was annotated by a single QA-generating worker and validated by two others.

In a reserved part of the corpus (Dense), targeted for parser evaluation, verbs were densely validated with 5 workers, approving questions judged as valid by at least 4/5 validators. Notably, adding validators to the Dense annotation pipeline accounts mostly for precision errors, while role coverage relies solely upon the single generator’s set of questions. As both the 2015 and 2018 datasets use a single question generator, both struggle with maintaining coverage. Also noteworthy is that while traditional SRL annotations contain a single authoritative and non-redundant annotation, the 2018 dataset provides the raw annotations of all annotators. These include many overlapping or noisy answers, without settling on consolidation procedures to provide a single gold reference.

We found that these characteristics of the dataset impede its utility for future development of parsers.

Around 47 people could be arrested, including the councillor.
(1) Who might be arrested? 47 people | the councillor
Perry called for the DA’s resignation, and when she did not resign, cut funding to a program she ran.
(2) Why was something cut by someone? she did not resign
(3) Who cut something? Perry
Table 1: Running examples of QA-SRL annotations; this set is a sample of the possible questions that can be asked. The bar (|) separates multiple selected answers.

3 Annotation and Evaluation Methods

3.1 Crowdsourcing Methodology

Screening and Training

Our pool of annotators is selected after several short training rounds, with up to 15 predicates per round, in which they received extensive personal feedback. 1 out of 3 participants were selected after exhibiting good performance, tested against expert annotations.


We adopt the annotation machinery of Fitzgerald et al. (2018) implemented using Amazon’s Mechanical Turk, and annotate each predicate by 2 trained workers independently, while a third consolidates their annotations into a final set of roles and arguments. In this consolidation task, the worker validates questions, merges, splits or modifies answers for the same role according to guidelines, and removes redundant roles by picking the more naturally phrased questions. For example, in Table 1 ex. 1, one worker could have chosen “47 people”, while another chose “the councillor”; in this case the consolidator would include both of those answers. In Section 4, we show that this process yields better coverage. (While our consolidator views two full QA sets, the validator from Fitzgerald et al. (2018) viewed only the questions of a single generator.) For example annotations, please refer to the appendix.

Guidelines Refinements

We refine the previous guidelines by emphasizing several semantic features: correctly using modal verbs and negations in the question, and choosing answers that coincide with a single entity (example 1 in Table 1).

Data & Cost

We annotated a sample taken from the Dense set on Wikinews and Wikipedia domains, each with 1000 sentences, equally divided between development and test. QA generating annotators are paid the same as in Fitzgerald et al. (2018), while the consolidator is rewarded 5¢ per verb and 3¢ per question. Per predicate, on average, our cost is 54.2¢, yielding 2.9 roles, compared to the reported 2.3 valid roles with an approximated cost of 51¢ per predicate for Dense.
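The per-predicate figures above can also be read as a cost per role. The short calculation below only derives ratios from the numbers quoted in this paragraph; the variable names are illustrative, and the derived values are not figures reported elsewhere in the paper.

```python
# Figures quoted in the paragraph above (cents and roles per predicate).
ours_cost, ours_roles = 54.2, 2.9
dense_cost, dense_roles = 51.0, 2.3

# Derived quantities: average cost of one role, and relative role yield.
ours_cost_per_role = ours_cost / ours_roles
dense_cost_per_role = dense_cost / dense_roles
relative_yield = ours_roles / dense_roles

print(f"cost per role, this protocol: {ours_cost_per_role:.1f} cents")
print(f"cost per role, Dense:         {dense_cost_per_role:.1f} cents")
print(f"roles per predicate, ratio:   {relative_yield:.2f}")
```

Under these numbers the protocol is slightly more expensive per predicate but cheaper per role, since each predicate yields more roles.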

3.2 Evaluation Metrics

Evaluation in QA-SRL involves aligning predicted and ground truth argument spans and evaluating role label equivalence. Since detecting question paraphrases is still an open challenge, we propose both unlabeled and labeled evaluation metrics.

Unlabeled Argument Detection (UA) Inspired by the method presented in Fitzgerald et al. (2018), arguments are matched using a span matching criterion on intersection over union (IOU). To credit each argument only once, we employ maximal bipartite matching between the two sets of arguments, drawing an edge for each pair that passes the above-mentioned criterion. (The previous approach aligned arguments to roles. We measure argument detection, whereas Fitzgerald et al. (2018) measure role detection.) The resulting maximal matching determines the true-positive set, while the remaining non-aligned arguments become false positives or false negatives.
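The UA procedure can be sketched as follows. This is a minimal illustration, not the released evaluation code: spans are assumed to be end-exclusive token offsets, the function names are hypothetical, and the default threshold of 0.5 is a placeholder assumption because the text above does not state the value.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end-exclusive (assumption)


def iou(a: Span, b: Span) -> float:
    """Intersection over union of two token spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def max_bipartite_matching(pred: List[Span], gold: List[Span],
                           threshold: float) -> List[Tuple[int, int]]:
    """Maximum matching via augmenting paths (Kuhn's algorithm)."""
    # An edge links a predicted and a gold span that pass the IOU criterion.
    edges = [[j for j, g in enumerate(gold) if iou(p, g) >= threshold]
             for p in pred]
    match_of_gold = [-1] * len(gold)

    def try_assign(i: int, seen: set) -> bool:
        for j in edges[i]:
            if j in seen:
                continue
            seen.add(j)
            # Take a free gold span, or re-route its current partner.
            if match_of_gold[j] == -1 or try_assign(match_of_gold[j], seen):
                match_of_gold[j] = i
                return True
        return False

    for i in range(len(pred)):
        try_assign(i, set())
    return [(i, j) for j, i in enumerate(match_of_gold) if i != -1]


def unlabeled_prf(pred: List[Span], gold: List[Span],
                  threshold: float = 0.5) -> Tuple[float, float, float]:
    """Precision, recall and F1; matched pairs are the true positives."""
    tp = len(max_bipartite_matching(pred, gold, threshold))
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


gold_spans = [(0, 2), (5, 7)]
pred_spans = [(0, 2), (0, 3), (9, 10)]
precision, recall, f1 = unlabeled_prf(pred_spans, gold_spans)
```

In this toy input both `(0, 2)` and `(0, 3)` pass the criterion against the first gold span, but the matching credits that gold span only once, so one true positive is counted.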

Labeled Argument Detection (LA) All aligned arguments from the previous step are inspected for label equivalence, similar to the joint evaluation reported in Fitzgerald et al. (2018). There may be many correct questions for a role. For example, What was given to someone? and What has been given by someone? both refer to the same semantic role but diverge in grammatical tense, voice, and presence of a syntactical object or subject. Aiming to avoid judging non-equivalent roles as equivalent, we propose Strict-Match to be an equivalence on the following template slots: WH, SUBJ, OBJ, as well as on negation, voice, and modality (presence of factuality-changing modal verbs such as should, might and can) extracted from the question. Final reported numbers on labeled argument detection rates are based on bipartite aligned arguments passing Strict-Match. We later manually estimate the rate of correct equivalences missed by this conservative method.
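A Strict-Match comparison along these lines might look like the sketch below. It assumes each question has already been analyzed into a dictionary of features; the field names and the hand-written feature values are hypothetical, and how negation, voice and modality are extracted from a question is outside this sketch.

```python
# Fields compared by the sketch; any other field (e.g. tense) is ignored.
STRICT_FIELDS = ("wh", "subj", "obj", "negated", "passive", "modal")


def strict_key(question: dict) -> tuple:
    """Project a question's features onto the compared fields."""
    return tuple(question.get(field) for field in STRICT_FIELDS)


def strict_match(q1: dict, q2: dict) -> bool:
    """True when both questions agree on every compared field."""
    return strict_key(q1) == strict_key(q2)


# "Who cut something?" vs. "Who has cut something?": tense differs only.
who_cut = dict(wh="who", subj=None, obj="something",
               negated=False, passive=False, modal=False, tense="past")
who_has_cut = dict(who_cut, tense="present perfect")

# "What contains something?" vs. "What might contain something?" (Table 6).
what_contains = dict(wh="what", subj=None, obj="something",
                     negated=False, passive=False, modal=False)
what_might_contain = dict(what_contains, modal=True)

print(strict_match(who_cut, who_has_cut))
print(strict_match(what_contains, what_might_contain))
```

The first pair matches because tense is not among the compared fields, while the second pair fails on modality.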

As we will see, our evaluation heuristics, adapted from those in Fitzgerald et al. (2018), significantly underestimate agreement between annotations, hence reflecting performance lower bounds. Devising tighter evaluation measures remains a challenge for future research.

Evaluating Redundant Annotations

We extend our metric to evaluate manual or automatic redundant annotations, like the Dense dataset or the parser in Fitzgerald et al. (2018), which predicts argument spans independently of each other. To that end, we ignore predicted arguments that match ground-truth but are not selected by the bipartite matching due to redundancy. After connecting unmatched predicted arguments that overlap, we count one false positive for every connected component to avoid penalizing precision too harshly when predictions are redundant. (Note that thanks to consolidation, the arguments in our reference gold are non-overlapping.)
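The false-positive counting step can be illustrated with a small union-find sketch. It assumes the input already contains only the predicted spans that match no gold argument, represented as end-exclusive token offsets; the function names are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end-exclusive (assumption)


def overlaps(a: Span, b: Span) -> bool:
    """True when the two spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def count_false_positives(unmatched: List[Span]) -> int:
    """One false positive per connected component of overlapping spans."""
    parent = list(range(len(unmatched)))

    def find(x: int) -> int:
        # Find the component representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(len(unmatched)):
        for j in range(i + 1, len(unmatched)):
            if overlaps(unmatched[i], unmatched[j]):
                parent[find(i)] = find(j)

    return len({find(i) for i in range(len(unmatched))})


# Three chained overlapping spans form one component; the last stands alone.
redundant_predictions = [(0, 2), (1, 4), (3, 5), (8, 9)]
false_positives = count_false_positives(redundant_predictions)
```

Note that `(0, 2)` and `(3, 5)` do not overlap directly but fall into the same component through `(1, 4)`, so the four spans count as two false positives rather than four.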

4 Dataset Quality Analysis

Inter-Annotator Agreement (IAA)

To estimate dataset consistency across different annotations, we measure F1 using our UA metric with 5 generators per predicate. Individual worker-vs-worker agreement yields 79.8 F1 over 10 experiments with 150 predicates, indicating high consistency across our annotators, in line with results for other structured semantic annotations (e.g. Abend and Rappoport (2013)). Overall consistency of the dataset is assessed by measuring agreement between different consolidated annotations, obtained by disjoint triplets of workers, which achieves F1 of 84.1 over 4 experiments, each with 35 distinct predicates. Notably, consolidation boosts agreement, suggesting it is a necessity for semantic annotation consistency.

Dataset Assessment and Comparison

We assess both our gold standard set and the recent Dense set against an integrated expert annotated sample of 100 predicates. To construct the expert set, we blindly merged the Dense set with our worker annotations and manually corrected them. We further corrected the evaluation decisions, accounting for some automatic evaluation mistakes introduced by the span-matching and question paraphrasing criteria. As seen in Table 2, our gold set yields comparable precision with significantly higher recall, which is in line with our 25% higher yield.

              This work             Dense (2018)
              P     R     F1        P     R     F1
UA   Auto.    79.9  89.4  84.4      67.1  69.5  68.3
UA   Man.     88.0  95.5  91.6      86.4  70.5  77.6
LA   Auto.    71.0  79.5  75.0      49.5  51.3  50.4
LA   Man.     88.0  95.5  91.6      83.1  67.8  74.7
Table 2: Automatic and manually-corrected evaluation of our gold standard and Dense Fitzgerald et al. (2018) against the expert annotated sample.

Examining disagreements between our gold and Dense, we observe that our workers successfully produced more roles, both implied and explicit. To a lesser extent, they split more arguments into independent answers, as emphasized by our guidelines, an issue which was left under-specified in the previous annotation guidelines.

Agreement with PropBank Data

It is illuminating to observe the agreement between QA-SRL and PropBank (CoNLL-2009) annotations Hajič et al. (2009). In Table 3, we replicate the experiments in (He et al., 2015, Section 3.4) for both our gold set and theirs, over a sample of 200 sentences from the Wall Street Journal (agreement evaluation is automatic and the metric is somewhat similar to our UA). We report macro-averaged (over predicates) precision and recall for all roles, including core and adjuncts, while considering the PropBank data as the reference set. (Core roles are A0-A5 in PropBank (recall) and QAs having what and who WH-words in QA-SRL (precision).) Our recall of the PropBank roles is notably high, reconfirming the coverage obtained by our annotation protocol.

The measured precision with respect to PropBank is low for adjuncts due to the fact that our annotators were capturing many correct arguments not covered in PropBank. To examine this, we analyzed 100 false positive arguments. Only 32 of those were due to wrong or incomplete QA annotations in our gold, while most others were outside of PropBank’s scope, capturing either implied arguments or roles not covered in PropBank. Extrapolating from this manual analysis estimates our true precision (on all roles) to be about 91%, which is consistent with the 88% precision figure in Table 2. Compared with 2015, our QA-SRL gold yielded 1593 annotations, with 989 core and 604 adjuncts, while theirs yielded 1315 annotations, 979 core and 336 adjuncts. Overall, the comparison to PropBank reinforces the quality of our gold dataset and shows its better coverage relative to the 2015 dataset.
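The text does not spell out how the 91% estimate is computed. One plausible reading, sketched below, scales the measured false-positive rate by the share of sampled false positives that were genuine annotation errors. The variable names are illustrative, and treating the macro-averaged precision as a simple rate is an approximation.

```python
# Figures quoted above and in Table 3 ("All" row, this work).
measured_precision = 0.733   # precision against PropBank, all roles
sampled_false_positives = 100
genuine_errors = 32          # wrong or incomplete QA annotations in the sample

# Share of predictions flagged as false positives by the automatic metric.
measured_fp_rate = 1.0 - measured_precision

# Assume the sample is representative: only this share are real errors.
genuine_error_share = genuine_errors / sampled_false_positives
estimated_true_fp_rate = measured_fp_rate * genuine_error_share

estimated_true_precision = 1.0 - estimated_true_fp_rate
print(f"estimated true precision: {estimated_true_precision:.1%}")
```

Under this reading the estimate lands at roughly 91%, matching the figure stated in the text.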

         This work             He et al. (2015)
         P     R     F1        P     R     F1
All      73.3  93.0  82.0      81.7  86.6  84.1
Core     87.3  94.8  90.9      86.6  90.4  88.5
Adj.     43.4  85.9  57.7      59.7  64.7  62.1
Table 3: Performance analysis against PropBank. Precision, recall and F1 for all roles, core roles, and adjuncts.

5 Baseline Parser Evaluation

To illustrate the effectiveness of our new gold-standard, we use its Wikinews development set to evaluate the currently available parser from Fitzgerald et al. (2018). For each predicate, the parser classifies every span for being an argument, independently of the other spans. Unlike many other SRL systems, this policy often produces outputs with redundant arguments (see appendix for examples). Results for ~1200 predicates are reported in Table 4, demonstrating reasonable performance along with substantial room for improvement, especially with respect to coverage. As expected, the parser’s recall against our gold is substantially lower than the 84.2 recall reported in Fitzgerald et al. (2018) against Dense, due to the limited recall of Dense relative to our gold set.

       Automatic             Manual
       P     R     F1        P     R     F1
UA     86.6  58.8  70.1      87.8  66.5  75.5
LA     65.0  44.2  52.6      83.9  64.3  72.8
Table 4: Automatic and manual parser evaluation against 500 Wikinews sentences from the gold dataset. Manual is evaluated on 50 sampled predicates.

Error Analysis

We sample and evaluate 50 predicates to detect correct argument and paraphrase pairs that are skipped by the IOU and Strict-Match criteria. Based on this inspection, the parser completely misses 23% of the 154 roles present in the gold-data, out of which, 17% are implied. While the parser correctly predicts 82% of non-implied roles, it skips half of the implied ones.

6 Conclusion

We introduced a refined crowdsourcing pipeline and a corresponding evaluation methodology for QA-SRL. It enabled us to release a new gold standard for evaluations, with notably higher coverage of core and implied roles than the previous Dense evaluation dataset. We believe that our annotation methodology and dataset will facilitate future research on natural semantic annotations and QA-SRL parsing.

7 Supplemental Material

7.1 The Question Template

For completeness, we include several example questions decomposed into their 7 template slots in Table 5.

Why was something cut by someone ?
Why did someone cut something ?
Who might be arrested ?
Table 5: Examples for the question template corresponding to the 7 slots, first two examples are paraphrases.

7.2 Annotation Pipeline

As described in Section 3, the consolidator receives two sets of QA annotations and merges them according to the guidelines to produce an exhaustive and consistent QA set. See Table 6 for examples.

A1: Who identified something? The U.S. Geological Survey (USGS)
A2: Who identified something? The U.S. Geological Survey
C: Who identified something? The U.S. Geological Survey | USGS
A1: What might contain something? that basin
A2: What contains something? that basin
C: What might contain something? that basin
Table 6: The consolidation task – A1, A2 refer to the original annotator QAs, C refers to the consolidator selected question and corrected answers.

7.3 Redundant Parser Output

As mentioned in the paper body, the Fitzgerald et al. parser generates redundant role questions and answers. The first two rows in Table 7 illustrate different, partly redundant, argument spans for the same question. The next two rows illustrate two paraphrased questions for the same role. Generating such redundant output might complicate downstream use of the parser output as well as evaluation methodology.

What suggests something? Reports
What suggests something? Reports from Minnesota
Where was someone carried? to reclining chairs
What was someone carried to? reclining chairs
Table 7: The parser generates redundant arguments with different paraphrased questions.


  • O. Abend and A. Rappoport (2013) Universal conceptual cognitive annotation (UCCA). In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 228–238. Cited by: §4.
  • Y. Chen, W. Y. Wang, and A. I. Rudnicky (2013) Unsupervised induction and filling of semantic slots for spoken dialogue systems using frame-semantic parsing. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 120–125. Cited by: §1.
  • N. Fitzgerald, J. Michael, L. He, and L. S. Zettlemoyer (2018) Large-scale QA-SRL parsing. In ACL, Cited by: §1, §1, §2, §3.1, §3.2, §3.2, §3.2, §3.2, Table 2, §5, footnote 3, footnote 4.
  • J. Hajič, M. Ciaramita, R. Johansson, D. Kawahara, M. A. Martí, L. Màrquez, A. Meyers, J. Nivre, S. Padó, J. Štěpánek, et al. (2009) The CoNLL-2009 shared task: syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task, pp. 1–18. Cited by: §4.
  • L. He, M. Lewis, and L. S. Zettlemoyer (2015) Question-answer driven semantic role labeling: using natural language to annotate natural language. In EMNLP, Cited by: §1, §1, §2, §2, §4, Table 3.
  • D. Marcheggiani, J. Bastings, and I. Titov (2018) Exploiting semantics in neural machine translation with graph convolutional networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Cited by: §1.
  • D. Shen and M. Lapata (2007) Using semantic roles to improve question answering. In Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL). Cited by: §1.
  • H. Wang, M. Bansal, K. Gimpel, and D. McAllester (2015) Machine comprehension with syntax, frames, and semantics. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Vol. 2, pp. 700–706. Cited by: §1.