Identifying events from a given ontology in text and locating their arguments is an especially challenging task because events vary widely in their textual realizations and their arguments are often spread across multiple clauses or sentences. Most event research has been in the context of the 2005 NIST Automatic Content Extraction (ace) sentence-level event mention task [Walker et al.2006], which also provides the standard corpus. Recently, tac kbp has introduced document-level event argument extraction shared tasks for 2014 and 2015 (kbp ea).
Progress on events since ACE has been limited. Most subsequent work has tried to improve performance through the use of more complex inference [Li et al.2013], by transductively drawing on outside sources of information, or both [Ji and Grishman2008, Hong et al.2011]. Such approaches have produced modest reductions in error over a pipeline of simple classifiers trained on ACE.
In our efforts to improve on the kbp ea 2014 systems, we were stymied by a lack of data, especially for rarer event types. Ten of the 33 event types have fewer than 25 training examples in ace, and even for more frequent events, many trigger words and classes of arguments occur only once. Furthermore, the 2015 task would include new argument types. These problems motivated the following questions: (a) are we at a plateau in the performance vs. annotation time curve? (b) is there a viable alternative to full-document annotation, especially for rarer event types? (c) for novel event types or languages, how quickly can a useful event model be trained?
In traditional annotation, a static corpus selected to be rich in the target event types is annotated. Active learning augments existing training data by having a human oracle annotate system queries (or features [Settles2011]). We explored a novel form of annotation, curated training (ct), in which teachers (annotators) actively seek out informative training examples.
2 Curated Training
In ct the teacher created a prioritized indicator list of words and phrases which could indicate a target event’s presence. Given a tool with a search box, a document list, and a document text pane, teachers searched Gigaword 5 [Parker et al.2011] using Indri [Strohman et al.2005] for indicators in priority order and annotated ten documents for each indicator. On loading a document, they used their browser’s search to locate a single sentence containing the indicator.
If the sentence mentioned multiple instances of the target event or was unclear, it was skipped. If it contained no mention of the event, they marked it negative. (Negated, future, and hypothetical events were all considered mentions of an event, not negatives.) Otherwise, they (a) marked the sentence as event-present; (b) applied the anchor annotation to the tokens whose presence makes the presence of the event likely (if there were no anchors, the document was skipped); (c) marked each argument span within the selected sentence; and (d) marked any other spans they thought might be ‘educational’ as interesting. Step (d) was also done for negative sentences.
Teachers were permitted to annotate extra documents if an indicator seemed ambiguous. They looked very briefly (2-3 seconds) in the context of selected sentences to see if there were additional informative instances to annotate. If any non-indicator anchor was marked, it was added to the indicator list with high priority. The process was repeated for four hours or until the teacher felt additional ct would not be useful. (See the curves in Figure 2; in many cases annotation appears to terminate early because annotators had no way of tracking when they hit exactly four hours.)
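The bookkeeping behind the indicator list can be sketched in a few lines of Python. This is an illustrative sketch only, not the annotation tool's implementation: the class name `IndicatorQueue`, the numeric priorities, and the choice to model "high priority" as priority 0 are all assumptions made for the example.

```python
import heapq
from itertools import count


class IndicatorQueue:
    """Prioritized indicator list; a lower priority number is searched sooner."""

    def __init__(self, indicators):
        self._tie = count()   # keeps insertion order among equal priorities
        self._heap = []
        self._seen = set()
        # Brainstormed indicators get priorities 1, 2, 3, ... in list order.
        for priority, indicator in enumerate(indicators, start=1):
            self.add(indicator, priority)

    def add(self, indicator, priority):
        """Add an indicator unless it has already been queued; return True if added."""
        if indicator in self._seen:
            return False
        self._seen.add(indicator)
        heapq.heappush(self._heap, (priority, next(self._tie), indicator))
        return True

    def add_discovered_anchor(self, anchor):
        # Assumption: "high priority" means ahead of every brainstormed indicator.
        return self.add(anchor, 0)

    def pop(self):
        """Return the next indicator to search for."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = IndicatorQueue(["protest", "rally", "march"])
order = [queue.pop()]                         # teacher searches for "protest" first
queue.add_discovered_anchor("demonstrators")  # non-indicator anchor marked while annotating
while queue:
    order.append(queue.pop())
print(order)
```

In this toy run the anchor discovered while annotating the first indicator is searched before the remaining brainstormed indicators, which is the behavior the procedure describes.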
2.1 Data Gathered
We recruited three teachers without NLP backgrounds but with annotation experience. We consider here only the teacher (a) who completed all event types in time for assessment. Teacher a averaged seven minutes brainstorming indicators and produced 6,205 event presence, 5,137 negative, and 13,459 argument annotations. Every teacher action was time-stamped. For analysis, we updated the timestamps to remove breaks longer than two minutes. (Annotators were never to spend more than 2-3 seconds on any decision.)
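One way to perform this timestamp adjustment is sketched below. The text does not say whether a long break is dropped entirely or capped at two minutes; the sketch assumes it is dropped, and the function name `remove_breaks` and the sample timestamps are hypothetical.

```python
def remove_breaks(timestamps, max_gap=120.0):
    """Convert raw action times (seconds) into elapsed working time.

    Assumption: any gap longer than `max_gap` seconds is a break and
    contributes no working time at all.
    """
    if not timestamps:
        return []
    adjusted = [0.0]
    for previous, current in zip(timestamps, timestamps[1:]):
        gap = current - previous
        adjusted.append(adjusted[-1] + (gap if gap <= max_gap else 0.0))
    return adjusted


raw = [0.0, 30.0, 50.0, 650.0, 680.0]   # a ten-minute break after the third action
adjusted = remove_breaks(raw)
print(adjusted)
```

The adjusted times can then be used as the x-axis of a performance vs. annotation time curve.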
Since the ct was stored as character offsets, we aligned it to parses to get ace-style event mentions. For training our argument attachment models, we omit any event mentions where any annotation failed to project. Projecting Teacher a’s data produced 5,792 event mentions for trigger training, 5,221 for argument training, and 4,954 negatives (cf. roughly 5,300 event mentions in ace).
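The projection step can be illustrated with a small sketch. The paper does not specify how spans were aligned to parses; the version below assumes a span projects only if its start and end coincide exactly with token boundaries. The names `project_span` and `project_mention` are hypothetical.

```python
def project_span(span, token_offsets):
    """Map a (start, end) character span to (first_token, last_token), inclusive.

    Returns None when the span does not line up with token boundaries.
    `token_offsets` is a list of (start, end) character offsets, end-exclusive.
    """
    start, end = span
    first = next((i for i, (s, _) in enumerate(token_offsets) if s == start), None)
    last = next((i for i, (_, e) in enumerate(token_offsets) if e == end), None)
    if first is None or last is None or first > last:
        return None
    return (first, last)


def project_mention(anchor, arguments, token_offsets):
    """Project one annotated event mention and report what it can be used for."""
    projected_anchor = project_span(anchor, token_offsets)
    projected_args = [project_span(a, token_offsets) for a in arguments]
    trigger_ok = projected_anchor is not None
    # A mention is kept for argument training only if every annotation projects.
    argument_ok = trigger_ok and all(p is not None for p in projected_args)
    return {
        "anchor": projected_anchor,
        "arguments": projected_args,
        "use_for_triggers": trigger_ok,
        "use_for_arguments": argument_ok,
    }


# Token offsets for the sentence "Police arrested two protesters".
tokens = [(0, 6), (7, 15), (16, 19), (20, 30)]
good = project_mention((7, 15), [(0, 6), (16, 30)], tokens)
bad = project_mention((7, 15), [(17, 30)], tokens)  # argument starts mid-token
print(good)
print(bad)
```

Under this assumption a mention such as `bad` would still count toward trigger training but not argument training, which is consistent with the smaller number of mentions reported for argument training.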
Our target evaluation task is kbp-ea [NIST2014], which requires mapping a document to a set of tuples, each indicating that an entity plays a given role in an event of a given type with a given realis. Scoring is F1 over these tuples; the details of the scoring are given in [NIST2014]. We evaluate over the 2014 newswire evaluation corpus [Joe Ellis and Strassel2015] using the scorer (https://github.com/isi-nlp/tac-kbp-eal) on the evaluation key augmented with assessments by Teacher a of responses from our system not found therein. (The evaluation answer key had assessments of all 2014 system responses and those of an LDC annotator operating under significant time pressure, thirty minutes per document.) To focus on event detection and argument attachment, we enabled the neutralizeRealis and attemptToNeutralizeCoref scorer options.
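As a simplified illustration of F1 over tuples, consider the sketch below. It is not the official scorer, which handles assessment and coreference details omitted here. The tuple layout (event type, role, entity, realis), the example values, and the function names are assumptions for the example, and `neutralize_realis` only mimics the intent of the scorer option of the same name by dropping the realis field before comparison.

```python
def neutralize_realis(tuples):
    """Drop the realis field so that tuples differing only in realis match."""
    return {(event_type, role, entity) for event_type, role, entity, _ in tuples}


def precision_recall_f1(system, gold):
    """Precision, recall, and F1 of a system tuple set against a gold tuple set."""
    system, gold = set(system), set(gold)
    true_positives = len(system & gold)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = {
    ("Conflict.Demonstrate", "Place", "Cairo", "Actual"),
    ("Conflict.Demonstrate", "Entity", "students", "Actual"),
}
system = {
    ("Conflict.Demonstrate", "Place", "Cairo", "Other"),     # wrong realis only
    ("Conflict.Demonstrate", "Entity", "police", "Actual"),  # wrong entity
}

strict = precision_recall_f1(system, gold)
relaxed = precision_recall_f1(neutralize_realis(system), neutralize_realis(gold))
print(strict, relaxed)
```

With realis neutralized, the first system tuple is credited, so the score reflects event detection and argument attachment rather than realis assignment.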
The highest-performing system in kbp ea 2014, bbn1, ran a pipeline of four log-linear classifiers (trigger detection, argument attachment, genericity assignment, and a trigger-less argument model) in a high-recall mode which output all event mentions and arguments scoring above 10% probability. This output was fed into a series of inference rules and a score was computed based on the sub-model scores and the inference rules applied [Chan et al.2014].
We used this evaluation system for the experiments in this paper with two changes. First, bbn1 used a multi-class model for trigger detection, while we use one binary model per event type because with ct each type has a different set of negative examples. Second, we omitted the ‘trigger-less’ argument classifier for simplicity. This version, baseline, lags bbn1’s performance by 0.8 F1 but outperforms all other 2014 evaluation systems by a large margin.
To compare against full document annotation, we needed to estimate how long the event-only portion of ace annotation took (excluding coreference, etc.). The LDC (personal communication) ventured a rough estimate of 1500 words per hour (about twenty minutes per ace document). The LDC human annotator in kbp-ea 2014 was allocated thirty minutes per document [Freedman and Gabbard2014]. We use the former estimate. To estimate performance with a fraction of ace, we used the first documents of the corpus, up to the percentage needed.
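The arithmetic behind this estimate can be made concrete with a short sketch. Only the rate of 1500 words per hour comes from the text; the function names and the document lengths below are hypothetical.

```python
WORDS_PER_HOUR = 1500  # rough LDC estimate for event-only annotation


def annotation_minutes(word_count, words_per_hour=WORDS_PER_HOUR):
    """Estimated minutes needed to annotate a document of `word_count` words."""
    return 60.0 * word_count / words_per_hour


def documents_within_budget(doc_word_counts, budget_minutes):
    """Number of leading documents whose estimated annotation fits the budget."""
    spent = 0.0
    for n_docs, words in enumerate(doc_word_counts):
        spent += annotation_minutes(words)
        if spent > budget_minutes:
            return n_docs
    return len(doc_word_counts)


per_doc = annotation_minutes(500)                          # a 500-word document
n_docs = documents_within_budget([500, 400, 600], 40.0)   # hypothetical lengths
print(per_doc, n_docs)
```

At this rate, the quoted twenty minutes per document corresponds to a document of about 500 words.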
In aggregate, ct’s performance closely tracked ace for small amounts of mean annotation time per event (Figure 1). However, the performance of ct plateaus more slowly than that of ace, beginning to diverge around ninety minutes per event and continuing to increase sharply at the end of our annotation, leaving unclear what the potential performance of the technique is. When added to ace, the ct improves performance somewhat, reducing P/R/F error by 1%/5%/6% at ninety minutes per event before plateauing. ct has a substantial advantage over ace for event types which are rare in ace, but lags significantly for event types abundant in ace (Figure 2). (The anomalously poor performance on transaction.transfer-money is due to a bug.)
The annotation tool designer and two other NLP experts also did ct for conflict.demonstrate (Figure 3; Table 1). All experts significantly outperformed Teacher a and ace in terms of F1. In two cases this is because the experts sacrificed precision for recall. The second expert matched Teacher a’s precision with much higher recall. Annotators varied widely in the volume of their annotation and indicator searches, but this did not have a clear relationship to performance.
[Table 1. Columns: Teacher a, Designer, Expert 1, Expert 2.]
4.1 Possible Confounding Factors
Because Teacher a both provided ct and did the output assessment, improvements may reflect the system learning their biases. We controlled for this somewhat by having Teacher B dual-assess several hundred responses, resulting in encouraging agreement rates of 95% for event presence, 98% for role selection, and 98% for argument assessment (aet, aer, and bf in kbp ea terms [Joe Ellis and Strassel2014]). For some events, the guidelines changed from ace to kbp ea 2014 by eliminating ‘trumping’ rules and expanding allowable inference, which could also account for some improvement. If either of these were significant factors, it would suggest that ct may be a useful tool for retargeting systems to new, related tasks.
Thanks to Elizabeth Boschee and Dan Wholey for doing annotation. This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. Distribution ‘A’: Approved For Public Release, Distribution Unlimited.
6 Bibliographical References
- [Chan et al.2014] Chan, Y. S., Freedman, M., and Gabbard, R. (2014). BBN’s KBP EA System. In Proceedings of NIST TAC 2014. http://www.nist.gov/tac/protected/2014/TAC2014-workshop-notebook/results.html.
- [Freedman and Gabbard2014] Freedman, M. and Gabbard, R. (2014). An Overview of the TAC KBP Event Argument Extraction Evaluation. In Proceedings of NIST TAC 2014. http://www.nist.gov/tac/protected/2014/TAC2014-workshop-notebook/results.html.
- [Hong et al.2011] Hong, Y., Zhang, J., Ma, B., Yao, J., Zhou, G., and Zhu, Q. (2011). Using Cross-Entity Inference to Improve Event Extraction. In Proceedings of the ACL 2011.
- [Ji and Grishman2008] Ji, H. and Grishman, R. (2008). Refining Event Extraction Through Cross-Document Inference. In Proceedings of the ACL 2008.
- [Joe Ellis and Strassel2014] Joe Ellis, J. G. and Strassel, S. (2014). TAC KBP 2014 Event Argument Extraction Assessment Guidelines V. 1.4. http://www.nist.gov/tac/2014/KBP/Event/guidelines/TAC_KBP_2014_Event_Argument_Extraction_Assessment_Guidelines_V1.4.pdf.
- [Joe Ellis and Strassel2015] Joe Ellis, J. G. and Strassel, S. (2015). LDC2015E22: TAC KBP English Event Argument Extraction Comprehensive Pilot and Evaluation Data 2014.
- [Li et al.2013] Li, Q., Ji, H., and Huang, L. (2013). Joint Event Extraction via Structured Prediction with Global Features. In Proceedings of the ACL 2013.
- [NIST2014] NIST. (2014). TAC KBP 2014 Event Argument Task Description. http://www.nist.gov/tac/2014/KBP/Event/guidelines/EventArgumentTaskDescription.09042014.pdf.
- [Parker et al.2011] Parker, R., Graff, D., Kong, J., Chen, K., and Maeda, K. (2011). English Gigaword Fifth Edition. Linguistic Data Consortium, ISLRN 911-942-430-413-0.
- [Settles2011] Settles, B. (2011). Closing the Loop: Fast, Interactive Semi-Supervised Annotation with Queries on Features and Instances. In Proceedings of EMNLP 2011.
- [Strohman et al.2005] Strohman, T., Metzler, D., Turtle, H., and Croft, W. B. (2005). Indri: A language model-based search engine for complex queries. Proceedings of the International Conference on Intelligent Analysis, 2(6):2–6.
- [Walker et al.2006] Walker, C., Strassel, S., Medero, J., and Maeda, K. (2006). ACE 2005 Multilingual Training Corpus.