Aggregating Crowdsourced and Automatic Judgments to Scale Up a Corpus of Anaphoric Reference for Fiction and Wikipedia Texts

10/11/2022
by Juntao Yu, et al.

Although several datasets annotated for anaphoric reference/coreference exist, even the largest of them are limited in size, range of domains, coverage of anaphoric phenomena, and length of the documents included. The approaches proposed so far to scale up anaphoric annotation have not produced datasets that overcome these limitations. In this paper, we introduce a new release of a corpus for anaphoric reference labelled via a game-with-a-purpose. This new release is comparable in size to the largest existing corpora for anaphoric reference, thanks in part to substantial activity by the players and in part to a new resolve-and-aggregate paradigm that 'completes' markable annotations by combining an anaphoric resolver with an aggregation method for anaphoric reference. The proposed method could be adopted to greatly reduce annotation time in other projects involving games-with-a-purpose. In addition, the corpus covers genres for which no datasets of comparable size exist (Fiction and Wikipedia); it covers singletons and non-referring expressions; and it includes a substantial number of long documents (> 2K in length).
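
The abstract does not spell out how resolve-and-aggregate works, so the following is only a minimal sketch of the general idea under stated assumptions: the automatic resolver's prediction is counted as one extra judgment next to the players' judgments, and plain majority voting stands in for the aggregation method actually used. The function name, the label strings, and the alphabetical tie-break are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Dict, List


def resolve_and_aggregate(
    player_labels: Dict[str, List[str]],
    resolver_labels: Dict[str, str],
) -> Dict[str, str]:
    """Combine player judgments with one resolver judgment per markable.

    Assumption: majority voting replaces the paper's aggregation method,
    and the resolver counts as a single additional annotator.
    """
    aggregated: Dict[str, str] = {}
    for markable in set(player_labels) | set(resolver_labels):
        # Start from the crowd judgments (possibly none for this markable).
        votes = list(player_labels.get(markable, []))
        # Add the automatic judgment, which 'completes' sparse markables.
        if markable in resolver_labels:
            votes.append(resolver_labels[markable])
        if not votes:
            continue  # nothing to aggregate for this markable
        counts = Counter(votes)
        top = max(counts.values())
        # Hypothetical tie-break: pick the alphabetically first top label.
        aggregated[markable] = min(
            label for label, count in counts.items() if count == top
        )
    return aggregated


# Hypothetical label strings: "new", "old:<antecedent>", "non_referring".
players = {"m1": ["new", "old:m0", "old:m0"], "m2": ["new"]}
resolver = {"m2": "old:m1", "m3": "non_referring"}
result = resolve_and_aggregate(players, resolver)
```

In this toy input, markable `m3` has no player judgments at all, so its label comes entirely from the resolver; that is the sense in which automatic judgments can complete annotations that players have not yet finished.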

