Longtonotes: OntoNotes with Longer Coreference Chains

10/07/2022
by   Kumar Shridhar, et al.
0

Ontonotes has served as the most important benchmark for coreference resolution. However, for ease of annotation, several long documents in Ontonotes were split into smaller parts. In this work, we build a corpus of coreference-annotated documents of significantly longer length than what is currently available. We do so by providing an accurate, manually-curated, merging of annotations from documents that were split into multiple parts in the original Ontonotes annotation process. The resulting corpus, which we call LongtoNotes contains documents in multiple genres of the English language with varying lengths, the longest of which are up to 8x the length of documents in Ontonotes, and 2x those in Litbank. We evaluate state-of-the-art neural coreference systems on this new corpus, analyze the relationships between model architectures/hyperparameters and document length on performance and efficiency of the models, and demonstrate areas of improvement in long-document coreference modeling revealed by our new corpus. Our data and code is available at: https://github.com/kumar-shridhar/LongtoNotes.

READ FULL TEXT

page 12

page 13

research
06/06/2017

Marmara Turkish Coreference Corpus and Coreference Resolution Baseline

We describe the Marmara Turkish Coreference Corpus, which is an annotati...
research
04/28/2023

CED: Catalog Extraction from Documents

Sentence-by-sentence information extraction from long documents is an ex...
research
12/03/2019

An Annotated Dataset of Coreference in English Literature

We present in this work a new dataset of coreference annotations for wor...
research
07/19/2023

IncDSI: Incrementally Updatable Document Retrieval

Differentiable Search Index is a recently proposed paradigm for document...
research
10/11/2022

Aggregating Crowdsourced and Automatic Judgments to Scale Up a Corpus of Anaphoric Reference for Fiction and Wikipedia Texts

Although several datasets annotated for anaphoric reference/coreference ...
research
01/25/2022

The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization

We present a novel benchmark and associated evaluation metrics for asses...
research
04/13/2018

Are Abstracts Enough for Hypothesis Generation?

The potential for automatic hypothesis generation (HG) systems to improv...

Please sign up or login with your details

Forgot password? Click here to reset