SciREX: A Challenge Dataset for Document-Level Information Extraction

05/01/2020
by   Sarthak Jain, et al.

Extracting information from full documents is an important problem in many domains, but most previous work has focused on identifying relationships within a sentence or a paragraph. Creating a large-scale information extraction (IE) dataset at the document level is challenging, since annotating entities and their document-level relationships, which usually span sentences or even sections, requires an understanding of the whole document. In this paper, we introduce SciREX, a document-level IE dataset that encompasses multiple IE tasks, including salient entity identification and document-level N-ary relation identification from scientific articles. We annotate our dataset by integrating automatic and human annotations, leveraging existing scientific knowledge resources. We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE. Analyzing the model's performance shows a significant gap between human performance and current baselines, inviting the community to use our dataset as a challenge to develop document-level IE models. Our data and code are publicly available at https://github.com/allenai/SciREX


