DMDD: A Large-Scale Dataset for Dataset Mentions Detection

05/19/2023
by   Huitong Pan, et al.

The recognition of dataset names is a critical task for automatic information extraction from scientific literature, enabling researchers to understand and identify research opportunities. However, existing corpora for dataset mention detection are limited in size and naming diversity. In this paper, we introduce the Dataset Mentions Detection Dataset (DMDD), the largest publicly available corpus for this task. DMDD consists of two parts: the DMDD main corpus, comprising 31,219 scientific articles with over 449,000 dataset mentions weakly annotated as in-text spans, and an evaluation set of 450 manually annotated scientific articles. We use DMDD to establish baseline performance for dataset mention detection and linking. By analyzing the performance of various models on DMDD, we identify open problems in dataset mention detection. We invite the community to use our dataset as a challenge for developing novel dataset mention detection models.

research
05/01/2020

SciREX: A Challenge Dataset for Document-Level Information Extraction

Extracting information from full documents is an important problem in ma...
research
01/28/2023

ACL-Fig: A Dataset for Scientific Figure Classification

Most existing large-scale academic search engines are built to retrieve ...
research
12/14/2022

MIST: a Large-Scale Annotated Resource and Neural Models for Functions of Modal Verbs in English Scientific Text

Modal verbs (e.g., "can", "should", or "must") occur highly frequently i...
research
12/01/2020

HORAE: an annotated dataset of books of hours

We introduce in this paper a new dataset of annotated pages from books o...
research
05/06/2019

A Large Parallel Corpus of Full-Text Scientific Articles

The Scielo database is an important source of scientific information in ...
research
10/18/2022

Making Science Simple: Corpora for the Lay Summarisation of Scientific Literature

Lay summarisation aims to jointly summarise and simplify a given text, t...
research
08/18/2022

Challenges and opportunities in applying Neural Temporal Point Processes to large scale industry data

In this work, we identify open research opportunities in applying Neural...
