ACL-Fig: A Dataset for Scientific Figure Classification

01/28/2023
by Zeba Karishma, et al.
Most existing large-scale academic search engines are built to retrieve text-based information. However, there are no large-scale retrieval services for scientific figures and tables. One challenge for such services is understanding scientific figures' semantics, such as their types and purposes. A key obstacle is the need for datasets containing annotated scientific figures and tables, which can then be used for classification, question-answering, and auto-captioning. Here, we develop a pipeline that extracts figures and tables from the scientific literature and a deep-learning-based framework that classifies scientific figures using visual features. Using this pipeline, we built the first large-scale automatically annotated corpus, ACL-Fig, consisting of 112,052 scientific figures extracted from 56K research papers in the ACL Anthology. The ACL-Fig-Pilot dataset contains 1,671 manually labeled scientific figures belonging to 19 categories. The dataset is accessible at https://huggingface.co/datasets/citeseerx/ACL-fig under a CC BY-NC license.
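For a rough sense of scale, the counts quoted in the abstract can be combined in a few lines of Python. This is only a back-of-the-envelope sketch: the variable names are illustrative, and "56K" is treated as exactly 56,000, which is an assumption because the abstract gives only a rounded paper count.

```python
# Counts quoted in the abstract.
TOTAL_FIGURES = 112_052      # automatically annotated figures in ACL-Fig
TOTAL_PAPERS = 56_000        # ASSUMPTION: "56K" taken as exactly 56,000
PILOT_FIGURES = 1_671        # manually labeled figures in ACL-Fig-Pilot
PILOT_CATEGORIES = 19        # figure categories in ACL-Fig-Pilot

# Average number of extracted figures per paper.
figures_per_paper = TOTAL_FIGURES / TOTAL_PAPERS

# Average number of manually labeled figures per category
# (the real per-category distribution is not given in the abstract).
pilot_per_category = PILOT_FIGURES / PILOT_CATEGORIES

# Size of the manually labeled set relative to the full corpus, in percent.
pilot_relative_size_pct = 100 * PILOT_FIGURES / TOTAL_FIGURES

print(f"figures per paper:        {figures_per_paper:.2f}")
print(f"pilot figures / category: {pilot_per_category:.1f}")
print(f"pilot size vs. corpus:    {pilot_relative_size_pct:.2f}%")
```

Under these assumptions the corpus averages roughly two figures per paper, the pilot set averages about 88 labeled figures per category, and the manually labeled set is about 1.5% the size of the automatically annotated corpus.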

research
04/30/2023

S2abEL: A Dataset for Entity Linking from Scientific Tables

Entity linking (EL) is the task of linking a textual mention to its corr...
research
07/28/2021

Tab2Know: Building a Knowledge Base from Tables in Scientific Papers

Tables in scientific papers contain a wealth of valuable knowledge for t...
research
05/19/2023

DMDD: A Large-Scale Dataset for Dataset Mentions Detection

The recognition of dataset names is a critical task for automatic inform...
research
02/01/2021

Metric-Type Identification for Multi-Level Header Numerical Tables in Scientific Papers

Numerical tables are widely used to present experimental results in scie...
research
03/15/2017

A Data Driven Approach for Compound Figure Separation Using Convolutional Neural Networks

A key problem in automatic analysis and understanding of scientific pape...
research
11/16/2022

ChartParser: Automatic Chart Parsing for Print-Impaired

Infographics are often an integral component of scientific documents for...
research
12/14/2022

MIST: a Large-Scale Annotated Resource and Neural Models for Functions of Modal Verbs in English Scientific Text

Modal verbs (e.g., "can", "should", or "must") occur highly frequently i...
