CTE: A Dataset for Contextualized Table Extraction

02/02/2023
by Andrea Gemelli, et al.

Relevant information in documents is often summarized in tables, helping the reader to identify useful facts. Most benchmark datasets support either document layout analysis or table understanding, but do not provide the data needed to address both tasks in a unified way. We define the task of Contextualized Table Extraction (CTE), which aims to extract tables and define their structure while considering the textual context of the document. The dataset comprises 75k fully annotated pages of scientific papers, including more than 35k tables. Data are gathered from PubMed Central, merging the information provided by annotations in the PubTables-1M and PubLayNet datasets. The dataset supports CTE and adds new classes to the original ones. The generated annotations can be used to develop end-to-end pipelines for various tasks, including document layout analysis, table detection, structure recognition, and functional analysis. We formally define CTE and its evaluation metrics, showing which subtasks can be tackled and describing the advantages, limitations, and future work of this collection of data. Annotations and code will be accessible at https://github.com/AILab-UniFI/cte-dataset.


