Scientific evidence extraction

09/30/2021
by Brandon Smock, et al.

Recently, interest has grown in applying machine learning to the problem of table structure inference and extraction from unstructured documents. However, progress in this area has been challenging both to make and to measure, due to several issues that arise in training and evaluating models from labeled data. These include challenges as fundamental as the lack of a single definitive ground truth output for each input sample and the lack of an ideal metric for measuring partial correctness for this task. To address these issues, we propose a new dataset, PubMed Tables One Million (PubTables-1M), and a new class of metric, grid table similarity (GriTS). PubTables-1M is nearly twice as large as the previous largest comparable dataset, can be used for models across multiple architectures and modalities, and addresses issues such as ambiguity and lack of consistency in the annotations. We apply DETR to table extraction for the first time and show that object detection models trained on PubTables-1M produce excellent results out of the box for all three tasks: detection, structure recognition, and functional analysis. We describe the dataset in detail to enable others to build on our work and to combine this data with other datasets for these and related tasks. We hope that PubTables-1M and the proposed metrics can further progress in this area by providing a benchmark suitable for training and evaluating a wide variety of table extraction models. Data and code will be released at https://github.com/microsoft/table-transformer.


research
03/23/2022

GriTS: Grid table similarity metric for table structure recognition

In this paper, we propose a new class of evaluation metric for table str...
research
02/02/2023

CTE: A Dataset for Contextualized Table Extraction

Relevant information in documents is often summarized in tables, helping...
research
03/01/2023

Aligning benchmark datasets for table structure recognition

Benchmark datasets for table structure recognition (TSR) must be careful...
research
04/29/2020

AxCell: Automatic Extraction of Results from Machine Learning Papers

Tracking progress in machine learning has become increasingly difficult ...
research
07/31/2022

Evaluating Table Structure Recognition: A New Perspective

Existing metrics used to evaluate table structure recognition algorithms...
research
02/16/2021

TableLab: An Interactive Table Extraction System with Adaptive Deep Learning

Table extraction from PDF and image documents is a ubiquitous task in th...
research
05/30/2021

ICDAR 2021 Competition on Scientific Table Image Recognition to LaTeX

Tables present important information concisely in many scientific docume...
