Aligning benchmark datasets for table structure recognition

03/01/2023
by Brandon Smock, et al.

Benchmark datasets for table structure recognition (TSR) must be carefully processed to ensure they are annotated consistently. However, even if a dataset's annotations are self-consistent, there may be significant inconsistency across datasets, which can harm the performance of models trained and evaluated on them. In this work, we show that aligning these benchmarks, removing both errors and inconsistency between them, improves model performance significantly. We demonstrate this through a data-centric approach where we adopt a single model architecture, the Table Transformer (TATR), that we hold fixed throughout. Baseline exact match accuracy for TATR evaluated on the ICDAR-2013 benchmark is 65% [...] combined. After reducing annotation mistakes and inter-dataset inconsistency, performance of TATR evaluated on ICDAR-2013 increases substantially to 75% when trained on PubTables-1M, 65% [...]. We show through ablations over the modification steps that canonicalization of the table annotations has a significantly positive effect on performance, while other choices balance necessary trade-offs that arise when deciding a benchmark dataset's final composition. Overall, we believe our work has significant implications for benchmark design for TSR and potentially other tasks as well. All dataset processing and training code will be released.
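To make the exact match metric mentioned above concrete, the following is a minimal sketch of one plausible reading of it: a table counts as correct only if every predicted cell agrees with the annotation. The cell representation (inclusive row and column spans) and the function names are assumptions made for illustration and are not taken from the paper or its released code.

```python
from typing import Iterable, List, Tuple

# Hypothetical cell representation: (row_start, row_end, col_start, col_end),
# using inclusive grid indices. This layout is an assumption, not the paper's format.
Cell = Tuple[int, int, int, int]


def tables_match(predicted: Iterable[Cell], truth: Iterable[Cell]) -> bool:
    """Return True only if both tables contain exactly the same set of cells."""
    return set(predicted) == set(truth)


def exact_match_accuracy(
    predictions: List[List[Cell]], truths: List[List[Cell]]
) -> float:
    """Fraction of tables whose predicted structure equals the annotation exactly."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not truths:
        raise ValueError("cannot compute accuracy over zero tables")
    matches = sum(tables_match(p, t) for p, t in zip(predictions, truths))
    return matches / len(truths)


# Two toy tables. The first is predicted perfectly. In the second, the annotation
# has one header cell spanning both columns, but the prediction splits it in two.
truths = [
    [(0, 0, 0, 0), (0, 0, 1, 1)],
    [(0, 0, 0, 1), (1, 1, 0, 0), (1, 1, 1, 1)],
]
predictions = [
    [(0, 0, 0, 0), (0, 0, 1, 1)],
    [(0, 0, 0, 0), (0, 0, 1, 1), (1, 1, 0, 0), (1, 1, 1, 1)],
]

accuracy = exact_match_accuracy(predictions, truths)
print(accuracy)  # 0.5
```

Under an all-or-nothing criterion like this, a single disagreement over how a spanning cell is annotated makes the whole table count as wrong, which suggests why inconsistent annotation conventions between a training set and an evaluation set could depress the score.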


