Revisiting Table Detection Datasets for Visually Rich Documents

05/04/2023
by Bin Xiao, et al.

Table Detection has become a fundamental task for visually rich document understanding with the surging number of electronic documents. Several open datasets are widely used in existing studies. However, these popular datasets have inherent limitations, including noisy and inconsistent samples, a limited number of training samples, and a limited number of data sources. These limitations make the datasets unreliable for evaluating model performance, so results on them may not reflect the actual capacity of models. Therefore, in this paper, we revisit some open datasets with high-quality annotations, identify and clean the noise, and align the annotation definitions of these datasets to merge them into a larger dataset, termed Open-Tables. Moreover, to enrich the data sources, we propose a new dataset, termed ICT-TD, built from the PDF files of Information and Communication Technologies (ICT) commodities, a different domain containing unique samples that hardly appear in open datasets. To ensure the label quality of the dataset, we annotated it manually following the guidance of a domain expert. The proposed dataset has a larger intra-variance and smaller inter-variance, making it more challenging and representative of actual cases in the business context. We built strong baselines using various state-of-the-art object detection models and also built baselines in the cross-domain setting. Our experimental results show that the domain differences among existing open datasets are small, even though they have different data sources. Our proposed Open-Tables and ICT-TD are more suitable for the cross-domain setting and can provide more reliable evaluation for models because of their high-quality and consistent annotations.

