Detecting Table Region in PDF Documents Using Distant Supervision

06/29/2015
by Miao Fan, et al.

State-of-the-art approaches to table recognition compete on the 67 annotated government reports in PDF format released by the ICDAR 2013 Table Competition. In contrast, this paper contributes a novel paradigm that leverages large-scale unlabeled PDF documents for open-domain table detection. We integrate the paradigm into our newly developed system, PdfExtra, and detect table regions using 9,466 academic articles from the entire ACL Anthology repository, where almost all papers are archived in PDF format without table annotations. The paradigm first designs heuristics to automatically construct weakly labeled data. It then feeds diverse evidence, such as document layout and linguistic features extracted by Apache PDFBox and processed by the Stanford NLP toolkit, into several canonical classifiers. Finally, these classifiers (Naive Bayes, Logistic Regression and Support Vector Machine) collaboratively vote on the region of each table. Experimental results show that PdfExtra substantially outperforms the state-of-the-art approach. We also discuss how different features, learning models and document domains may affect performance. Extensive evaluations demonstrate that the paradigm is flexible enough to accommodate various features and learning models for open-domain table region detection in PDF files.
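The voting step can be illustrated with a minimal sketch. The abstract does not say how the three classifiers' votes are combined or what unit they label, so the code below assumes simple majority voting over per-line binary labels (1 = table, 0 = not table) and then merges consecutive table lines into regions. The function names `majority_vote` and `table_regions`, and the sample predictions, are hypothetical and are not taken from PdfExtra.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-line labels from several classifiers by majority vote.

    predictions[k][i] is classifier k's label for line i (1 = table, 0 = not).
    Assumption: plain majority voting; the paper may weight votes differently.
    """
    if not predictions:
        return []
    n_lines = len(predictions[0])
    if any(len(p) != n_lines for p in predictions):
        raise ValueError("all classifiers must label the same number of lines")
    voted = []
    for labels in zip(*predictions):
        # With three binary voters there is always a strict majority.
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


def table_regions(labels: Sequence[int]) -> List[Tuple[int, int]]:
    """Merge consecutive table-labelled lines into inclusive (start, end) ranges."""
    regions = []
    start = None
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i  # a new region opens here
        elif label != 1 and start is not None:
            regions.append((start, i - 1))  # the region closed on the previous line
            start = None
    if start is not None:
        regions.append((start, len(labels) - 1))
    return regions


# Hypothetical per-line predictions from the three classifiers.
nb = [0, 1, 1, 0, 0, 1]   # Naive Bayes
lr = [0, 1, 1, 1, 0, 0]   # Logistic Regression
svm = [0, 0, 1, 1, 0, 1]  # Support Vector Machine

voted = majority_vote([nb, lr, svm])
regions = table_regions(voted)
```

In this toy input, a line is kept as part of a table whenever at least two of the three classifiers agree, so a single classifier's mistake on a line does not split or extend a region on its own.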

