Image-based table recognition: data, model, and evaluation

11/25/2019
by   Xu Zhong, et al.
0

Important information that relates to a specific topic in a document is often organized in tabular format to assist readers with information retrieval and comparison, which may be difficult to provide in natural language. However, tabular data in unstructured digital documents, e.g., Portable Document Format (PDF) and images, are difficult to parse into structured machine-readable format, due to complexity and diversity in their structure and style. To facilitate image-based table recognition with deep learning, we develop the largest publicly available table recognition dataset PubTabNet (https://github.com/ibm-aur-nlp/PubTabNet), containing 568k table images with corresponding structured HTML representation. PubTabNet is automatically generated by matching the XML and PDF representations of the scientific articles in PubMed Central Open Access Subset (PMCOA). We also propose a novel attention-based encoder-dual-decoder (EDD) architecture that converts images of tables into HTML code. The model has a structure decoder which reconstructs the table structure and helps the cell decoder to recognize cell content. In addition, we propose a new Tree-Edit-Distance-based Similarity (TEDS) metric for table recognition. The experiments demonstrate that the EDD model can accurately recognize complex tables solely relying on the image representation, outperforming the state-of-the-art by 7.7

READ FULL TEXT
research
03/05/2019

TableBank: Table Benchmark for Image-based Table Detection and Recognition

We present TableBank, a new image-based table detection and recognition ...
research
03/17/2020

GFTE: Graph-based Financial Table Extraction

Tabular data is a crucial form of information expression, which can orga...
research
12/06/2022

Multimodal Tree Decoder for Table of Contents Extraction in Document Images

Table of contents (ToC) extraction aims to extract headings of different...
research
04/20/2018

A Mixed Hierarchical Attention based Encoder-Decoder Approach for Standard Table Summarization

Structured data summarization involves generation of natural language su...
research
03/08/2022

Table Structure Recognition with Conditional Attention

Tabular data in digital documents is widely used to express compact and ...
research
07/12/2021

Split, embed and merge: An accurate table structure recognizer

The task of table structure recognition is to recognize the internal str...
research
06/29/2015

Detecting Table Region in PDF Documents Using Distant Supervision

Superior to state-of-the-art approaches which compete in table recogniti...

Please sign up or login with your details

Forgot password? Click here to reset