Locating Tables in Scanned Documents for Reconstructing and Republishing (ICIAfS14)

12/24/2014
by Akmal Jahan Mac, et al.

The pool of knowledge available to mankind depends on its learning resources, which range from ancient printed documents to present-day electronic material. Rapidly converting the material held in traditional libraries to digital form requires significant work if the electronic documents are to keep the same format and look as their printed counterparts. Most printed documents contain not only characters and their formatting but also associated non-text objects such as tables, charts and graphics. Detecting these objects and preserving the format of their contents during reproduction is challenging. To address this issue, we propose an algorithm that uses local thresholds for word space and line height to locate and extract all categories of tables from scanned document images. From experiments performed on 298 documents, we conclude that our algorithm has an overall accuracy of about 75% on scanned document images. Because the algorithm does not depend entirely on rule lines, it can detect all categories of tables in a range of scanned documents with different font types, styles and sizes, and extract their formatting features. Moreover, the algorithm can be applied to locate tables in multi-column layouts with a small modification to the layout analysis. Treating tables together with their existing formatting features will greatly help in reproducing printed documents for reprinting and updating.
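The abstract does not spell out how the local thresholds are computed, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that each text line is already segmented into word boxes, that a word gap counts as a column separator when it exceeds a multiple of that line's own height, and that two or more consecutive such lines form a table region. The names `wide_gaps`, `locate_table_regions`, `gap_factor` and `min_rows`, and the value 1.5, are hypothetical choices made for illustration.

```python
def wide_gaps(words, line_height, gap_factor=1.5):
    """Return the gaps between consecutive words that exceed a per-line threshold.

    words: list of (x_left, x_right) word boxes on one text line.
    The threshold scales with the line's own height (an assumed rule), so it
    adapts to the local font size instead of using one global pixel value.
    """
    threshold = gap_factor * line_height
    ordered = sorted(words)
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]
    return [g for g in gaps if g > threshold]


def locate_table_regions(lines, gap_factor=1.5, min_rows=2):
    """Group consecutive table-like lines into (first_index, last_index) regions.

    lines: list of (top, bottom, words) tuples in reading order.
    A line is table-like if it contains at least one wide gap.
    """
    regions = []
    start = None
    for i, (top, bottom, words) in enumerate(lines):
        candidate = bool(wide_gaps(words, bottom - top, gap_factor))
        if candidate and start is None:
            start = i  # a run of table-like lines begins here
        elif not candidate and start is not None:
            if i - start >= min_rows:
                regions.append((start, i - 1))
            start = None
    # Close a run that extends to the last line of the page.
    if start is not None and len(lines) - start >= min_rows:
        regions.append((start, len(lines) - 1))
    return regions


# Made-up sample page: lines 1 and 2 have column-sized gaps, lines 0 and 3 do not.
sample = [
    (0, 10, [(0, 30), (34, 60), (64, 100)]),
    (20, 30, [(0, 30), (80, 110), (160, 190)]),
    (40, 50, [(0, 25), (80, 100), (160, 185)]),
    (60, 70, [(0, 40), (44, 90)]),
]
print(locate_table_regions(sample))  # → [(1, 2)]
```

Because the rule looks only at word spacing relative to line height, this sketch needs no ruling lines, which is in the spirit of the abstract's claim; a real system would still need layout analysis for multi-column pages, where the gutter between columns would otherwise look like a table gap.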

research
10/10/2022

A two-stage approach for table extraction in invoices

The automated analysis of administrative documents is an important field...
research
02/07/2022

Combining Deep Learning and Reasoning for Address Detection in Unstructured Text Documents

Extracting information from unstructured text documents is a demanding t...
research
08/26/2023

Bengali Document Layout Analysis with Detectron2

Document digitization is vital for preserving historical records, effici...
research
06/23/2021

ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations

We focus on electronic theses and dissertations (ETDs), aiming to improv...
research
01/15/2019

Integrating and querying similar tables from PDF documents using deep learning

Large amount of public data produced by enterprises are in semi-structur...
research
06/21/2022

Document Navigability: A Need for Print-Impaired

Printed documents continue to be a challenge for blind, low-vision, and ...
research
04/17/2018

A Saliency-based Convolutional Neural Network for Table and Chart Detection in Digitized Documents

Deep Convolutional Neural Networks (DCNNs) have recently been applied su...
