ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations

06/23/2021
by Sampanna Yashwant Kahu, et al.

We focus on electronic theses and dissertations (ETDs), aiming to improve access and expand their utility: more than 6 million are publicly available, and they constitute an important corpus for research and education across disciplines. The corpus keeps growing as new born-digital documents are added and as millions of older theses and dissertations have been converted to digital form for dissemination through institutional repositories. In ETDs, as in other scholarly works, figures and tables communicate a large amount of information concisely. Although methods have been proposed for extracting figures and tables from born-digital PDFs, they do not work well on scanned ETDs. Our assessment is that state-of-the-art figure extraction systems perform poorly on scanned PDFs because they have been trained only on born-digital documents. To address this limitation, we present ScanBank, a new dataset of 10 thousand scanned page images, manually labeled for the 3.3 thousand figures or tables they contain. We use this dataset to train a deep neural network model based on YOLOv5 to accurately extract figures and tables from scanned ETDs. We pose and answer research questions aimed at finding better methods for figure extraction from scanned documents. One of them concerns the training value of data augmentation: applying augmentation techniques to born-digital documents so that models trained on them are better suited to figure extraction from scanned documents. To the best of our knowledge, ScanBank is the first manually annotated dataset for figure and table extraction from scanned ETDs. A YOLOv5-based model trained on ScanBank outperforms existing comparable open-source and freely available baseline methods by a considerable margin.
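
The abstract does not say which augmentations were applied to born-digital pages, so the following is only a minimal sketch of the general idea, not the authors' pipeline. It assumes a page is a grayscale image stored as rows of 0-255 integers, and the function name `simulate_scan` and its parameters `noise_prob` and `fade` are hypothetical. It applies two degradations commonly associated with scans: washed-out contrast and salt-and-pepper speckle.

```python
import random


def simulate_scan(page, noise_prob=0.02, fade=0.85, seed=0):
    """Degrade a clean grayscale page (rows of 0-255 ints) to look more scan-like.

    Illustrative only: `noise_prob` and `fade` are assumed parameters,
    not values taken from the ScanBank paper.
    """
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    out = []
    for row in page:
        new_row = []
        for px in row:
            # Contrast fade: pull dark ink toward gray, as in a washed-out scan.
            value = 255 - int((255 - px) * fade)
            # Salt-and-pepper speckle: flip a few pixels to pure black or white.
            if rng.random() < noise_prob:
                value = rng.choice((0, 255))
            new_row.append(value)
        out.append(new_row)
    return out


# A tiny hypothetical "page": white background with one dark horizontal rule.
clean = [[255] * 8 for _ in range(6)]
clean[3] = [0] * 8

# With noise disabled, only the contrast fade is applied.
scanned = simulate_scan(clean, noise_prob=0.0)
```

Because both degradations act on pixel values and leave the page geometry untouched, any figure or table bounding boxes annotated on the clean page remain valid for the degraded copy. Geometric changes such as rotation or skew would require transforming the boxes as well.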
