Figure and Figure Caption Extraction for Mixed Raster and Vector PDFs: Digitization of Astronomical Literature with OCR Features

09/09/2022
by J. P. Naiman, et al.

Scientific articles published prior to the "age of digitization" in the late 1990s contain figures which are "trapped" within their scanned pages. While progress to extract figures and their captions has been made, there is currently no robust method for this process. We present a YOLO-based method for use on scanned pages, post-Optical Character Recognition (OCR), which uses both grayscale and OCR features. When applied to the astrophysics literature holdings of the Astrophysics Data System (ADS), we find F1 scores of 90.9% (92.2%) for figures (figure captions) with an intersection-over-union (IOU) cut-off of 0.9, which is a significant improvement over other state-of-the-art methods.
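To make the reported metric concrete, the following is a minimal sketch of how an F1 score at a fixed overlap cut-off is conventionally computed for bounding-box detections. It assumes the 0.9 cut-off refers to intersection-over-union between predicted and ground-truth boxes, and it uses a simple greedy one-to-one matching. The function names `iou` and `f1_at_iou`, the box layout, and the matching strategy are illustrative assumptions, not the authors' evaluation code.

```python
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in page pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(preds: List[Box], truths: List[Box], cutoff: float = 0.9) -> float:
    """F1 where a prediction is a true positive only if it overlaps a
    still-unmatched ground-truth box with IoU >= cutoff (greedy matching)."""
    unmatched = list(truths)
    true_pos = 0
    for pred in preds:
        # Pick the remaining ground-truth box with the highest overlap.
        best = max(unmatched, key=lambda t: iou(pred, t), default=None)
        if best is not None and iou(pred, best) >= cutoff:
            true_pos += 1
            unmatched.remove(best)  # each truth box can be matched once
    false_pos = len(preds) - true_pos
    false_neg = len(truths) - true_pos
    denom = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denom if denom else 0.0


if __name__ == "__main__":
    truths = [(0, 0, 100, 100), (200, 200, 300, 300)]
    preds = [(0, 0, 100, 98), (200, 200, 300, 250)]
    print(f1_at_iou(preds, truths, cutoff=0.9))
    print(f1_at_iou(preds, truths, cutoff=0.5))
```

In this toy example the second prediction covers only half of its target, so it counts as a miss at a strict cut-off but as a hit at a looser one. This is why a score quoted at 0.9 is a more demanding localization test than one quoted at 0.5.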

