The Digitization of Historical Astrophysical Literature with Highly-Localized Figures and Figure Captions

02/22/2023
by Jill P. Naiman, et al.

Scientific articles published prior to the "age of digitization" in the late 1990s contain figures that are "trapped" within their scanned pages. While progress has been made in extracting figures and their captions, there is currently no robust method for this process. We present a YOLO-based method for use on scanned pages, after they have been processed with Optical Character Recognition (OCR), which uses both grayscale and OCR features. We focus our efforts on translating the intersection-over-union (IOU) metric from the field of object detection to document layout analysis and quantify "high localization" levels as an IOU of 0.9. When applied to the astrophysics literature holdings of the NASA Astrophysics Data System (ADS), we find F1 scores of 90.9% at an IOU cut-off of 0.9, which is a significant improvement over other state-of-the-art methods.
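To make the IOU cut-off concrete, here is a minimal Python sketch of how an IOU between two boxes and an F1 score at a fixed IOU threshold might be computed. It is an illustration only, not the authors' evaluation code: the box format `(x_min, y_min, x_max, y_max)`, the function names `iou` and `f1_at_threshold`, and the greedy one-to-one matching of predictions to ground-truth boxes are all assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max); this format is an assumption.
    """
    # Corners of the overlapping rectangle, if any.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes give no intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_at_threshold(predicted, truth, threshold=0.9):
    """F1 score where a prediction counts as a true positive only if its
    best unmatched ground-truth box has IOU >= threshold.

    Greedy matching is a simplification chosen for this sketch.
    """
    matched = set()
    true_pos = 0
    for pred in predicted:
        best_index, best_value = None, 0.0
        for index, true_box in enumerate(truth):
            if index in matched:
                continue
            value = iou(pred, true_box)
            if value > best_value:
                best_index, best_value = index, value
        if best_index is not None and best_value >= threshold:
            matched.add(best_index)
            true_pos += 1

    false_pos = len(predicted) - true_pos
    false_neg = len(truth) - true_pos
    denom = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denom if denom else 0.0
```

With a 0.9 cut-off, a predicted figure box that overlaps the true box at an IOU of 0.5 would count as both a false positive and a false negative, which is why a high-localization threshold is a much stricter test than the lower cut-offs common in object detection.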


