Large Synthetic Data from the arXiv for OCR Post Correction of Historic Scientific Articles

09/20/2023
by Jill P. Naiman, et al.

Scientific articles published prior to the "age of digitization" (~1997) require Optical Character Recognition (OCR) to transform scanned documents into machine-readable text, a process that often produces errors. We develop a pipeline for the generation of a synthetic ground truth/OCR dataset to correct the OCR results of the astrophysics literature holdings of the NASA Astrophysics Data System (ADS). By mining the arXiv we create, to the authors' knowledge, the largest scientific synthetic ground truth/OCR post correction dataset of 203,354,393 character pairs. We provide baseline models trained with this dataset and find the mean improvement in character and word error rates of 7.71 [...] parts of sentences as inline math, we find a classification F1 score of 77.82. Interactive dashboards to explore the dataset are available online: https://readingtimemachine.github.io/projects/1-ocr-groundtruth-may2023, and data and code, within the limitations of our agreement with the arXiv, are hosted on GitHub: https://github.com/ReadingTimeMachine/ocr_post_correction.
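The abstract reports results as character and word error rates without defining them here. As a minimal sketch, assuming the common definition (edit distance between the OCR output and the ground truth, divided by the ground-truth length), the two metrics could be computed as below. The function names `edit_distance`, `cer`, and `wer` are illustrative and are not taken from the authors' code; the sample strings are invented for the example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference item
                            curr[j - 1] + 1,      # insert a hypothesis item
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalised by the reference length (assumed definition)."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def cer(truth, ocr):
    """Character error rate: compare the strings character by character."""
    return error_rate(truth, ocr)


def wer(truth, ocr):
    """Word error rate: compare whitespace-separated tokens."""
    return error_rate(truth.split(), ocr.split())


# Hypothetical ground-truth/OCR pair, made up for illustration.
truth = "the mean flux"
ocr = "tbe mean f1ux"
print(f"CER = {cer(truth, ocr):.3f}, WER = {wer(truth, ocr):.3f}")
```

Under this assumed definition, an "improvement" would be the error rate of the raw OCR text minus the error rate of the corrected text, each measured against the same ground truth.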

research
09/13/2021

Post-OCR Document Correction with large Ensembles of Character Sequence Models

In this paper, we propose a novel method based on character sequence-to-...
research
08/22/2023

An extensible point-based method for data chart value detection

We present an extensible method for identifying semantic points to rever...
research
11/18/2022

Let's Enhance: A Deep Learning Approach to Extreme Deblurring of Text Images

This work presents a novel deep-learning-based pipeline for the inverse ...
research
09/09/2022

Figure and Figure Caption Extraction for Mixed Raster and Vector PDFs: Digitization of Astronomical Literature with OCR Features

Scientific articles published prior to the "age of digitization" in the ...
research
09/06/2018

Upcycle Your OCR: Reusing OCRs for Post-OCR Text Correction in Romanised Sanskrit

We propose a post-OCR text correction approach for digitising texts in R...
research
11/17/2021

Character Transformations for Non-Autoregressive GEC Tagging

We propose a character-based nonautoregressive GEC approach, with automa...
