Combining OCR Models for Reading Early Modern Printed Books

05/11/2023
by Mathias Seuret, et al.

In this paper, we investigate the use of fine-grained font recognition for OCR of books printed from the 15th to the 18th century. We use a newly created dataset for OCR of early printed books in which fonts are labeled with bounding boxes, so we know not only the font group used for each character but also the locations of font changes. In books of this period, font groups frequently change mid-line or even mid-word, indicating a change of language. We consider 8 font groups present in our corpus and investigate 13 subsets: the whole dataset; text lines with a single font, with multiple fonts, with Roman fonts, and with Gothic fonts; and text lines in each of the 8 considered fonts. We show that OCR performance is strongly affected by font style and that selecting fine-tuned models based on font group recognition has a very positive impact on the results. Moreover, we developed a system that uses local font group recognition to combine the output of multiple font recognition models, and show that, while slower, this approach performs better not only on text lines composed of multiple fonts but also on those containing a single font.
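
The two ideas in the abstract, selecting a fine-tuned model per font group and combining model outputs using local font recognition, can be illustrated with a minimal sketch. This is not the authors' implementation: the data shapes, the font group names ("fraktur", "antiqua"), the "baseline" fallback key, and both function names are assumptions made purely for illustration. The sketch assumes each OCR model returns characters with a horizontal center position, and that a font recognizer returns horizontal segments labeled with a font group.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical shapes (assumptions, not taken from the paper):
#   Char    = (character, x_center) as produced by one OCR model for a text line
#   Segment = (x_start, x_end, font_group) as produced by a local font recognizer
Char = Tuple[str, float]
Segment = Tuple[float, float, str]


def majority_font(segments: List[Segment]) -> str:
    """Return the font group covering the largest horizontal extent of the line.

    Could be used to select a single fine-tuned model for the whole line.
    """
    widths: Counter = Counter()
    for x_start, x_end, font in segments:
        widths[font] += x_end - x_start
    return widths.most_common(1)[0][0]


def combine_outputs(outputs: Dict[str, List[Char]], segments: List[Segment]) -> str:
    """Splice per-model outputs together, segment by segment.

    For each font segment, keep the characters whose center falls inside it,
    taken from the model fine-tuned on that segment's font group. Falls back
    to the hypothetical "baseline" model when no such model exists.
    """
    pieces = []
    for x_start, x_end, font in sorted(segments):
        chars = outputs.get(font, outputs["baseline"])
        pieces.append("".join(c for c, x in chars if x_start <= x < x_end))
    return "".join(pieces)


# Toy line: the first three characters are in one font group, the last in another.
segments = [(0.0, 30.0, "fraktur"), (30.0, 40.0, "antiqua")]
outputs = {
    "fraktur": [("d", 5.0), ("e", 15.0), ("r", 25.0), ("x", 35.0)],
    "antiqua": [("b", 5.0), ("c", 15.0), ("t", 25.0), ("a", 35.0)],
    "baseline": [("d", 5.0), ("c", 15.0), ("r", 25.0), ("o", 35.0)],
}
selected = majority_font(segments)
combined = combine_outputs(outputs, segments)
```

Running every font-specific model on each line and splicing the results costs more than running one selected model, which is consistent with the abstract's remark that the combined approach is slower.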


