Scalable handwritten text recognition system for lexicographic sources of under-resourced languages and alphabets

03/28/2023
by Jan Idziak, et al.

The paper discusses an approach to deciphering large collections of handwritten index cards of historical dictionaries. Our study provides a working solution that reads the cards and links their lemmas to a searchable list of dictionary entries for a large historical dictionary, the Dictionary of the 17th- and 18th-century Polish, which comprises 2.8 million index cards. We apply a tailored handwritten text recognition (HTR) solution that involves (1) an optimized detection model; (2) a recognition model to decipher the handwritten content, designed as a spatial transformer network (STN) followed by a convolutional neural network (RCNN) with a connectionist temporal classification (CTC) layer, trained on a synthetic set of 500,000 generated Polish words of different lengths; and (3) a post-processing step using constrained Word Beam Search (WBC), in which the predictions are matched against a list of dictionary entries known in advance. Our model achieved an accuracy of 0.881 at the word level, which outperforms the base RCNN model. Within this study we also produced a set of 20,000 manually annotated index cards that can be used for future benchmarks and for transfer learning in HTR applications.
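To make steps (2) and (3) of the abstract concrete, the sketch below shows how per-frame CTC output can be collapsed into a string and then constrained to a list of known dictionary entries. It is a simplified stand-in and not the paper's method: it uses best-path (greedy) CTC decoding and a nearest-entry lookup by edit distance instead of Word Beam Search. The function names, the toy alphabet, the blank index, and the three-word lexicon are all assumptions made for illustration.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_ids: Sequence[int], alphabet: str) -> str:
    """Best-path CTC decoding: collapse repeated labels, then drop blanks."""
    chars = []
    prev = BLANK
    for idx in frame_ids:
        if idx != prev and idx != BLANK:
            chars.append(alphabet[idx - 1])  # index 0 is reserved for the blank
        prev = idx
    return "".join(chars)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        row = [i]
        for j, cb in enumerate(b, start=1):
            row.append(min(prev_row[j] + 1,                # deletion
                           row[j - 1] + 1,                 # insertion
                           prev_row[j - 1] + (ca != cb)))  # substitution
        prev_row = row
    return prev_row[-1]


def snap_to_lexicon(word: str, lexicon: List[str]) -> str:
    """Return the lexicon entry closest to the raw prediction."""
    return min(lexicon, key=lambda entry: edit_distance(word, entry))


# Hypothetical data: a five-letter alphabet and per-frame argmax labels.
alphabet = "dkmot"                  # d=1, k=2, m=3, o=4, t=5
frame_ids = [1, 1, 0, 4, 3, 0, 3]   # decodes to the misreading "domm"
lexicon = ["dom", "kot", "most"]    # toy list of dictionary entries

raw = ctc_greedy_decode(frame_ids, alphabet)
matched = snap_to_lexicon(raw, lexicon)
print(raw, "->", matched)
```

Word Beam Search differs from this sketch in that it applies the lexicon constraint during decoding, over the full per-frame probability distribution, instead of correcting a single best-path string afterwards.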


research
03/10/2023

Marginalia and machine learning: Handwritten text recognition for Marginalia Collections

The pressing need for digitization of historical document collections ha...
research
08/16/2023

Handwriting Analysis on the Diaries of Rosamond Jacob

Handwriting is an art form that most people learn at an early age. Each ...
research
07/10/2019

Fully Convolutional Networks for Handwriting Recognition

Handwritten text recognition is challenging because of the virtually inf...
research
06/22/2022

Connecting a French Dictionary from the Beginning of the 20th Century to Wikidata

The Petit Larousse illustré is a French dictionary first published in 19...
research
08/18/2023

A tailored Handwritten-Text-Recognition System for Medieval Latin

The Bavarian Academy of Sciences and Humanities aims to digitize its Med...
research
09/18/2019

Unsupervised Writer Adaptation for Synthetic-to-Real Handwritten Word Recognition

Handwritten Text Recognition (HTR) is still a challenging problem becaus...
research
12/22/2017

Emo, Love, and God: Making Sense of Urban Dictionary, a Crowd-Sourced Online Dictionary

The Internet facilitates large-scale collaborative projects. The emergen...
