Towards Knowledge Discovery from the Vatican Secret Archives. In Codice Ratio -- Episode 1: Machine Transcription of the Manuscripts

03/08/2018
by Donatella Firmani, et al.

In Codice Ratio is a research project that studies tools and techniques for analyzing the contents of historical documents conserved in the Vatican Secret Archives (VSA). In this paper, we present our efforts to develop a system that supports the transcription of medieval manuscripts. The goal is to give paleographers a tool that reduces the effort of transcribing large volumes, such as those stored in the VSA, by producing good transcriptions for significant portions of the manuscripts. We propose an original approach based on character segmentation. Our solution is able to deal with the dirty segmentation that inevitably occurs in handwritten documents. We use a convolutional neural network to recognize characters and language models to compose word transcriptions. Our approach requires minimal training effort, which makes the transcription process more scalable: producing a training set requires only a few pages and can easily be crowdsourced. We have conducted experiments on manuscripts from the Vatican Registers, an unreleased corpus containing the correspondence of the popes. With training data produced by 120 high school students, our system has been able to produce good transcriptions that paleographers can use as a solid basis to speed up the transcription process at a large scale.
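
To make the idea of composing word transcriptions from imperfect character segments concrete, here is a minimal sketch. It is not the authors' implementation: the function name `best_transcription`, the input layout, and the use of a character bigram table as the language model are all assumptions made for illustration. It supposes that a word image has been cut at several candidate points, that a character classifier has assigned probabilities to each candidate segment, and that the best transcription is the path through those segments that maximizes classifier and language-model scores together.

```python
import math

def best_transcription(n_cuts, segment_scores, bigram, floor=1e-6):
    """Pick the most likely word from candidate character segments.

    n_cuts         -- index of the last cut point (cut points are 0..n_cuts)
    segment_scores -- {(start, end): {char: prob}}, hypothetical classifier output
    bigram         -- {(prev_char, char): prob}, "^" marks the start of a word
    floor          -- small probability used for unseen bigrams or zero scores
    """
    # best[(position, last_char)] = (log score, text so far)
    best = {(0, "^"): (0.0, "")}
    for end in range(1, n_cuts + 1):
        for (start, seg_end), char_probs in segment_scores.items():
            if seg_end != end or start >= end:
                continue
            # Snapshot the states, since we add new ones while extending paths.
            for (pos, prev), (score, text) in list(best.items()):
                if pos != start:
                    continue
                for ch, p in char_probs.items():
                    lm = bigram.get((prev, ch), floor)
                    cand = score + math.log(max(p, floor)) + math.log(lm)
                    key = (end, ch)
                    if key not in best or cand > best[key][0]:
                        best[key] = (cand, text + ch)
    finals = [v for (pos, _), v in best.items() if pos == n_cuts]
    if not finals:
        return None  # no sequence of segments covers the whole word
    return max(finals)[1]

# Toy example (made-up numbers): three strokes that could read "in", "ni" or "m".
segments = {
    (0, 1): {"i": 0.9},
    (1, 3): {"n": 0.8},
    (0, 2): {"n": 0.6},
    (2, 3): {"i": 0.9},
    (0, 3): {"m": 0.7},
}
bigrams = {("^", "i"): 0.3, ("i", "n"): 0.4, ("^", "m"): 0.05,
           ("^", "n"): 0.1, ("n", "i"): 0.1}
result = best_transcription(3, segments, bigrams)
print(result)
```

In this toy input, overlapping segments stand in for dirty segmentation: the classifier alone cannot decide where one character ends and the next begins, so the bigram scores break the tie between competing readings. A real system would likely use a richer language model and a pruned search, which this sketch leaves out.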

research
09/23/2019

Handwritten Amharic Character Recognition Using a Convolutional Neural Network

Amharic is the official language of the Federal Democratic Republic of E...
research
07/30/2023

Optimizing the Neural Network Training for OCR Error Correction of Historical Hebrew Texts

Over the past few decades, large archives of paper-based documents such ...
research
04/26/2023

SIMARA: a database for key-value information extraction from full pages

We propose a new database for information extraction from historical han...
research
10/09/2018

Decipherment of Historical Manuscript Images

European libraries and archives are filled with enciphered manuscripts f...
research
06/20/2019

Pattern Spotting in Historical Documents Using Convolutional Models

Pattern spotting consists of searching in a collection of historical doc...
research
02/26/2023

User-Centric Evaluation of OCR Systems for Kwak'wala

There has been recent interest in improving optical character recognitio...
research
01/02/2019

Lipi Gnani - A Versatile OCR for Documents in any Language Printed in Kannada Script

A Kannada OCR, named Lipi Gnani, has been designed and developed from sc...
