Digital Editions as Distant Supervision for Layout Analysis of Printed Books

12/23/2021
by Alejandro H. Toselli, et al.

Archivists, textual scholars, and historians often produce digital editions of historical documents. Using markup schemes such as those of the Text Encoding Initiative and EpiDoc, these digital editions typically transcribe the documents' textual content and also record their semantic regions (such as notes and figures) and physical features (such as page and line breaks). We describe methods for exploiting this semantic markup as distant supervision for training and evaluating layout analysis models. In experiments with several model architectures on the half-million pages of the Deutsches Textarchiv (DTA), we find a high correlation of these region-level evaluation methods with pixel-level and word-level metrics. We discuss the possibilities for improving accuracy with self-training and the ability of models trained on the DTA to generalize to other historical printed books.
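To illustrate how markup of this kind could yield page-level region labels, here is a minimal Python sketch. It is not the authors' pipeline. It assumes a TEI-namespaced document in which `pb` marks page breaks, and it tallies a hypothetical set of region elements (`note`, `figure`, `head`, `p`) on each page. The function name `regions_per_page`, the `REGION_TAGS` set, and the sample document are all invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical choice of TEI elements to treat as layout regions.
REGION_TAGS = {"note", "figure", "head", "p"}


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.split("}")[-1]


def regions_per_page(tei_xml: str) -> list:
    """Return one Counter of region tags per page, in document order.

    A region is attributed to the page on which its opening tag appears;
    regions that continue across a page break are not split here.
    """
    root = ET.fromstring(tei_xml)
    pages = []
    current = None
    for elem in root.iter():  # iter() walks elements in document order
        name = local_name(elem.tag)
        if name == "pb":
            # A page break starts a new page.
            current = Counter()
            pages.append(current)
        elif name in REGION_TAGS and current is not None:
            current[name] += 1
    return pages


# Invented sample document, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<pb n="1"/><head>Title</head><p>First<lb/>para</p><note>n1</note>
<pb n="2"/><p>Second</p><figure/><note>n2</note><note>n3</note>
</body></text></TEI>"""

if __name__ == "__main__":
    for number, counts in enumerate(regions_per_page(SAMPLE), start=1):
        print(number, dict(counts))
```

Counts like these are only a weak, page-level signal. Turning them into supervision for a layout model would also require aligning each region with its location on the page image, which this sketch does not attempt.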

research
12/12/2022

Page Layout Analysis of Text-heavy Historical Documents: a Comparison of Textual and Visual Approaches

Page layout analysis is a fundamental step in document processing which ...
research
03/23/2022

Robust Text Line Detection in Historical Documents: Learning and Evaluation Methods

Text line segmentation is one of the key steps in historical document un...
research
02/14/2020

Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers

The massive amounts of digitized historical documents acquired over the ...
research
09/17/2021

Including Keyword Position in Image-based Models for Act Segmentation of Historical Registers

The segmentation of complex images into semantic regions has seen a grow...
research
06/12/2023

Document Layout Annotation: Database and Benchmark in the Domain of Public Affairs

Every day, thousands of digital documents are generated with useful info...
research
04/15/2020

An Evaluation of DNN Architectures for Page Segmentation of Historical Newspapers

One important and particularly challenging step in the optical character...
research
02/16/2022

Processing the structure of documents: Logical Layout Analysis of historical newspapers in French

Background. In recent years, libraries and archives led important digiti...
