docExtractor: An off-the-shelf historical document element extraction

12/15/2020
by Tom Monnier, et al.

We present docExtractor, a generic approach for extracting visual elements such as text lines or illustrations from historical documents without requiring any real data annotation. We demonstrate that it provides high-quality performance as an off-the-shelf system across a wide variety of datasets and, when fine-tuned, leads to results on par with the state of the art. We argue that the performance obtained without fine-tuning on a specific dataset is critical for applications, particularly in digital humanities, and that the line-level page segmentation we address is the most relevant for a general-purpose element extraction engine. We rely on a fast generator of rich synthetic documents and design a fully convolutional network, which we show generalizes better than a detection-based approach. Furthermore, we introduce a new public dataset, dubbed IlluHisDoc, dedicated to the fine evaluation of illustration segmentation in historical documents.
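To make the idea of training without real annotation concrete, here is a minimal sketch of a synthetic page generator that emits an image together with its per-pixel labels. It is only a toy illustration under stated assumptions and not the generator described in the paper: the function names, label ids, page size, and layout rules are all hypothetical. The fully convolutional network that would be trained on such pairs is not shown.

```python
import random

# Hypothetical label ids, chosen for this sketch only.
BACKGROUND, ILLUSTRATION, TEXT_LINE = 0, 1, 2


def fill_rect(grid, top, left, height, width, value):
    """Set a rectangular region of a 2D list to `value`."""
    for y in range(top, top + height):
        for x in range(left, left + width):
            grid[y][x] = value


def make_synthetic_page(height=64, width=48, seed=0):
    """Return (image, mask): a toy grayscale page and its per-pixel labels.

    Assumption: one illustration block above a stack of text lines.
    The layout ranges below are tuned for the default page size.
    """
    rng = random.Random(seed)
    image = [[255] * width for _ in range(height)]  # white page
    mask = [[BACKGROUND] * width for _ in range(height)]

    # One gray illustration block in the upper part of the page.
    ill_h, ill_w = rng.randint(10, 16), rng.randint(16, 32)
    ill_top = rng.randint(2, 6)
    ill_left = rng.randint(2, width - ill_w - 2)
    fill_rect(image, ill_top, ill_left, ill_h, ill_w, 128)
    fill_rect(mask, ill_top, ill_left, ill_h, ill_w, ILLUSTRATION)

    # Text lines below the illustration, each drawn as a thin dark band.
    y = ill_top + ill_h + 3
    line_h, gap = 2, 3
    while y + line_h <= height - 2:
        line_w = rng.randint(width // 2, width - 8)
        fill_rect(image, y, 4, line_h, line_w, 0)
        fill_rect(mask, y, 4, line_h, line_w, TEXT_LINE)
        y += line_h + gap
    return image, mask


if __name__ == "__main__":
    image, mask = make_synthetic_page(seed=1)
    print(len(image), len(image[0]), sorted({v for row in mask for v in row}))
```

Because the labels are produced by the same code that draws the page, every training pair comes with exact ground truth at no annotation cost, which is the property the abstract relies on. A realistic generator would add fonts, backgrounds, noise, and varied layouts.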


research
04/10/2007

Text Line Segmentation of Historical Documents: a Survey

There is a huge amount of historical documents in libraries and in vario...
research
03/23/2022

Robust Text Line Detection in Historical Documents: Learning and Evaluation Methods

Text line segmentation is one of the key steps in historical document un...
research
06/07/2017

Learning to Extract Semantic Structure from Documents Using Multimodal Fully Convolutional Neural Network

We present an end-to-end, multimodal, fully convolutional network for ex...
research
09/05/2017

PageNet: Page Boundary Extraction in Historical Handwritten Documents

When digitizing a document into an image, it is common to include a surr...
research
09/04/2023

Prompt me a Dataset: An investigation of text-image prompting for historical image dataset creation using foundation models

In this paper, we present a pipeline for image extraction from historica...
research
09/05/2019

Deep Visual Template-Free Form Parsing

Automatic, template-free extraction of information from form images is c...
research
04/27/2018

dhSegment: A generic deep-learning approach for document segmentation

In recent years there have been multiple successful attempts tackling do...
