PageNet: Page Boundary Extraction in Historical Handwritten Documents

09/05/2017
by Chris Tensmeyer, et al.

When digitizing a document into an image, it is common to include a surrounding border region to visually indicate that the entire document is present in the image. However, this border should be removed prior to automated processing. In this work, we present a deep-learning-based system, PageNet, which identifies the main page region in an image in order to segment content from both textual and non-textual border noise. In PageNet, a Fully Convolutional Network obtains a pixel-wise segmentation, which is post-processed into the output quadrilateral region. We evaluate PageNet on 4 collections of historical handwritten documents, obtaining over 94% mean intersection over union on all datasets and approaching human performance on 2 of these collections. Additionally, we show that PageNet can segment documents that are overlaid on top of other documents.
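
For readers unfamiliar with the evaluation metric named in the abstract, the following is a minimal sketch of intersection over union (IoU) between a predicted page mask and a ground-truth mask, averaged over several images. It assumes binary masks stored as nested lists of 0/1; the function names `mask_iou` and `mean_iou` and the toy masks are illustrative only, and the paper's own evaluation code and protocol may differ.

```python
def mask_iou(pred, truth):
    """IoU of two same-sized binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                inter += 1  # pixel is page in both masks
            if p or t:
                union += 1  # pixel is page in at least one mask
    # Assumption: two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over an iterable of (pred, truth) mask pairs."""
    scores = [mask_iou(pred, truth) for pred, truth in pairs]
    return sum(scores) / len(scores)


# Toy example: the prediction misses the rightmost page column.
pred = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
truth = [
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]

print(mask_iou(pred, truth))                      # 4 shared pixels / 6 total
print(mean_iou([(pred, truth), (truth, truth)]))  # average over two images
```

In this toy case the two masks share 4 page pixels out of 6 covered by either mask, so the single-image score is 4/6. A quadrilateral output, as described in the abstract, would first need to be rasterized into such a mask before this comparison applies.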

research
10/02/2021

Asking questions on handwritten document collections

This work addresses the problem of Question Answering (QA) on handwritte...
research
06/07/2017

Learning to Extract Semantic Structure from Documents Using Multimodal Fully Convolutional Neural Network

We present an end-to-end, multimodal, fully convolutional network for ex...
research
10/10/2017

DocEmul: a Toolkit to Generate Structured Historical Documents

We propose a toolkit to generate structured synthetic documents emulatin...
research
01/27/2021

HDIB1M – Handwritten Document Image Binarization 1 Million Dataset

Handwritten document image binarization is a challenging task due to hig...
research
03/22/2017

Neural Ctrl-F: Segmentation-free Query-by-String Word Spotting in Handwritten Manuscript Collections

In this paper, we approach the problem of segmentation-free query-by-str...
research
04/27/2018

dhSegment: A generic deep-learning approach for document segmentation

In recent years there have been multiple successful attempts tackling do...
research
12/15/2020

docExtractor: An off-the-shelf historical document element extraction

We present docExtractor, a generic approach for extracting visual elemen...