Handwritten and Printed Text Segmentation: A Signature Case Study

by   Sina Gholamian, et al.

While analyzing scanned documents, handwritten text can overlay printed text. This causes difficulties during the optical character recognition (OCR) and digitization process of documents, and subsequently, hurts downstream NLP tasks. Prior research either focuses only on the binary classification of handwritten text, or performs a three-class segmentation of the document, i.e., recognition of handwritten, printed, and background pixels. This results in the assignment of the handwritten and printed overlapping pixels to only one of the classes, and thus, they are not accounted for in the other class. Thus, in this research, we develop novel approaches for addressing the challenges of handwritten and printed text segmentation with the goal of recovering text in different classes in whole, especially improving the segmentation performance on the overlapping parts. As such, to facilitate with this task, we introduce a new dataset, SignaTR6K, collected from real legal documents, as well as a new model architecture for handwritten and printed text segmentation task. Our best configuration outperforms the prior work on two different datasets by 17.9 7.3


page 3

page 8

page 12

page 16

page 17


StackMix and Blot Augmentations for Handwritten Text Recognition

This paper proposes a handwritten text recognition(HTR) system that outp...

Handling Heavily Abbreviated Manuscripts: HTR engines vs text normalisation approaches

Although abbreviations are fairly common in handwritten sources, particu...

Evaluation of a Region Proposal Architecture for Multi-task Document Layout Analysis

Automatically recognizing the layout of handwritten documents is an impo...

TMIXT: A process flow for Transcribing MIXed handwritten and machine-printed Text

Handling large corpuses of documents is of significant importance in man...

TransDocAnalyser: A Framework for Offline Semi-structured Handwritten Document Analysis in the Legal Domain

State-of-the-art offline Optical Character Recognition (OCR) frameworks ...

EASTER: Efficient and Scalable Text Recognizer

Recent progress in deep learning has led to the development of Optical C...

Open Set Classification of Untranscribed Handwritten Documents

Huge amounts of digital page images of important manuscripts are preserv...

Please sign up or login with your details

Forgot password? Click here to reset