Combining Morphological and Histogram based Text Line Segmentation in the OCR Context

03/16/2021
by Pit Schneider, et al.

Text line segmentation is one of the preprocessing stages of modern optical character recognition (OCR) systems. The algorithmic approach proposed in this paper was designed for exactly this purpose. Its main characteristic is the combination of two different techniques: morphological image operations and horizontal histogram projections. The method was developed for a historic data collection that commonly exhibits quality issues such as degraded paper, blurred text, or curved text lines. The segmenter could therefore be of particular interest to cultural institutions, such as libraries, archives, and museums, that need robust line bounding boxes for historic documents. Because of its promising segmentation results and low computational cost, the algorithm was incorporated into the OCR pipeline of the National Library of Luxembourg as part of an initiative to reprocess its historic newspaper collection. The general contribution of this paper is to outline the approach and to evaluate its gains in accuracy and speed against the segmentation algorithm bundled with the open source OCR software in use.
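
To make the general idea of pairing a morphological operation with a horizontal projection concrete, here is a minimal pure-Python sketch. It is not the authors' algorithm, which the abstract does not specify. The function names, the horizontal structuring element, the `radius` and `min_fill` parameters, and the binary list-of-lists image format are all assumptions chosen for illustration. The dilation closes gaps between characters so that text rows become nearly solid bands, and the row-wise projection then separates those bands from background rows and isolated specks.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed format: 1 = ink, 0 = background
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


def dilate_horizontal(img: Image, radius: int) -> Image:
    """Binary dilation with a 1 x (2*radius + 1) horizontal structuring element."""
    out = []
    for row in img:
        width = len(row)
        new_row = [0] * width
        for x, value in enumerate(row):
            if value:
                # Spread each ink pixel to its horizontal neighbours.
                for i in range(max(0, x - radius), min(width, x + radius + 1)):
                    new_row[i] = 1
        out.append(new_row)
    return out


def horizontal_projection(img: Image) -> List[int]:
    """Count ink pixels in every row (the horizontal histogram)."""
    return [sum(row) for row in img]


def segment_lines(img: Image, radius: int = 1, min_fill: float = 0.5) -> List[Box]:
    """Group consecutive rows whose dilated ink covers at least min_fill of the width.

    min_fill is assumed to be greater than 0.
    """
    if not img or not img[0]:
        return []
    width = len(img[0])
    profile = horizontal_projection(dilate_horizontal(img, radius))
    boxes: List[Box] = []
    start = None
    # The trailing 0 is a sentinel that closes a run reaching the last row.
    for y, count in enumerate(profile + [0]):
        is_text = count / width >= min_fill
        if is_text and start is None:
            start = y
        elif not is_text and start is not None:
            # Compute the horizontal extent from the original, undilated ink.
            cols = [x for row in img[start:y] for x, v in enumerate(row) if v]
            boxes.append((start, min(cols), y - 1, max(cols)))
            start = None
    return boxes
```

In this sketch the dilation is what makes the coverage threshold usable: a row of sparse strokes may fall below `min_fill` on its own, while an isolated speck stays below it even after dilation. Curved or touching lines, which the abstract names as typical difficulties, are beyond what this simplified version handles.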


