Automatic Page Segmentation Without Decompressing the Run-Length Compressed Text Documents

07/02/2020
by Mohammed Javed, et al.

Page segmentation is a crucial stage in the automatic analysis of documents with complex layouts. It has traditionally been carried out on uncompressed documents, although most real-world documents are stored in compressed form to make storage and transfer efficient. Carrying out page segmentation directly on compressed documents, without a decompression stage, is a challenging goal. This paper demonstrates the possibility of performing page segmentation directly on the run-length data of a CCITT Group-3 compressed text document, which may be single- or multi-columned and may contain text regions in inverted text color mode. Before the document is segmented into columns, each column into paragraphs, each paragraph into text lines, each line into words, and finally each word into characters, a pre-processing stage is required. This stage identifies the normal and inverted text regions and toggles the inverted regions to normal mode. To initiate column separation, a new strategy of incremental assimilation of white space runs in the vertical direction, together with auto-estimation of certain related parameters, is proposed, and a column-segmentation procedure employing these extracted parameters has been devised. A two-level horizontal row separation process then segments every column into paragraphs and, in turn, into text lines. Finally, a two-level vertical column separation process completes the separation into words and characters.
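The idea of operating on run-length data without decoding pixels can be illustrated with a minimal Python sketch. It assumes each scan line is available as a list of alternating white/black run lengths that always starts with a white run (possibly of length zero), which is the convention of Group-3 one-dimensional coding. All function names and the `min_gap` parameter are hypothetical. The interval intersection used for column gaps is a simplified stand-in for the incremental assimilation of vertical white space runs described above, not the authors' procedure, and `min_gap` is supplied by hand rather than auto-estimated.

```python
from typing import List, Tuple

Row = List[int]            # alternating run lengths, first run is white (may be 0)
Span = Tuple[int, int]     # half-open pixel interval [start, end)


def invert_row(runs: Row) -> Row:
    """Toggle black and white for one row without decoding any pixels."""
    if runs and runs[0] == 0:
        return runs[1:]        # row started with black: drop the empty white run
    return [0] + runs          # row started with white: it now starts with black


def white_intervals(runs: Row) -> List[Span]:
    """Return the pixel spans covered by the white runs of one row."""
    spans, pos = [], 0
    for i, length in enumerate(runs):
        if i % 2 == 0 and length > 0:   # even positions hold white runs
            spans.append((pos, pos + length))
        pos += length
    return spans


def intersect(a: List[Span], b: List[Span]) -> List[Span]:
    """Intersect two sorted lists of disjoint spans."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def column_gaps(rows: List[Row], width: int, min_gap: int) -> List[Span]:
    """White spans shared by every row, excluding the page margins."""
    common = [(0, width)]
    for runs in rows:
        common = intersect(common, white_intervals(runs))
    return [(s, e) for s, e in common if e - s >= min_gap and s > 0 and e < width]


def text_bands(rows: List[Row]) -> List[Span]:
    """Group consecutive non-blank rows into bands [first_row, last_row + 1)."""
    bands, start = [], None
    for y, runs in enumerate(rows):
        blank = all(length == 0 for length in runs[1::2])  # no black run present
        if not blank and start is None:
            start = y
        elif blank and start is not None:
            bands.append((start, y))
            start = None
    if start is not None:
        bands.append((start, len(rows)))
    return bands


# Toy page, 12 pixels wide: two text rows, one blank row, one text row.
row_a = [1, 4, 3, 3, 1]
row_b = [2, 2, 3, 4, 1]
page = [row_a, row_b, [12], row_a]

gaps = column_gaps(page, width=12, min_gap=2)
bands = text_bands(page)
```

On this toy page, the only interior white span shared by all rows separates the two columns, and the blank third row splits the rows into two bands. The same band-finding idea, applied with a larger blank-row threshold first and a smaller one second, would give a two-level split into paragraphs and then text lines, although the thresholds themselves are not specified here.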


