Document Decomposition of Bangla Printed Text

01/27/2017
by Md. Fahad Hasan, et al.

Information of every kind is being digitized today, and large archives of documents are being digitized along with it. Optical Character Recognition (OCR) is the method by which newspapers and other paper documents are converted into digital resources, but it works on text only. If we process a document that contains non-textual zones, the output is garbage text. To digitize documents properly, they must therefore be preprocessed carefully, and the most important preprocessing step is segmenting the document into regions by category. The OCR systems available for the Bangla language, however, have no algorithm that can fully categorize a newspaper or book page. We therefore worked on decomposing a document into its parts, such as headlines, sub-headlines, columns, and images. If the input is skewed or rotated, it is also deskewed and de-rotated. To decompose a Bangla document, we first find the edges of the input image. We then find the horizontal and vertical area in which each pixel lies, and cut the input image according to these areas. For each resulting sub-image we compute its height-width ratio and line height, and categorize the sub-image according to these values. To deskew the image, we find the skew angle and rotate the image back by that angle. To de-rotate the image, we use the line height, the matra line, and the pixel ratio of the matra line.
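
The cut-then-categorize idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the details of their edge-based procedure, so this sketch swaps in plain projection profiles on an already binarized page (1 = ink, 0 = background). The function names, the gap parameter, and every threshold (the ratio cutoff of 3 and the factor of 1.5 on line height) are hypothetical values chosen only for illustration.

```python
from typing import List, Tuple

Image = List[List[int]]          # 1 = ink pixel, 0 = background
Box = Tuple[int, int, int, int]  # (top, bottom, left, right), end-exclusive


def ink_runs(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return [start, end) ranges where the profile is non-zero.

    Runs separated by fewer than `min_gap` empty positions are merged.
    """
    runs: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)
    return merged


def split_regions(img: Image, min_gap: int = 2) -> List[Box]:
    """Cut the page into horizontal bands, then each band into vertical areas."""
    boxes: List[Box] = []
    row_profile = [sum(row) for row in img]
    for top, bottom in ink_runs(row_profile, min_gap):
        band = img[top:bottom]
        col_profile = [sum(col) for col in zip(*band)]
        for left, right in ink_runs(col_profile, min_gap):
            boxes.append((top, bottom, left, right))
    return boxes


def categorize(img: Image, box: Box, body_line_height: int) -> str:
    """Label a sub-image from its width/height ratio and line structure.

    All thresholds here are assumptions, not values from the paper.
    """
    top, bottom, left, right = box
    height, width = bottom - top, right - left
    rows = [sum(row[left:right]) for row in img[top:bottom]]
    if len(ink_runs(rows)) > 1:
        return "column"        # several text lines stacked vertically
    if width / height < 3:
        return "image"         # one solid block that is not line-shaped
    if height >= 1.5 * body_line_height:
        return "headline"      # a single line taller than body text
    return "text line"


def make_demo_page() -> Image:
    """Build a 14x20 toy page: a headline, a text column, and an image."""
    page = [[0] * 20 for _ in range(14)]
    for r in range(0, 3):              # headline across the full width
        page[r] = [1] * 20
    for r in (6, 8, 10, 12):           # four one-pixel-high text lines
        for c in range(0, 9):
            page[r][c] = 1
    for r in range(6, 14):             # solid square standing in for an image
        for c in range(12, 20):
            page[r][c] = 1
    return page


if __name__ == "__main__":
    page = make_demo_page()
    for box in split_regions(page):
        print(box, categorize(page, box, body_line_height=1))
```

On a real scan the body line height would have to be estimated from the page rather than passed in by hand, and skewed input would need to be deskewed first, since projection profiles only separate regions cleanly when text lines are horizontal.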

