Autonomous Cleaning of Corrupted Scanned Documents - A Generative Modeling Approach

01/12/2012
by Zhenwen Dai, et al.

We study the task of cleaning scanned text documents that are strongly corrupted by dirt such as manual line strokes or spilled ink. We aim to remove the dirt autonomously from a single letter-size page, based only on the information the page contains. Our approach therefore has to learn character representations without supervision, and it requires a mechanism to distinguish learned representations from irregular patterns. To learn character representations, we use a probabilistic generative model that parameterizes pattern features, feature variances, the features' planar arrangements, and pattern frequencies. The latent variables of the model describe pattern class, pattern position, and the presence or absence of individual pattern features. The model parameters are optimized using a novel variational EM approximation. After learning, the parameters represent planar feature arrangements and their variances, independently of their absolute position. A quality measure defined on the learned representation then allows regular character patterns to be discriminated autonomously from the irregular patterns that make up the dirt. The irregular patterns can thus be removed to clean the document. For a full Latin alphabet, we found that a single page does not contain sufficiently many character examples. However, we show that a page containing a lower number of character types, even if heavily corrupted by dirt, can be cleaned efficiently and autonomously based solely on the structural regularity of the characters it contains. Using examples with characters from different alphabets, we demonstrate the generality of the approach and discuss its implications for future developments.
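
The idea of separating regular patterns from dirt by a quality measure can be illustrated with a deliberately simplified sketch. It is not the model from the paper: it assumes the pattern classes are already learned, replaces the generative model with a diagonal Gaussian per class, and ignores pattern position and feature presence. The names `log_likelihood`, `quality`, and `clean`, the toy data, and the threshold are all hypothetical choices made for illustration.

```python
import math

def log_likelihood(patch, mean, var):
    """Diagonal-Gaussian log-likelihood of a flattened patch under one class.

    Stand-in for the paper's generative model (an assumption of this sketch).
    """
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(patch, mean, var)
    )

def quality(patch, classes):
    """Hypothetical quality score: the best log-likelihood over all classes."""
    return max(log_likelihood(patch, mean, var) for mean, var in classes)

def clean(patches, classes, threshold):
    """Keep patches that score at least `threshold`; treat the rest as dirt."""
    return [p for p in patches if quality(p, classes) >= threshold]

# Two toy "character" classes, each given as (feature means, feature variances).
classes = [
    ([1.0, 0.0, 1.0, 0.0], [0.05] * 4),
    ([0.0, 1.0, 0.0, 1.0], [0.05] * 4),
]

patches = [
    [0.9, 0.1, 1.0, 0.0],  # close to the first class
    [0.1, 1.0, 0.0, 0.9],  # close to the second class
    [0.5, 0.5, 0.5, 0.5],  # matches neither class well: treated as dirt
]

# The threshold is an arbitrary illustrative value, not one from the paper.
kept = clean(patches, classes, threshold=-5.0)
print(kept)
```

In this toy setting the two patches that resemble a learned class score well above the threshold and the third falls below it, so only the first two are kept. How the actual quality measure is defined and thresholded is described in the full text, not in this abstract.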
