uTHCD: A New Benchmarking for Tamil Handwritten OCR

03/13/2021
by   Noushath Shaffi, et al.
Handwritten character recognition has been a challenging research problem in document image analysis for many decades, owing to large variation in writing styles, inherent noise in the data, the wide range of applications it offers, and the non-availability of benchmark databases. Considerable work has been reported in the literature on the creation of databases for several Indic scripts, but work on the Tamil script is still in its infancy, with only one database reported [5]. In this paper, we present the creation of an exhaustive and large unconstrained Tamil Handwritten Character Database (uTHCD). The database consists of around 91000 samples, with nearly 600 samples in each of 156 classes. It is a unified collection of both online and offline samples. Offline samples were collected by asking volunteers to write on a form inside a specified grid. For online samples, volunteers wrote in a similar grid using a digital writing pad. The collected samples encompass a wide variety of writing styles and the inherent distortions arising from the offline scanning process, viz. stroke discontinuity, variable stroke thickness, distortion, etc. Algorithms that are resilient to such data can be practically deployed in real-time applications. The samples were collected from around 650 native Tamil volunteers, including schoolchildren, homemakers, university students, and faculty. The isolated character database will be made publicly available as raw images and as a compressed Hierarchical Data Format (HDF) file. With this database, we expect to set a new benchmark in Tamil handwritten character recognition and to provide a launchpad for many avenues in the document image analysis domain. The paper also presents an ideal experimental set-up using the database on convolutional neural networks (CNN) with a baseline accuracy of 88
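As a rough illustration of what a class-balanced experimental set-up on such a database could look like, the sketch below performs a stratified train/test split over 156 classes. It is not the authors' protocol: the 20% test fraction, the uniform 600 samples per class, the placeholder labels, and the helper name `stratified_split` are all assumptions made for this example, and the real per-class counts vary.

```python
import random
from collections import defaultdict

NUM_CLASSES = 156          # class count stated in the abstract
SAMPLES_PER_CLASS = 600    # assumption: "nearly 600" rounded up; real counts vary
TEST_FRACTION = 0.2        # assumption: the abstract does not state a split ratio


def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx) holding out the same fraction of every class."""
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Placeholder labels standing in for the real database index (hypothetical data).
labels = [c for c in range(NUM_CLASSES) for _ in range(SAMPLES_PER_CLASS)]
train_idx, test_idx = stratified_split(labels, TEST_FRACTION)
print(len(train_idx), len(test_idx))
```

Splitting per class rather than over the whole pool keeps every one of the 156 classes represented in both partitions, which matters when per-class sample counts are modest.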

research
08/17/2013

Development of Comprehensive Devnagari Numeral and Character Database for Offline Handwritten Character Recognition

In handwritten character recognition, benchmark database plays an import...
research
06/05/2022

Two Decades of Bengali Handwritten Digit Recognition: A Survey

Handwritten Digit Recognition (HDR) is one of the most challenging tasks...
research
06/30/2010

Classification Of Gradient Change Features Using MLP For Handwritten Character Recognition

A novel, generic scheme for off-line handwritten English alphabets chara...
research
06/28/2020

Offline Handwritten Chinese Text Recognition with Convolutional Neural Networks

Deep learning based methods have been dominating the text recognition ta...
research
03/30/2010

Development of a multi-user handwriting recognition system using Tesseract open source OCR engine

The objective of the paper is to recognize handwritten samples of lower ...
research
12/24/2018

Writer-Aware CNN for Parsimonious HMM-Based Offline Handwritten Chinese Text Recognition

Recently, the hybrid convolutional neural network hidden Markov model (C...
research
03/30/2010

Recognition of Handwritten Textual Annotations using Tesseract Open Source OCR Engine for information Just In Time (iJIT)

Objective of the current work is to develop an Optical Character Recogni...
