BanglaWriting: A multi-purpose offline Bangla handwriting dataset

11/15/2020 ∙ by M. F. Mridha, et al. ∙ 0

This article presents a Bangla handwriting dataset named BanglaWriting that contains single-page handwriting samples from 260 individuals of different personalities and ages. Each page includes bounding boxes that enclose each word, along with the Unicode representation of the writing. The dataset contains 21,234 words and 32,787 characters in total, including 5,470 unique words of Bangla vocabulary. Apart from the usual words, the dataset comprises 261 comprehensible and 450 incomprehensible instances of overwriting. All bounding boxes and word labels were generated manually. The dataset can be used for complex optical character/word recognition, writer identification, and handwritten word segmentation. Furthermore, it is suitable for studying age-based and gender-based variation in handwriting.

1 Introduction

BanglaWriting, the dataset presented in this paper, aims to provide a single handwriting dataset that supports several tasks at once: word segmentation, writer identification, handwriting-to-text conversion, and extraction of writer information (age and gender). Its construction and usage differ from those of the usual Bangla datasets [biswas2017banglalekha]. The currently available datasets for Bangla writing include only isolated character writings, whereas the BanglaWriting dataset contains word-level writing with bounding boxes. The dataset is modeled on well-known offline handwriting and writer recognition datasets [marti2002iam].

Figure 1: Graphemes are the smallest unit of meaningful writing. A grapheme always contains a grapheme root. In the Bangla writing system, a grapheme may have one vowel diacritic and one consonant diacritic. Occasionally, a grapheme may include a consonant conjunct as its grapheme root.

The BanglaWriting dataset contains single-page handwriting samples from 260 unique writers. It consists of 5,470 unique words and 124 unique characters, and comprises 21,234 words and 32,787 characters in total. The dataset contains Bangla characters, numerals, diacritics, and conjuncts. It also includes punctuation marks and English letters mixed with the Bangla writing. Fig. 1 illustrates the underlying construction of a Bangla word.

2 Value of the Data

  • The dataset currently stands as the world’s first and largest openly accessible offline Bangla handwriting dataset.

  • The dataset is suitable for machine learning [michie1994machine] models, deep learning [lecun2015deep] models, producing embedding vectors [ohi2020autoembedder] of handwriting, etc.

  • This is the first approach that aims to exploit the full potential of Bangla handwriting data [marti2002iam]. The dataset contains a bounding-box annotation and a Unicode representation for each handwritten word, and writer information for each document. Therefore, the dataset is suitable for word segmentation, optical character recognition, and writer identification.

  • The dataset contains raw images (without any pre-processing) of each document. It also includes a supplementary pre-processing script that suppresses uneven lighting and noise.

  • The dataset can be used to explore writing patterns related to age and gender.

3 Specifications

Subject: Computer Vision and Pattern Recognition
Specific subject area: Optical character recognition, word segmentation, writer identification
Type of data: Image, JSON
How data were acquired: The images of the handwriting were captured using scanners and smartphone cameras. Each handwriting image was cropped and annotated manually.
Data format: Raw data, converted data, annotations
Parameters for data collection:
  Scanner: HP Scanjet 2400
  Smartphone camera: Xiaomi Redmi 6, Xiaomi Redmi 7
  A single image contains the handwriting of one individual. Each individual is identified using age, gender, and a unique person id. The handwritten words are segmented using bounding boxes. Each bounding box is labeled with the characters written inside it. Labelme [labelme2016] software is used to draw and label the bounding boxes.
Description of data collection: The writing was done using regular stationery products. Writers were advised to write on a random topic. Only one page of writing was collected from each individual. The handwriting was then captured using scanners and smartphone cameras. Each captured image was cropped and annotated manually.
Data source location:
  Institution: Bangladesh University of Business & Technology
  Town: Dhaka and Kishoreganj
  Country: Bangladesh
Data accessibility:
  Repository name: Mendeley
  Data identification number: 10.17632/r43wkvdk4w.1
  Direct URL to data: https://data.mendeley.com/datasets/r43wkvdk4w/1
Related research article: None

Figure 2: The left image was obtained using a scanner, whereas the right image was captured using a smartphone camera. The color variation between the two capture methods is large. The scanned images mostly show a glare effect, while images taken with a smartphone camera suffer from lighting issues.

4 Experimental Design, Materials and Methods

4.1 Data Collection

The dataset contains random handwriting of individuals. Fig. 2 illustrates two data examples. The first image was captured using a scanner, whereas the second was captured using a smartphone camera. The dataset contains raw images to reflect the real-world challenges of lighting. However, a script that corrects the lighting issues and enhances the writing features is also provided with the dataset. It is discussed in the Supplementary Script subsection.

The dataset was collected from the students of Bangladesh University of Business and Technology. To obtain a broader age distribution, the students’ household members were also included. Fig. 3 illustrates the age and gender distribution of the population.

Figure 3: The left graph shows the age distribution, and the right graph shows the gender distribution of the dataset.

Each individual was asked to write on a random topic. Therefore, the number of words varies from document to document. Fig. 4 shows the distribution of words per document.

Figure 4: The left graph shows the distribution of the number of words per document. The right graph shows the same distribution without outliers. The word-count histogram approximates a normal distribution.

4.2 Data Preprocessing

Each image was cropped and strengthened manually. The images were named using the pattern uniquePersonIdentifier_age_gender (the same pattern appears in the "imagePath" field shown in Fig. 6). The uniquePersonIdentifier field is a number that identifies a unique writer. The age field gives the age of the writer. The gender field encodes whether the writer is male or female.
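The naming scheme can be split back into its fields with a few lines of Python. The following is a minimal sketch, assuming the underscore-separated pattern uniquePersonIdentifier_age_gender.jpg shown in Fig. 6. The function name, the WriterInfo type, and the sample file name are hypothetical, and the gender code is kept as a raw string because its encoding is not assumed here.

```python
from pathlib import Path
from typing import NamedTuple


class WriterInfo(NamedTuple):
    writer_id: str  # unique person identifier
    age: int        # writer's age
    gender: str     # raw gender code, left uninterpreted (encoding not assumed)


def parse_image_name(image_path: str) -> WriterInfo:
    """Split 'uniquePersonIdentifier_age_gender.jpg' into its three fields."""
    stem = Path(image_path).stem                   # drop directory and extension
    writer_id, age, gender = stem.rsplit("_", 2)   # split from the right
    return WriterInfo(writer_id, int(age), gender)


# Hypothetical file name, used only to illustrate the pattern.
info = parse_image_name("17_23_0.jpg")
print(info.writer_id, info.age, info.gender)
```

Splitting from the right keeps the age and gender fields intact even if an identifier were to contain an underscore itself.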

Figure 5: The left image shows a handwriting image with bounding boxes. The labels/words for each bounding box are presented on the right. The excluded word (second row, second word) is marked using an asterisk (*).

4.3 Data Labeling

The dataset was manually annotated using the labelme [labelme2016] software. Fig. 5 illustrates the word-level bounding boxes and the Unicode text label for each bounding box. The figure also demonstrates the annotation policy adopted for overwriting and cropped words/characters. Comprehensible overwriting and cropped words were labeled with the correct text, and an asterisk (*) was added to the label to distinguish them. Incomprehensible and negated words, however, were marked using only an asterisk.
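The asterisk policy can be turned into a small filter when preparing training data. The sketch below is an illustration only: the category names are hypothetical, and because the position of the asterisk within a label is not stated here, the code accepts it anywhere in the label.

```python
from typing import Tuple


def classify_label(label: str) -> Tuple[str, str]:
    """Return (category, text) for one bounding-box label.

    Categories (illustrative names, not part of the dataset):
      'clean'      - ordinary word without an asterisk
      'flagged'    - comprehensible overwriting or cropped word; text is kept
      'unreadable' - incomprehensible or negated word; only an asterisk
    """
    text = label.replace("*", "").strip()
    if "*" not in label:
        return "clean", text
    if text:
        return "flagged", text
    return "unreadable", ""


for raw in ["বাংলা", "বাংলা*", "*"]:
    print(repr(raw), classify_label(raw))
```

A recognition model might train on 'clean' and 'flagged' words while skipping 'unreadable' boxes; a segmentation model can use all three, since every box still marks a word region.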

{
    "shapes": [
        {"label": "wordLabel",
         "points": [[xmin, ymin], [xmax, ymax]]
        },
        {"label": "wordLabel",
         "points": [[xmin, ymin], [xmax, ymax]]
        },
        ...
    ],
    "imagePath": "uniquePersonIdentifier_age_gender.jpg",
    "imageHeight": Xpx,
    "imageWidth": Ypx
}
Figure 6: The JSON structure that stores the bounding-box and label information for each handwriting image.

The bounding-box and label information for each image was saved in a separate JSON file. Fig. 6 illustrates the standard JSON parameters that were generated for each image. The "shapes" property contains an array of "label" and "points" parameter pairs. The "label" parameter contains the word written inside the bounding box (encoded in UTF-8), whereas the "points" parameter contains an array of the starting and ending pixel coordinates of the bounding box. The "imagePath", "imageHeight", and "imageWidth" parameters contain additional information: the filename of the corresponding image and the height and width of the image, respectively.
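Reading one such annotation requires only the standard json module. The following is a minimal sketch under the structure shown in Fig. 6. The WordBox type, the function name, and every value in the sample document are hypothetical and serve only to make the example runnable.

```python
import json
from typing import List, NamedTuple


class WordBox(NamedTuple):
    label: str
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def read_annotation(json_text: str) -> List[WordBox]:
    """Parse one annotation document into a list of labeled word boxes."""
    doc = json.loads(json_text)
    boxes = []
    for shape in doc["shapes"]:
        (x1, y1), (x2, y2) = shape["points"]
        # Normalize corners so min/max hold even if a box was drawn in reverse.
        boxes.append(WordBox(shape["label"],
                             min(x1, x2), min(y1, y2),
                             max(x1, x2), max(y1, y2)))
    return boxes


# Hypothetical annotation with made-up coordinates and labels.
sample = """
{
  "shapes": [
    {"label": "আমার", "points": [[12, 30], [95, 70]]},
    {"label": "বাংলা*", "points": [[180, 72], [110, 28]]}
  ],
  "imagePath": "17_23_0.jpg",
  "imageHeight": 400,
  "imageWidth": 600
}
"""

for box in read_annotation(sample):
    print(box)
```

For files on disk, the text would be read with the file opened using encoding="utf-8" so that the Bangla labels decode correctly.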

Figure 7: The left column shows the raw images, and the right column shows the script’s output images. The supplementary script removes lighting and glare issues and produces clearer textual patterns in the images.

4.4 Supplementary Script

As the dataset contains raw images taken using scanners and smartphones, differences in lighting and background noise are noticeable (illustrated in Fig. 2). Hence, the dataset includes a supplementary Python [rossum1995python] and OpenCV [opencv_library] based script [banglawriting_script2020] that eliminates lighting issues and reduces the background noise. The script further enhances the images and produces images suitable for machine learning and deep learning datasets. An example of the script’s output is illustrated in Fig. 7.
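The supplementary script itself is not reproduced in this text. As an illustration of the general idea only, and not of the authors' method, the sketch below shows background division, one common way to reduce uneven lighting: each pixel is divided by an estimate of the local paper brightness. It is written in plain Python on a tiny grayscale grid so that it runs without dependencies. All names, the window radius, and the sample values are assumptions.

```python
from typing import List

Gray = List[List[int]]  # grayscale image as rows of 0-255 intensities


def local_mean(img: Gray, r: int) -> List[List[float]]:
    """Estimate the background as the mean over a (2r+1) x (2r+1) window."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            rows = range(max(0, y - r), min(h, y + r + 1))
            cols = range(max(0, x - r), min(w, x + r + 1))
            vals = [img[j][i] for j in rows for i in cols]
            out[y][x] = sum(vals) / len(vals)
    return out


def normalize_lighting(img: Gray, r: int = 1) -> Gray:
    """Divide each pixel by its local background and rescale to 0-255."""
    bg = local_mean(img, r)
    return [
        [min(255, round(255 * p / max(b, 1.0))) for p, b in zip(row, bg_row)]
        for row, bg_row in zip(img, bg)
    ]


# Hypothetical 3x3 patch: bright paper with one dark ink pixel in the center.
page = [[200, 200, 200],
        [200, 50, 200],
        [200, 200, 200]]
clean = normalize_lighting(page)
for row in clean:
    print(row)
```

After division, paper pixels saturate toward white regardless of how brightly the page was lit, while ink pixels stay dark. On real images the same idea is usually applied with a blurred copy of the image as the background estimate, using a much larger window than the one in this toy example.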

Acknowledgments

The authors would like to thank the Advanced Machine Learning (AML) lab and the Bangladesh University of Business and Technology (BUBT) for sharing their resources and for their valuable suggestions.

References