BanglaWriting, the dataset presented in this paper, aims to provide a comprehensive Bangla handwriting dataset, enriched along every dimension. The dataset targets word segmentation, writer identification, handwriting-to-text conversion, and even writer-information (age and gender) extraction within a single resource. Its construction and intended usage differ from the usual Bangla datasets [biswas2017banglalekha]: the currently available Bangla datasets contain only isolated character samples, whereas BanglaWriting contains word-level writing with bounding boxes. The dataset is modeled on well-known offline handwriting and writer-recognition datasets [marti2002iam].
The BanglaWriting dataset contains single-page handwriting from 260 unique writers. It consists of 5,470 unique words and 124 unique characters, amounting to 21,234 words and 32,787 characters in total. The dataset covers Bangla characters, numerals, diacritics, and conjuncts, along with punctuation marks and English letters mixed into the Bangla writing. For better understanding, Fig. 1 explicates the underlying construction of a Bangla word.
2 Value of the Data
To the best of our knowledge, the dataset currently stands as the first and largest openly accessible word-level offline Bangla handwriting dataset.
This is the first approach that exploits the full potential of Bangla handwriting [marti2002iam]. The dataset contains bounding-box annotations for each handwritten word, a Unicode representation of each written word, and writer information for each document. Therefore, the dataset is suitable for word segmentation, optical character recognition, and writer identification.
The dataset contains raw images (without any pre-processing) of each document, along with supplementary pre-processing scripts that remove excess lighting artifacts and noise.
The dataset can be used to explore writing patterns related to age and gender.
| Subject | Computer Vision and Pattern Recognition |
| --- | --- |
| Specific subject area | Optical character recognition, word segmentation, writer identification |
| Type of data | Image |
| How data were acquired | The images of the handwriting were captured using scanners and smartphone cameras. Each handwriting image was cropped and annotated manually. Scanner: HP Scanjet 2400. Smartphone cameras: Xiaomi Redmi 6, Xiaomi Redmi 7 |
| Data format | Raw |
| Parameters for data collection | A single image contains the handwriting of one individual. Each individual is identified by age, gender, and a unique person id. The handwritten words are segmented using bounding boxes, each containing the characters written inside it. The Labelme [labelme2016] software was used to draw and label the bounding boxes. |
| Description of data collection | The writings were produced using regular stationery products. Writers were advised to write on a random topic, and only one page of writing was collected from each individual. The handwriting was then captured using scanners and smartphone cameras, and each captured image was cropped and annotated manually. |
| Data source location | Institution: Bangladesh University of Business & Technology. City/Town: Dhaka and Kishoreganj. Country: Bangladesh |
| Data accessibility | Repository name: Mendeley Data. Data identification number: 10.17632/r43wkvdk4w.1. Direct URL to data: https://data.mendeley.com/datasets/r43wkvdk4w/1 |
| Related research article | None |
4 Experimental Design, Materials and Methods
4.1 Data Collection
The dataset contains free-form handwriting of individuals. Fig. 2 illustrates two data examples: the first image was captured using a scanner, whereas the second was captured using a smartphone camera. The dataset retains raw images to illustrate the real-world challenges of uneven lighting. However, a script shipped with the dataset eliminates the lighting issues and enhances the writing features; it is discussed later in the Supplementary Script subsection.
The dataset was collected from the students of Bangladesh University of Business and Technology. Furthermore, to generate a better age distribution of the dataset, the students’ household members were also included. Fig. 3 illustrates the age and gender distribution of the population.
Each individual was asked to write on a topic of their choice. Therefore, each document contains a varying number of words. Fig. 4 represents the word distribution per document.
4.2 Data Preprocessing
Each image was cropped and enhanced manually. The images were named using a fixed formula that encodes a unique writer id, the writer's age, and a gender flag. The writer id is unique to each writer, the age field records the writer's age, and the gender flag distinguishes male from female writers.
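As an illustration, writer metadata can be recovered directly from such filenames. The sketch below assumes a hypothetical layout in which the three fields are underscore-separated, `"<writer_id>_<age>_<gender>.jpg"`, with the gender flag `1` for male and `0` for female; the exact separators and flag values are defined by the dataset's naming formula and should be checked against the released files.

```python
# Hypothetical parser for the writer metadata encoded in each filename.
# Assumption (not from the paper): fields are underscore-separated as
# "<writer_id>_<age>_<gender>.jpg", with gender flag 1 = male, 0 = female.
from pathlib import Path

def parse_writer_metadata(filename):
    """Split a dataset filename into (writer_id, age, gender) fields."""
    writer_id, age, gender_flag = Path(filename).stem.split("_")
    gender = "male" if int(gender_flag) == 1 else "female"
    return int(writer_id), int(age), gender

print(parse_writer_metadata("17_23_1.jpg"))  # -> (17, 23, 'male')
```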
4.3 Data Labeling
The dataset was manually annotated using the labelme [labelme2016] software. Fig. 5 illustrates the word-based bounding boxes and the Unicode text label for each bounding box. The figure also demonstrates the annotation policy adopted for overwritten and cropped words/characters. Comprehensible overwritten and cropped words were labeled correctly, with an asterisk (*) appended to the label to distinguish them. Incomprehensible and crossed-out words were marked with an asterisk only.
The bounding-box and label information for each image is saved in a separate JSON file. Fig. 6 illustrates the standard JSON parameters generated for each image. The "shapes" property contains an array of "label" and "points" parameter pairs. The "label" parameter holds the written word (in UTF-8 encoded Unicode) inside the bounding box, whereas the "points" parameter holds the starting and ending pixel coordinates of the bounding box. The "imagePath", "imageHeight", and "imageWidth" parameters carry additional information: the filename of the corresponding image, and the height and width of the image, respectively.
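The JSON structure described above can be read back with a few lines of Python. The sketch below builds a synthetic labelme-style annotation (the field values are illustrative, not taken from the dataset) and extracts each legible word together with its bounding-box corners, skipping entries marked with a bare asterisk per the labeling policy:

```python
import json

# A minimal labelme-style annotation mirroring the parameters described above
# (values are illustrative, not taken from the actual dataset).
annotation = {
    "shapes": [
        {"label": "আমার", "points": [[120, 85], [310, 140]]},
        {"label": "*", "points": [[320, 85], [400, 140]]},  # incomprehensible word
    ],
    "imagePath": "1_25_1.jpg",
    "imageHeight": 2400,
    "imageWidth": 1800,
}

def extract_words(doc):
    """Return (label, (x0, y0, x1, y1)) for every legible word."""
    words = []
    for shape in doc["shapes"]:
        if shape["label"] == "*":          # skip incomprehensible words
            continue
        (x0, y0), (x1, y1) = shape["points"]
        words.append((shape["label"], (x0, y0, x1, y1)))
    return words

doc = json.loads(json.dumps(annotation))   # round-trip, as if read from a file
print(extract_words(doc))                  # -> [('আমার', (120, 85, 310, 140))]
```

The extracted coordinate tuples can then be used to crop individual word images for segmentation or recognition experiments.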
4.4 Supplementary Script
As the dataset contains raw images taken using scanners and smartphones, differences in lighting and background noise are noticeable (illustrated in Fig. 2). Hence, the dataset includes a supplementary Python [rossum1995python] and OpenCV [opencv_library] based script [banglawriting_script2020] that eliminates lighting issues and reduces background noise. The script further refines the images, producing output suitable for machine learning and deep learning pipelines. An example of the script's output is illustrated in Fig. 7.
The authors would like to thank the Advanced Machine Learning (AML) lab and the Bangladesh University of Business and Technology (BUBT) for sharing resources and for their valuable suggestions.