Gender classification by handwriting is a well-studied problem, which rests on the assumption that a writer’s gender can be predicted from their handwriting. Although there has been a considerable amount of research on this subject, it is still considered a challenging problem. In fact, neither computerized analysis nor human examiners have achieved highly accurate results for this task so far.
The common assumption is that various demographic properties, e.g., gender, handedness (i.e., whether the person is left- or right-handed), age bracket, and ethnicity, can be learned by studying the discriminative features of a person’s handwriting. Indeed, human handwriting is used to examine and investigate human characteristics in a variety of applications, such as mail sorting [bouadjenek2014local], bank check verification [bandi2005writer, bouadjenek2014local], personality profiling [shackleton1994european, king2000illusory], historical document analysis [ahmed2017improving], and criminological/forensic investigations [bouadjenek2014local, bouadjenek2015age].
Most of the recent approaches to gender classification by handwriting have been trained and tested on the same handful of datasets, such as the IAM on-line [IAMOnDB], QUWI [al2012quwi], KHATT [mahmoud2014khatt], and MSHD [djeddi2014lamis] datasets. The motivation of this paper is twofold: (1) to propose an improved gender classification method, and (2) to augment the current pool of handwriting datasets in a significant manner. Specifically, we propose a new convolutional neural network (CNN) variant for the gender classification task, which is relatively simple, efficient, and accurate. We also present a fairly large and diverse dataset, the Hebrew-English Bar-Ilan University (HEBIU) offline handwriting dataset, which consists of 810 Hebrew and English handwriting samples collected from a group of 405 participants. The new dataset allows for extended research and comparative studies on the classification of various attributes of interest. Our results are comparable to those reported by previous methods, and they are substantially better than the accuracy rates obtained by human examiners on our HEBIU dataset.
2 Related Work
Several machine learning techniques have been applied during the past two decades to the handwriting gender classification task. These approaches are typically based on feature extraction followed by the training of a classifier; see Table 1 below (extended from Gattal et al. [gattal2018gender]) for an overview.
| Study | Features | Classifier | Dataset | Accuracy |
|---|---|---|---|---|
| Cha et al. [cha2001priori] (2001) | A set of macro and micro features | ANN | CEDAR [hull1994database] | 70.20% |
| Liwicki et al. [liwicki2007automatic] (2011) | Combination of online & offline features | GMM | IAM-OnDB [IAMOnDB] | 65.57% |
| Youssef et al. [youssef2013automated] (2013) | Gradient & WD-LBP | SVM | QUWI [al2012quwi] | 74.30% |
| Al-Maadeed et al. [al2014automatic] (2014) | Geometric | Random forests | QUWI [al2012quwi] | 73% |
| Bouadjenek et al. [bouadjenek2014local] (2014) | HoG & LBP | SVM | IAM-OnDB [IAMOnDB] | 74% |
| Siddiqi et al. [siddiqi2015automatic] (2015) | Orientation curvature & legibility | SVM | QUWI [al2012quwi] & MSHD [djeddi2014lamis] | 68.75%/73.02% |
| Mirza et al. [mirza2016gender] (2016) | Gabor filters & Fourier transform | Feed-forward neural network | QUWI [al2012quwi] | 70% |
| Akbari et al. [akbari2017wavelet] (2017) | Wavelet sub-bands | SVM/ANN | QUWI [al2012quwi] & MSHD [djeddi2014lamis] | 80% |
| Ahmed et al. [ahmed2017improving] (2017) | Textural | Ensemble of classifiers | QUWI [al2012quwi] | 79%–85% |
| Gattal et al. [gattal2018gender] (2018) | Oriented basic image features | SVM | QUWI [al2012quwi] | 68%–76% |
| Morera et al. [morera2018gender] (2018) | Word separation | CNN | IAM [IAMOnDB] & KHATT [mahmoud2014khatt] | 80.72%/68.9% |
Cha et al. [cha2001priori] trained an artificial neural network (ANN) to classify demographic sub-categories (such as gender, handedness, and age group) using their own uppercase letter dataset. Later, they extended their work [bandi2005writer] to train a feed-forward neural network for feature extraction and classification, using enhancement techniques such as bagging and boosting. Their improved gender classifier achieved an accuracy rate of 77.5% using 800 writing samples for training and 400 samples for testing.
Liwicki et al. [liwicki2007automatic] applied support vector machines (SVM) and Gaussian mixture models (GMM) to gender classification on the IAM-OnDB handwriting dataset. Their classifier achieved accuracy rates of 62% and 67%, respectively, using SVM and GMM.
Youssef et al. [youssef2013automated] proposed using wavelet domain local binary patterns (WD-LBP) to train several SVM classifiers on both English and Arabic handwritings. Their classifier achieved an accuracy rate of 74.3% on (a subset of) the QUWI dataset.
Al-Maadeed et al. [al2014automatic] proposed using geometric features to classify age, gender, and nationality. Their proposed method applies random forests and kernel discriminant analysis for both text-dependent and text-independent classifications (i.e., same/different texts, respectively, of different writers are used for training and testing). Their classifier achieved an overall accuracy of 73% on the QUWI dataset.
Bouadjenek et al. [bouadjenek2014local] proposed extracting local descriptors, such as histogram of oriented gradients (HoG), local binary patterns (LBP), and grid features for offline handwriting, and then classifying them by SVM. Their method achieved an accuracy rate of 74% on the IAM offline dataset. Likewise, Bouadjenek et al. [bouadjenek2015age] used local descriptors, such as gradient local binary patterns (GLBP) and HoG to train an SVM classifier to predict age, gender, and handedness. Their classifier achieved accuracy rates in the range of 69%–74% on the IAM-OnDB and KHATT datasets.
Similarly, Siddiqi et al. [siddiqi2015automatic] enhanced handwriting features by computing local and global features (e.g., inclination, texture, curvature, legibility, etc.), which are then used in ANN and SVM classifiers to distinguish between genders. Their classifier achieved accuracy rates of 68.75% and 73.02%, respectively, on the QUWI and MSHD datasets.
Mirza et al. [mirza2016gender] concentrated on the visual appearance of handwriting to investigate its effect on a writer’s gender. They extract textural information by applying a bank of Gabor filters to handwriting images from the QUWI dataset. They then use the mean and standard deviation of each handwriting plus its Fourier transform as input features for a feed-forward neural network. Their classifier achieved an accuracy rate of 70% on the QUWI dataset.
Akbari et al. [akbari2017wavelet] extracted a feature vector based on a series of wavelet sub-bands quantized to produce a probabilistic finite state automaton. This feature vector is then used to train ANN and SVM classifiers on the QUWI and MSHD datasets, and perform text-dependent and text-independent, as well as script-dependent and script-independent classifications (i.e., same/different languages, respectively, used for training and testing). They also introduced cross-database evaluations.
To enhance accuracy rates on the gender task, Ahmed et al. [ahmed2017improving] used bagging, voting, and stacking of various classifiers based on some of the textural features mentioned earlier. They achieved accuracy rates in the range of 79%–85% on (a subset of) the QUWI dataset.
Gattal et al. [gattal2018gender] proposed using textural information from handwriting as the discriminative attribute between genders. They used image binarization and oriented basic image features. Their classifier achieved accuracy rates of 71%, 76%, and 68% on the QUWI dataset, according to the protocols of ICDAR 2013, ICDAR 2015, and ICFHR 2016, respectively.
Finally, Morera et al. [morera2018gender] were the first to apply a deep CNN for classifying a writer’s demographics. They proposed the same architecture for both gender and handedness, as well as an architecture for the combined 4-class problem. Their gender classifier achieved accuracy rates of 80.72% and 68.9%, respectively, on the IAM-OnDB and KHATT datasets.
To summarize, most of the surveyed methods exploit knowledge about the domain to extract certain features from the above datasets, and then train a machine learning module to classify these extracted features. In contrast, we present in this work a deep learning module, which performs essentially automated feature extraction and classification, in a rather simple and efficient manner (requires no tedious preprocessing, and is far less complex than the system reported, e.g., by Morera et al. [morera2018gender]).
3 Proposed Method
3.1 The HEBIU Offline Handwriting Dataset
Our newly generated dataset, the Hebrew-English Bar-Ilan University (HEBIU) offline handwriting dataset, contains 810 Hebrew and English handwriting samples of 405 participants from Israel. Each participant received a standard form, and was asked to write certain texts in Hebrew and English without any writing restrictions (e.g., pen type, pressure, etc.). In addition, each contributor was asked to provide personal data, such as gender, age, height, handedness, native language, country of birth, religion, education level, and profession.
Each such form was scanned by a 300dpi HP OfficeJet Pro 8710, in color mode and JPEG format, at a high resolution of .
The added value of our newly presented HEBIU dataset lies in the fact that it contains (also) hundreds of labeled writing samples in Hebrew, as well as diverse personal information per each participant. Thus, additional tasks, such as writer identification/verification and the classification of various demographic characteristics from handwriting samples, can be further pursued with such data.
3.2 Handwriting Preprocessing
As previously mentioned, our HEBIU dataset contains 810 Hebrew and English handwriting samples of 214 males and 191 females (i.e., of a total of 405 participants). Thus, to keep the data balanced, we excluded from the dataset, as part of preprocessing, 23 of the male forms.
In addition, the data should be normalized to be compatible with the network’s architecture. Therefore, the first step was to extract a portion of the page which contains handwritten text, and convert it to a grayscale image. Afterwards, in order to enhance our data, we generated random patches for each form, of size , with (possible) overlaps between patches. A patch can be either a square or a rectangle. A square patch is meant to extract a whole subsection of words, while a rectangular patch is used to extract a line of text (or part of it), a single word, a writing sequence, etc. Both cases are illustrated in Fig. 1.
Having experimented extensively with the number of patches, as well as patch types and patch sizes, we converged eventually on patches per handwritten sample and squared patches of size pixels (i.e., ). To keep the computational effort feasible, the patches were downscaled by 75% to pixels. (Similarly, the originally extracted rectangular patches of size were downscaled to .)
Naturally, some of the generated patches were blank or contained only a small amount of text. To avoid selecting such sparse patches, we conducted a series of experiments to determine a threshold on the minimum ratio between the number of black pixels and the total number of pixels in a given patch. This threshold was then used to select only patches which contained a sufficient amount of data. Note that eventually we extracted 200 valid patches per form.
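The patch sampling and sparsity filter described above can be sketched as follows. This is only a minimal illustration in plain Python, not the exact pipeline used in this work: the function names, the darkness cutoff `black_below=128`, the `min_ratio` value, and the toy page and patch sizes are all hypothetical placeholders, since the tuned threshold and the actual patch dimensions are not restated here.

```python
import random

def ink_ratio(patch, black_below=128):
    """Fraction of pixels in a grayscale patch darker than `black_below` (assumed cutoff)."""
    total = sum(len(row) for row in patch)
    dark = sum(1 for row in patch for px in row if px < black_below)
    return dark / total

def crop(image, top, left, height, width):
    """Return the height x width sub-image whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]

def sample_valid_patches(image, height, width, n_patches, min_ratio,
                         rng, max_attempts=100_000):
    """Draw random (possibly overlapping) patches and keep only the non-sparse ones."""
    rows, cols = len(image), len(image[0])
    patches = []
    for _ in range(max_attempts):
        if len(patches) == n_patches:
            break
        top = rng.randint(0, rows - height)    # inclusive bounds
        left = rng.randint(0, cols - width)
        patch = crop(image, top, left, height, width)
        if ink_ratio(patch) >= min_ratio:      # reject blank or nearly blank patches
            patches.append(patch)
    return patches

# Toy "page": white background (255) with a dark horizontal band standing in for text (0).
page = [[0 if 8 <= r < 12 else 255 for _ in range(40)] for r in range(20)]
patches = sample_valid_patches(page, height=8, width=8, n_patches=5,
                               min_ratio=0.25, rng=random.Random(0))
```

Passing different `height` and `width` values yields the rectangular (line-like) patches, while equal values yield square ones.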
3.3 Network Architecture
Our proposed network architecture is a CNN variant which inputs a grayscale patch and outputs the gender prediction. It comprises a total of four convolutional layers, followed by a single fully-connected layer and a softmax output layer, where all of the filters used are of the same size. More precisely, the first two layers consist of 64 and 128 filters, respectively, followed by a max pooling layer with a dropout of 0.4. The next two layers have the same structure. The network uses rectified linear units [nair2010rectified], an Adadelta optimizer, and a binary cross-entropy loss function.
3.4 Accuracy Evaluation by Patch Aggregation
We considered the following two classification measures, for a given handwriting sample:
Majority vote: The gender class is determined based on the majority of classified patches, where the classification of each patch depends on whether the corresponding softmax value exceeds 0.5.
Average softmax: The form is classified according to the average softmax value over the form’s 200 patches.
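The two aggregation rules can be illustrated with a short sketch. It assumes that each patch is summarized by a single softmax value for one of the two classes (labeled class 1 here, an arbitrary choice), and that a tie in the majority vote falls to class 0; neither convention is specified above, and the function names are hypothetical.

```python
def majority_vote(patch_probs, threshold=0.5):
    """Class 1 if more than half of the patches have a softmax value above the threshold."""
    votes = sum(1 for p in patch_probs if p > threshold)
    return 1 if votes > len(patch_probs) / 2 else 0  # ties go to class 0 (assumption)

def average_softmax(patch_probs, threshold=0.5):
    """Class 1 if the mean softmax value over all patches exceeds the threshold."""
    mean = sum(patch_probs) / len(patch_probs)
    return 1 if mean > threshold else 0

# Toy example with 5 patches instead of 200: two confident patches, three borderline ones.
probs = [0.9, 0.9, 0.45, 0.45, 0.45]
vote_label = majority_vote(probs)    # only 2 of 5 patches exceed 0.5
mean_label = average_softmax(probs)  # the mean is about 0.63
```

In this toy case the two rules disagree: the majority vote ignores how confident each patch is, whereas the average lets a few confident patches outweigh several borderline ones.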
4 Experimental Results
We divided the gender classification problem, in the context of this work, into three main types: (1) intra-language classification, where training and testing are conducted on the same language, (2) inter-language classification, where training is conducted on one language and testing on the other, and (3) mixed language classification, where both training and testing are conducted on both languages. For each type, we ran a 10-fold cross validation as follows. A fixed 20% of the data (i.e., the same 76 forms) was set aside for testing, while 70% (i.e., 268 forms) and 10% (i.e., 38 forms) of the data were allocated at random (from the remaining 80%) for training and validation, respectively.
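The split sizes follow from the 382 balanced forms (191 per gender, after excluding 23 of the 214 male forms). A minimal sketch of the protocol is given below; it reads the ten folds as ten random train/validation draws over a fixed test set, which is an interpretation of the description above. The function name, the seed, and the way the fixed test set is chosen are hypothetical, and no stratification by gender is attempted.

```python
import random

def make_splits(form_ids, n_runs=10, n_test=76, n_val=38, seed=0):
    """Fix one test set, then redraw train/validation sets at random for every run."""
    rng = random.Random(seed)
    ids = list(form_ids)
    rng.shuffle(ids)
    test = ids[:n_test]            # the same test forms in every run
    rest = ids[n_test:]            # the remaining ~80% of the forms
    runs = []
    for _ in range(n_runs):
        shuffled = rest[:]
        rng.shuffle(shuffled)
        val = shuffled[:n_val]     # ~10% of all forms
        train = shuffled[n_val:]   # ~70% of all forms
        runs.append((train, val))
    return test, runs

n_forms = 2 * 191                  # 214 - 23 = 191 male forms, 191 female forms
test, runs = make_splits(range(n_forms))
```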
4.1 Intra-Language Classification
Regarding intra-language classification, we obtained average accuracy rates of 73.02% and 75.26%, respectively, in the case of Hebrew-Hebrew (i.e., training and testing performed on Hebrew texts) and English-English (i.e., both training and testing done on English texts).
4.2 Inter-Language Classification
For inter-language classification, we achieved accuracy rates of 75.65% and 58.29%, respectively, in the case of Hebrew-English classification (i.e., training on a Hebrew handwriting and testing on an English one) and English-Hebrew classification (i.e., training on an English handwriting and testing on a Hebrew one).
One possible explanation for this asymmetry is that, since English is a second language in Israel (after Hebrew), the discriminative features between male and female handwriting are less prominent in English than in Hebrew, so generalizing from English to Hebrew becomes more challenging.
4.3 Mixed Language Classification
Enhancing our data by combining the texts of both languages yields an overall test accuracy of 77% for both languages; in particular, 74.61% and 79.34% accuracy rates when tested on Hebrew and English texts, respectively.
4.4 Summary of Results
Table 2 summarizes the results, providing average accuracy rates and standard deviations for each method.
4.5 Human Test Results
In order to compare our results with those of human examiners, we developed a mobile application that tests the accuracy of humans on the same task. The application was distributed among 153 females and 147 males; each of the 300 participants received 15 Hebrew and 15 English handwriting samples chosen at random (from our HEBIU dataset), and was asked to predict the gender of the writer of each examined text. The average classification accuracies for English and Hebrew handwriting were 63.6% and 66.2%, respectively (both with a standard deviation of 0.13). Females achieved slightly better results than males in both cases. Specifically, they obtained an accuracy of 64.8% (vs. 62.2%) for English, and an accuracy of 67.4% (vs. 65%) for Hebrew. No correlation between accuracy and either age group or education level was observed.
5 Concluding Remarks
In this paper, we proposed an automatic deep learning scheme for binary gender classification from handwriting images. Specifically, we presented a CNN variant for this task without “manual” feature selection/extraction. Our module is relatively simple, yet efficient, in terms of training speed and running time. We considered seven cross-language cases, including training on a Semitic language (Hebrew) and validation on a non-Semitic one (English), and vice versa. Our classification results are comparable to those of previous methods, and are significantly better than those obtained by human examiners on the same dataset.
In addition, we presented a new offline handwriting dataset (the HEBIU dataset), which contains hundreds of labeled handwriting samples in both Hebrew and English, including diverse demographic information.
Our future work will focus on predicting additional attributes of a given writer, e.g., handedness, age group, whether the text is written in the subject’s mother tongue, etc. In addition, we plan to apply our approach to other existing handwriting datasets and aim to enlarge our dataset by collecting more handwriting samples, possibly in additional languages.