Handwriting-Based Gender Classification Using End-to-End Deep Neural Networks

Handwriting-based gender classification is a well-researched problem that has been approached mainly by traditional machine learning techniques. In this paper, we propose a novel deep learning-based approach for this task. Specifically, we present a convolutional neural network (CNN), which performs automatic feature extraction from a given handwritten image, followed by classification of the writer's gender. Also, we introduce a new dataset of labeled handwritten samples, in Hebrew and English, of 405 participants. Comparing gender classification accuracy on this dataset against that of human examiners, we show that the proposed deep learning-based approach is substantially more accurate.




1 Introduction

Gender classification by handwriting is a well-studied problem, resting on the assumption that one’s gender can be predicted from one’s handwriting. Although there has been a considerable amount of research on this subject, it is still considered a challenging problem. In fact, neither computerized analysis nor human examiners have achieved highly accurate results for this task as of yet.

The common assumption is that various demographic properties can be learned by studying the discriminative features of a person’s handwriting, e.g., gender, handedness (i.e., whether the person is left-/right-handed), age bracket, ethnicity, etc. Indeed, human handwriting is used to examine and investigate human characteristics in a variety of applications, such as mail sorting [bouadjenek2014local], bank check verification [bandi2005writer, bouadjenek2014local], personality profiling [shackleton1994european, king2000illusory], historical document analysis [ahmed2017improving], and criminological/forensic investigations [bouadjenek2014local, bouadjenek2015age].

Most of the recent approaches to gender classification by handwriting have evolved mainly around the same few datasets, i.e., the training and testing of these methods have been confined typically to a handful of datasets, such as the IAM on-line [IAMOnDB], QUWI [al2012quwi], KHATT [mahmoud2014khatt], and MSHD [djeddi2014lamis] datasets. The motivation in this paper is mainly twofold: (1) Propose an improved gender classification method, and (2) augment the current pool of handwriting datasets in a significant manner. Specifically, we propose a new convolutional neural network (CNN) variant for the gender classification task, which is relatively simple, efficient, and accurate. Also, we present a fairly large and diverse dataset, the Hebrew-English Bar-Ilan University (HEBIU) offline handwriting dataset, which consists of 810 Hebrew and English handwriting samples, collected from a group of 405 participants. The newly-generated dataset would allow for extended research and comparative studies, regarding the classification of various attributes of interest. Our results are comparable to those reported by previous methods, and they are substantially better than the accuracy rates obtained by human examiners on our HEBIU dataset.

2 Related Work

Several machine learning techniques have been applied during the past two decades to the handwriting gender classification task. These approaches are typically based on feature extraction followed by the training of a classifier; see Table 1 below (extended from Gattal et al. [gattal2018gender]) for an overview.

Research | Features | Classifier | Dataset | Accuracy
Cha et al. [cha2001priori] (2001) | A set of macro and micro features | ANN | CEDAR [hull1994database] | 70.20%
Liwicki et al. [liwicki2007automatic] (2011) | Combination of online & offline features | GMM | IAM-OnDB [IAMOnDB] | 65.57%
Youssef et al. [youssef2013automated] (2013) | Gradient & WD-LBP | SVM | QUWI [al2012quwi] | 74.30%
Al-Maadeed et al. [al2014automatic] (2014) | Geometric | Random forests | QUWI [al2012quwi] | 73%
Bouadjenek et al. [bouadjenek2014local] (2014) | Local descriptors (HoG, LBP, grid features) | SVM | IAM (offline) | 74%
Siddiqi et al. [siddiqi2015automatic] (2015) | Orientation, curvature & legibility | SVM | QUWI [al2012quwi] & MSHD [djeddi2014lamis] | 68.75%/73.02%
Mirza et al. [mirza2016gender] (2016) | Gabor filters & Fourier transform | ANN | QUWI [al2012quwi] | 70%
Akbari et al. [akbari2017wavelet] (2017) | Wavelet sub-bands | SVM/ANN | QUWI [al2012quwi] & MSHD [djeddi2014lamis] | 80%
Ahmed et al. [ahmed2017improving] (2017) | Textural | Ensemble of classifiers | QUWI [al2012quwi] | 79%–85%
Gattal et al. [gattal2018gender] (2018) | Oriented basic image features | SVM | QUWI [al2012quwi] | 68%–76%
Morera et al. [morera2018gender] (2018) | Word separation | CNN | IAM [IAMOnDB] & KHATT [mahmoud2014khatt] | 80.72%/68.9%
Table 1: Overview of handwriting gender classification techniques.

Cha et al. [cha2001priori] trained an artificial neural network (ANN) to classify demographic sub-categories (such as gender, handedness, and age group) using their own uppercase-letter dataset. Later, they extended their work [bandi2005writer] to train a feed-forward neural network for feature extraction and classification, using enhancement techniques such as bagging and boosting. Their improved gender classifier achieved an accuracy rate of 77.5%, using 800 writing samples for training and 400 samples for testing.

Liwicki et al. [liwicki2007automatic] applied support vector machines (SVM) and Gaussian mixture models (GMM) to gender classification on the IAM-OnDB handwriting dataset. Their classifier achieved accuracy rates of 62% and 67%, respectively, using SVM and GMM.

Youssef et al. [youssef2013automated] proposed using wavelet domain local binary patterns (WD-LBP) to train several SVM classifiers on both English and Arabic handwritings. Their classifier achieved an accuracy rate of 74.3% on (a subset of) the QUWI dataset.

Al-Maadeed et al. [al2014automatic] proposed using geometric features to classify age, gender, and nationality. Their proposed method applies random forests and kernel discriminant analysis for both text-dependent and text-independent classifications (i.e., same/different texts, respectively, of different writers are used for training and testing). Their classifier achieved an overall accuracy of 73% on the QUWI dataset.

Bouadjenek et al. [bouadjenek2014local] proposed extracting local descriptors, such as histogram of oriented gradients (HoG), local binary patterns (LBP), and grid features for offline handwriting, and then classifying them by SVM. Their method achieved an accuracy rate of 74% on the IAM offline dataset. Likewise, Bouadjenek et al. [bouadjenek2015age] used local descriptors, such as gradient local binary patterns (GLBP) and HoG to train an SVM classifier to predict age, gender, and handedness. Their classifier achieved accuracy rates in the range of 69%–74% on the IAM-OnDB and KHATT datasets.

Similarly, Siddiqi et al. [siddiqi2015automatic] enhanced handwriting features by computing local and global features (e.g., inclination, texture, curvature, legibility, etc.), which are then used in ANN and SVM classifiers to distinguish between genders. Their classifier achieved accuracy rates of 68.75% and 73.02%, respectively, on the QUWI and MSHD datasets.

Mirza et al. [mirza2016gender] concentrated on the visual appearance of handwriting and its relation to a writer’s gender. They extract textural information by applying a bank of Gabor filters to handwriting images from the QUWI dataset. They then use the mean and standard deviation of each handwriting, plus its Fourier transform, as input features for a feed-forward neural network. Their classifier achieved an accuracy rate of 70% on the QUWI dataset.

Akbari et al. [akbari2017wavelet] extracted a feature vector based on a series of wavelet sub-bands quantized to produce a probabilistic finite state automaton. This feature vector is then used to train ANN and SVM classifiers on the QUWI and MSHD datasets, and perform text-dependent and text-independent, as well as script-dependent and script-independent classifications (i.e., same/different languages, respectively, used for training and testing). They also introduced cross-database evaluations.

To enhance accuracy rates on the gender task, Ahmed et al. [ahmed2017improving] used bagging, voting, and stacking of various classifiers based on some of the textural features mentioned earlier. They achieved accuracy rates in the range of 79%–85% on (a subset of) the QUWI dataset.

Gattal et al. [gattal2018gender] proposed using textural information from handwriting as the discriminative attribute between genders, based on image binarization and oriented basic image features. Their classifier achieved accuracy rates of 71%, 76%, and 68% on the QUWI dataset, according to the protocols of ICDAR 2013, ICDAR 2015, and ICFHR 2016, respectively.

Finally, Morera et al. [morera2018gender] were the first to apply a deep CNN for classifying a writer’s demographics. They proposed the same architecture for both gender and handedness, as well as an architecture for the combined 4-class problem. Their gender classifier achieved accuracy rates of 80.72% and 68.9%, respectively, on the IAM-OnDB and KHATT datasets.

To summarize, most of the surveyed methods exploit knowledge about the domain to extract certain features from the above datasets, and then train a machine learning module to classify these extracted features. In contrast, we present in this work a deep learning module, which performs essentially automated feature extraction and classification, in a rather simple and efficient manner (requires no tedious preprocessing, and is far less complex than the system reported, e.g., by Morera et al. [morera2018gender]).

3 Proposed Method

3.1 The HEBIU Offline Handwriting Dataset

Our newly generated dataset, the Hebrew-English Bar-Ilan University (HEBIU) offline handwriting dataset, contains 810 Hebrew and English handwriting samples of 405 participants from Israel. Each participant received a standard form and was asked to write certain texts in Hebrew and English without any writing restrictions (e.g., pen type, pressure, etc.). In addition, each contributor was asked to provide personal data, such as gender, age, height, handedness, native language, country of birth, religion, education level, and profession.

Each such form was scanned in color at 300 dpi by an HP OfficeJet Pro 8710, and saved in JPEG format at high resolution.

The added value of our newly presented HEBIU dataset lies in the fact that it contains (also) hundreds of labeled writing samples in Hebrew, as well as diverse personal information per each participant. Thus, additional tasks, such as writer identification/verification and the classification of various demographic characteristics from handwriting samples, can be further pursued with such data.

3.2 Handwriting Preprocessing

As previously mentioned, our HEBIU dataset contains 810 Hebrew and English handwriting samples of 214 males and 191 females (i.e., of a total of 405 participants). Thus, to keep the data balanced, we excluded from the dataset, as part of preprocessing, 23 of the male forms.

In addition, the data should be normalized to be compatible with the network’s architecture. Therefore, the first step was to extract a portion of the page which contains handwritten text, and convert it to a grayscale image. Afterwards, in order to enhance our data, we generated random fixed-size patches from each form, with possible overlaps between patches. A patch can be either a square or a rectangle. A square patch is meant to extract a whole subsection of words, while a rectangular patch is used to extract a line of text (or part of it), a single word, a writing sequence, etc. Both cases are illustrated in Fig. 1.

Having experimented extensively with the number of patches, as well as patch types and patch sizes, we eventually converged on a fixed number of square patches per handwritten sample. To keep the computational effort feasible, the patches were downscaled by 75%. (The originally extracted rectangular patches were downscaled similarly.)




Figure 1: Examples of resized text patches: (a)+(b) English and Hebrew squared patches, and (c)+(d) English and Hebrew rectangular patches.

Naturally, some of the generated patches were blank or contained small amounts of data. To avoid selecting such sparse text patches, we conducted a series of experiments to determine a threshold, based on the minimum ratio between black pixels and the total number of pixels in a given patch. This threshold was then used to select patches containing a sufficient amount of data. Note that we eventually extracted 200 valid patches per form.
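The patch-validity test above can be sketched as follows. This is a minimal illustration: the binarization cutoff of 128 and the minimum ink ratio of 0.05 are assumed values, since the paper does not report the threshold it converged on.

```python
import numpy as np

def ink_ratio(patch, black_thresh=128):
    """Fraction of pixels in a grayscale patch darker than black_thresh."""
    return float(np.mean(patch < black_thresh))

def is_valid_patch(patch, min_ratio=0.05, black_thresh=128):
    """Accept a patch only if it contains enough handwriting (ink) pixels."""
    return ink_ratio(patch, black_thresh) >= min_ratio

# Example: a blank (white) patch is rejected, while a patch containing
# a dark stroke passes the threshold test.
blank = np.full((100, 100), 255, dtype=np.uint8)
stroked = blank.copy()
stroked[40:60, :] = 0  # a horizontal "stroke" covering 20% of the patch
```

In practice, patches would be sampled repeatedly from each form until 200 of them pass this test.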

3.3 Network Architecture

Our proposed network architecture is a CNN variant which takes a grayscale patch as input and outputs a gender prediction. It comprises four convolutional layers, followed by a single fully-connected layer and a softmax output layer, where all of the convolution filters are of the same size. More precisely, the first two layers consist of 64 and 128 filters, respectively, followed by a max pooling layer with a dropout of 0.4. The next two layers have the same structure, followed by another max pooling layer with a dropout of 0.6. Finally, a fully-connected layer with 128 neurons was added, with a dropout of 0.5. The following hyper-parameters were picked: 20 epochs, the rectified linear unit (ReLU) activation function [nair2010rectified], the Adadelta optimizer, and a binary cross-entropy loss function.
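The architecture just described can be sketched in Keras roughly as follows. This is only an approximation of the paper's model: the 3x3 filter size, 2x2 pooling windows, 64x64 input resolution, and the assumption that the second convolutional block repeats the 64/128 filter counts are all guesses, since those values are not recoverable from the text.

```python
# Sketch of the described CNN; filter/pool/input sizes are assumptions.
from tensorflow import keras
from tensorflow.keras import layers

def build_model(input_shape=(64, 64, 1)):
    model = keras.Sequential([
        layers.Input(shape=input_shape),
        # First convolutional block: 64 and 128 filters.
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.4),
        # Second block, assumed to mirror the first ("same structure").
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.6),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        # Two-way softmax output for the binary gender prediction.
        layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adadelta",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

Training would then run for 20 epochs over the extracted patches, as stated above.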

3.4 Accuracy Evaluation by Patch Aggregation

We considered the following two classification measures, for a given handwriting sample:

  1. Majority vote: The gender class is determined based on the majority of classified patches, where the classification of each patch depends on whether the corresponding softmax value exceeds 0.5.

  2. Average softmax: The form is classified according to the average softmax value over the form’s 200 patches.
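The two aggregation rules can be written compactly. In the sketch below, `scores` is a hypothetical list of per-patch softmax values for one of the two classes (one value per patch of a form); the 0.5 decision threshold comes from the text.

```python
def majority_vote(scores, threshold=0.5):
    """Classify each patch by thresholding its softmax score,
    then take the majority class over all patches of the form."""
    votes = sum(1 for s in scores if s > threshold)
    return 1 if votes > len(scores) / 2 else 0

def average_softmax(scores, threshold=0.5):
    """Classify the form by the mean softmax score over all its patches."""
    return 1 if sum(scores) / len(scores) > threshold else 0

patch_scores = [0.9, 0.8, 0.2, 0.6, 0.4]  # hypothetical scores for one form
```

The two rules can disagree on borderline forms, e.g., when a few patches carry extreme scores; in the experiments below they yield very similar accuracies.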

4 Experimental Results

We divided the gender classification problem, in the context of this work, into three main types: (1) Intra-language classification, where training and testing are conducted on the same language, (2) inter-language classification, where training is conducted on one language and testing on the other, and (3) mixed language classification, where both training and testing are conducted on both languages. For each type, we ran a 10-fold cross validation as follows. A fixed 20% of the data (i.e., the same 76 forms) were set aside for testing, and 70% (i.e., 268 forms) and 10% (i.e., 38 forms) of the data, respectively, were allocated at random (from the remaining 80%) for training and validation.
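The split described above can be sketched as follows; only the split sizes (76/268/38 out of 382 forms) come from the text, while the form identifiers and seeding scheme are illustrative.

```python
import random

def split_forms(forms, seed=0):
    """Fixed 20% test set; the remaining 80% is reshuffled per fold into
    70% training and 10% validation (fractions of the full set)."""
    n = len(forms)
    n_test = round(0.2 * n)
    test = forms[:n_test]        # the same test forms in every fold
    rest = forms[n_test:]
    rng = random.Random(seed)    # vary the seed across the 10 folds
    rng.shuffle(rest)
    n_val = round(0.1 * n)
    return rest[n_val:], rest[:n_val], test  # train, val, test

forms = list(range(382))  # 405 participants minus the 23 excluded male forms
train, val, test = split_forms(forms)
```

Running this with ten different seeds reproduces the 10-fold scheme: the test set stays fixed while the train/validation partition varies.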

4.1 Intra-Language Classification

Regarding intra-language classification, we obtained average accuracy rates of 73.02% and 75.26%, respectively, in the case of Hebrew-Hebrew (i.e., training and testing performed on Hebrew texts) and English-English (i.e., both training and testing done on English texts).

4.2 Inter-Language Classification

For inter-language classification, we achieved accuracy rates of 75.65% and 58.29%, respectively, in the case of Hebrew-English classification (i.e., training on a Hebrew handwriting and testing on an English one) and English-Hebrew classification (i.e., training on an English handwriting and testing on a Hebrew one).

One possible explanation for this asymmetry is that since English is a second language in Israel (after Hebrew), the discriminative features between male and female handwriting are less prominent in English than in Hebrew, so generalizing from English to Hebrew becomes more challenging.

4.3 Mixed Language Classification

Enhancing our data by combining the texts of both languages yields an overall test accuracy of 77% for both languages; in particular, 74.61% and 79.34% accuracy rates when tested on Hebrew and English texts, respectively.

4.4 Summary of Results

Table 2 summarizes the results, providing average accuracy rates and standard deviations for each method.

Experiment | Train | Test | Method | Avg Accuracy | Std Dev | Min Accuracy | Max Accuracy
Intra-Language | HE | HE | Majority vote | 73.02% | 2.42 | 67.10% | 75.00%
Intra-Language | HE | HE | Avg. softmax | 72.89% | 2.34 | 67.10% | 75.00%
Intra-Language | EN | EN | Majority vote | 74.47% | 2.65 | 69.74% | 77.63%
Intra-Language | EN | EN | Avg. softmax | 75.26% | 2.47 | 71.05% | 77.63%
Inter-Language | HE | EN | Majority vote | 75.52% | 6.86 | 60.52% | 82.89%
Inter-Language | HE | EN | Avg. softmax | 75.65% | 7.40 | 57.89% | 82.89%
Inter-Language | EN | HE | Majority vote | 58.29% | 5.89 | 48.68% | 65.79%
Inter-Language | EN | HE | Avg. softmax | 58.29% | 6.20 | 48.68% | 68.42%
Mixed-Language | HE+EN | HE | Majority vote | 74.61% | 2.06 | 72.37% | 77.63%
Mixed-Language | HE+EN | HE | Avg. softmax | 73.82% | 2.36 | 68.42% | 76.32%
Mixed-Language | HE+EN | EN | Majority vote | 79.34% | 3.29 | 73.68% | 82.89%
Mixed-Language | HE+EN | EN | Avg. softmax | 79.21% | 3.15 | 73.68% | 81.58%
Mixed-Language | HE+EN | HE+EN | Majority vote | 75.13% | 2.52 | 71.05% | 78.95%
Mixed-Language | HE+EN | HE+EN | Avg. softmax | 75.13% | 2.10 | 71.05% | 77.63%

Table 2: Accuracy for gender classification types with 10-fold cross-validation (“HE” stands for Hebrew, and “EN” stands for English).

4.5 Human Test Results

In order to compare our results with those of human examiners, we developed a mobile application that tests the accuracy of humans on the same task. The application was distributed among 153 females and 147 males; each of the 300 participants received 15 Hebrew handwriting samples and 15 English handwriting samples chosen at random (from our HEBIU dataset), and was asked to predict the writer’s gender for each examined text. The average classification accuracies for English and Hebrew handwriting were 63.6% and 66.2%, respectively (both with a standard deviation of 0.13). Females achieved slightly better results than males in both cases; specifically, they obtained an accuracy of 64.8% (vs. 62.2%) for English, and an accuracy of 67.4% (vs. 65%) for Hebrew. No correlation between accuracy and either age group or education level was observed.

5 Concluding Remarks

In this paper, we proposed an automatic deep learning scheme for binary gender classification from handwriting images. Specifically, we presented a CNN variant for this task without “manual” feature selection/extraction. Our module is relatively simple, yet efficient, in terms of training speed and running time. We considered seven cross-language cases, including training on a Semitic language (Hebrew) and validation on a non-Semitic one (English), and vice versa. Our classification results are comparable to those of previous methods, and are significantly better than those obtained by human examiners on the same dataset.

In addition, we presented a new offline handwriting dataset (the HEBIU dataset), which contains hundreds of labeled handwriting samples in both Hebrew and English, including diverse demographic information.

Our future work will focus on predicting additional attributes of a given writer, e.g., handedness, age group, whether the text is written in the subject’s mother tongue, etc. In addition, we plan to apply our approach to other existing handwriting datasets and aim to enlarge our dataset by collecting more handwriting samples, possibly in additional languages.