Gender Detection on Social Networks using Ensemble Deep Learning

04/13/2020 ∙ by Kamran Kowsari, et al. ∙ University of Virginia

Analyzing the ever-increasing volume of posts on social media sites such as Facebook and Twitter requires improved information processing methods for profiling authorship. Document classification is central to this task, but the performance of traditional supervised classifiers has degraded as the volume of social media has increased. This paper addresses this problem in the context of gender detection through ensemble classification that employs multi-model deep learning architectures to generate specialized understanding from different feature spaces.




1 Introduction

From 2012 to 2017, the average time spent on social networking increased from 90 minutes to 135 minutes. Every second, approximately 6,000 tweets appear on Twitter, which amounts to about 350,000 tweets per minute, 500 million tweets per day, and 200 billion tweets per year. This volume demands increasingly sophisticated approaches to author profiling and classification.

Much of the recent work on automatic author profiling based on classification has involved supervised learning techniques such as classification trees, naïve Bayes, support vector machines (SVM), neural networks, and ensemble methods. Classification trees and naïve Bayes approaches provide good interpretability but tend to be less accurate than other methods.


Researchers from a variety of disciplines have produced work relevant to the approach described in this paper. We have organized this work into four areas: social networks, feature extraction, classification methods and techniques, and deep learning for classification.

Social Networks: Social networks are structures with nodes that represent people or entities within a social context, and whose edges represent interactions, influence, and collaboration between the entities [liben2007link]. They are dynamic platforms that change quickly, acquiring new nodes and edges that signify new interactions between the entities [liben2007link]. Social networks provide a basis for maintaining social relationships and for finding users with similar interests, outlooks, and causes [mislove2007measurement].

In recent years, researchers have attempted to perfect gender detection in the context of these networks. Social networks typically allow members to determine what name, race, and age they associate with their profile, but this self-reporting increases the likelihood of false identities within the network [peersman2011predicting]. These false identities can reduce the quality of analyses of content, intended audience, or structure.

Early work attempting to determine the gender of social network users relied primarily on text analysis and focused on psychological indicators of the gender of the author [heidarysafa2019women]. For example, Pennebaker and Graybeal in 2001 [Pennebaker2015] examined whether text analysis could reveal the personality traits and gender of several subjects. More recently, Peersman et al. in 2011 [peersman2011predicting], in a bid to identify and curb pedophiles who falsify their identities on social networks, used natural language processing on data from the Dutch platform Netlog to detect gender.

Figure 1: Pipeline of gender detection

Feature Extraction:

Feature extraction plays a prominent role in machine learning, especially for text, image, and video data. Text and many biomedical datasets are mostly unstructured data, from which we need to generate meaningful structures for use by machine learning algorithms. To take one early example, L. Krueger et al. in 1979 [krueger1979letter] introduced an effective method for feature extraction based on word counting to create a structure for statistical learning. Even earlier, H. Luhn [luhn1957statistical] introduced weighted values for each word. In 1988, G. Salton et al. [salton1988term] modified these word weights using frequency counts, producing term frequency-inverse document frequency (TF-IDF). TF-IDF vectors measure the number of times a word appears in a document, weighted by the inverse of how common the word is across documents. Although TF-IDF and word counting are simple and intuitive feature extraction methods, they do not capture relationships between words as sequences. More recently, T. Mikolov et al. [mikolov2013efficient] introduced an improved technique for feature extraction from text based on the concept of embedding, i.e., placing each word into a vector space according to its context. This approach to word embedding, called Word2Vec, solves the problem of representing contextual word relationships in a computable feature space. Building on these ideas, J. Pennington et al. in 2014 [pennington2014glove] developed GloVe, a learned vector-space representation of words released by the Stanford NLP group. The RMDL approach described in [Kowsari2018RMDL, Heidarysafa2018RMDL] uses GloVe and TF-IDF for feature extraction from textual data.

Classification Methods and Techniques: Over the last 50 years, many supervised learning classification techniques have been developed and implemented in software to accurately label data. For example, K. Murphy in 2006 [murphy2006naive] and I. Rish in 2001 [rish2001empirical] introduced the naïve Bayes classifier (NBC) as a simple approach to the more general supervised learning classification problem. This approach has provided a useful technique for text classification and information retrieval applications. As with most supervised learning classification techniques, NBC takes an input vector of numeric or categorical data values and produces a probability for each possible output label. The approach is fast and efficient for text classification, but NBC has important limitations: the order of the sequences in the text is not reflected in the output probability, because for text analysis naïve Bayes uses a bag-of-words approach to feature extraction. Another popular classification technique is the support vector machine (SVM), which has proven quite accurate over a wide variety of data. This technique constructs a set of hyperplanes in a transformed feature space. The transformation is not performed explicitly but rather through the kernel trick, which allows the SVM classifier to perform well when the relationships between the predictor and response variables in the data are highly nonlinear. A variety of approaches have been developed to further extend the basic methodology and obtain greater accuracy. C. Yu et al. in 2009 [yu2009learning] introduced latent variables into the discriminative model as a new structure for SVM, and S. Tong et al. in 2001 [tong2001support] added active learning using SVM for text classification. For large volumes of data and for datasets with a huge number of features (such as text), SVM implementations are computationally complex. Another technique that helps mitigate the computational complexity of SVM for classification tasks is the stochastic gradient descent classifier (SGDClassifier) [kabir2015bangla], which has been widely used in both text and image classification. SGDClassifier is an iterative model suited to large datasets; it is trained iteratively with the SGD optimizer.
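As a concrete illustration of this family of classifiers, the sketch below trains a linear model with SGD on TF-IDF features using scikit-learn; the four-document corpus and its labels are invented for the example and are not the paper's data.

```python
# Sketch: a linear classifier trained with stochastic gradient descent
# on TF-IDF features. The corpus and labels below are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

docs = ["free pizza tonight", "quarterly earnings report",
        "pizza and a movie", "board meeting earnings call"]
labels = ["casual", "business", "casual", "business"]

# hinge loss => a linear SVM fit by SGD, which scales to large corpora
clf = make_pipeline(TfidfVectorizer(),
                    SGDClassifier(loss="hinge", random_state=0))
clf.fit(docs, labels)
print(clf.predict(["pizza night"]))
```

The choice of `loss` switches the model family: `"hinge"` gives a linear SVM, while `"log_loss"` gives logistic regression trained by the same iterative optimizer.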

Deep Learning:

Neural networks derive their architecture as a relatively simple representation of the neurons in the human brain. They are essentially weighted combinations of inputs that pass through multiple non-linear functions. Neural networks are trained with an iterative learning method known as back-propagation together with an optimizer (such as stochastic gradient descent (SGD)). In recent years, many researchers have achieved state-of-the-art results using deep learning in domains such as social media and psychology [nobles2018identification], transportation [heidarysafa2018analysis], health [zhang2018patient2vec], and medical data processing [kowsari2019diagnosis].

Deep neural networks (DNN) are based on simple neural network architectures but contain multiple hidden layers. These networks have been widely used for classification. For example, D. Cireşan et al. in 2012 [ci2012multitraffic] used multi-column deep neural networks, which combine several DNN architectures, for classification tasks. Convolutional neural networks (CNN) provide a different architectural approach to learning with neural networks. The main idea of CNN is to use feed-forward networks with convolutional layers that include local and global pooling layers. A. Krizhevsky et al. in 2012 [krizhevsky2012imagenet] used a CNN with stacked convolutional layers over the image feature space. Another example of CNN in [lecun2015deep] showed excellent accuracy for image classification. This architecture can also be used for text classification, as shown in the work of [kim2014convolutional]; for text and sequences, convolutional layers are used with word embeddings as the input feature space. The final type of deep learning architecture is the recurrent neural network (RNN), where outputs from the neurons are fed back into the network as inputs for the next step. Some recent extensions to this architecture use gated recurrent units (GRUs) or long short-term memory (LSTM) units [hochreiter1997long]. These units help control the instability problems of the original network architecture. RNNs have been successfully used for natural language processing [mikolov2010recurrent]. Recently, Z. Yang et al. in 2016 [yang2016hierarchical] developed hierarchical attention networks for document classification. These networks have two important characteristics: a hierarchical structure and an attention mechanism at the word and sentence level.

New work has combined these three basic models of deep learning into novel techniques for enhancing accuracy and robustness. M. Turan et al. in 2017 [turan2017deep] and M. Liang et al. in 2015 [liang2015recurrent] implemented innovative combinations of CNN and RNN called recurrent convolutional neural networks (RCNN). K. Kowsari et al. in 2017 [kowsari2017HDLTex] introduced hierarchical deep learning for text classification (HDLTex), which combines these deep learning techniques in a hierarchical structure and improves document classification accuracy over traditional methods. The work in this paper builds on these ideas, specifically on [kowsari2017HDLTex], to provide a more general approach to supervised learning for classification.

2 Preprocessing

2.1 Text cleaning

2.1.1 Tokenization

Tokenization is a pre-processing method that breaks a stream of text into words, phrases, symbols, or other meaningful elements called tokens [gupta2015text]. The main goal of this step is the investigation of the words in a sentence [verma2014tokenization]. Both text classification and text mining require a parser that handles the tokenization of the documents; for example, consider the following sentence [aggarwal2018machine]:

After sleeping for four hours, he decided to sleep for another four.
In this case, the tokens are as follows [kowsari2019text]:

{ “After” “sleeping” “for” “four” “hours” “he” “decided” “to” “sleep” “for” “another” “four” }.
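A minimal tokenizer in this spirit can be sketched with a regular expression; note that this simple pattern keeps only alphabetic words and would need extending for numbers or intra-word punctuation.

```python
import re

def tokenize(text):
    """Break a stream of text into word tokens, dropping punctuation.
    A minimal sketch of the tokenization step described above."""
    return re.findall(r"[A-Za-z]+", text)

sentence = "After sleeping for four hours, he decided to sleep for another four."
print(tokenize(sentence))
# ['After', 'sleeping', 'for', 'four', 'hours', 'he', 'decided',
#  'to', 'sleep', 'for', 'another', 'four']
```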

2.1.2 Stop words

Texts and documents contain many words that are insignificant for classification algorithms, such as {“a”, “about”, “above”, “across”, “after”, “afterwards”, “again”, …}. The most common technique for dealing with these words is to remove them from the texts and documents [saif2014stopwords].
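This filtering step can be sketched as follows; the stop list below is a small illustrative subset rather than a complete list such as the ones shipped with NLP toolkits.

```python
# Minimal stop-word filtering sketch. STOP_WORDS here is a tiny
# illustrative subset, not a full stop-word list.
STOP_WORDS = {"a", "about", "above", "across", "after", "afterwards",
              "again", "for", "to", "he", "another"}

def remove_stop_words(tokens):
    # case-insensitive membership test against the stop list
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["After", "sleeping", "for", "four", "hours", "he",
          "decided", "to", "sleep", "for", "another", "four"]
print(remove_stop_words(tokens))
# ['sleeping', 'four', 'hours', 'decided', 'sleep', 'four']
```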

2.1.3 Capitalization

Text and document data have a diversity of capitalization, starting with the first word of each sentence. Since documents consist of many sentences, diverse capitalization can be problematic when classifying large documents. The most common approach for dealing with inconsistent capitalization is to reduce every letter to lower case. This technique projects all words in a text or document into the same feature space, but it causes problems for the interpretation of some words (e.g., "US" (United States of America) vs. "us" (pronoun)) [gupta2009survey]. Slang and abbreviation converters can help account for these exceptions [dalal2011automatic].
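A minimal sketch of case normalization with an exception list; the exception words here are illustrative, not a standard set.

```python
def normalize_case(tokens, exceptions=("US", "IT", "AI")):
    """Lower-case every token except those on an exception list, so that
    e.g. "US" (the country) is not collapsed into "us" (the pronoun).
    The exception list is illustrative and would be task-specific."""
    return [t if t in exceptions else t.lower() for t in tokens]

print(normalize_case(["The", "US", "economy", "surprised", "Us"]))
# ['the', 'US', 'economy', 'surprised', 'us']
```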

2.1.4 Stemming and Lemmatization

Words can come in different forms (e.g., the singular and plural noun forms), but the semantic meaning of each form might remain the same [spirovski2018comparison]. Stemming is one method for consolidating different forms of a word into the same feature space. Text stemming reduces variant word forms through linguistic processes such as affixation (e.g., the stem of the word "studying" is "study"). Lemmatization is an NLP process that replaces the suffix of a word with a different one, or removes it completely, to obtain the basic word form (lemma).
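A toy suffix-stripping stemmer illustrates the idea; production systems use proper algorithms such as Porter stemming or WordNet lemmatization rather than this simplistic rule.

```python
# Toy stemmer in the spirit described above: strip a few common suffixes,
# keeping at least a three-letter stem. Illustrative only.
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(stem("studying"))  # 'study'
print(stem("hours"))     # 'hour'
```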

2.2 Feature Extraction

2.2.1 Term Frequency-Inverse Document Frequency

Inverse document frequency (IDF) was proposed as a method to be used in conjunction with term frequency in order to lessen the effect of implicitly common words in the corpus. IDF assigns a higher weight to terms that appear in few documents and a lower weight to terms that appear in many. The combination of TF and IDF is well known as term frequency-inverse document frequency (TF-IDF). The TF-IDF weight of a term in a document is given in Equation 1:

W(d, t) = TF(d, t) * log(N / df(t))     (1)

Here N is the number of documents and df(t) is the number of documents in the corpus containing the term t. The first factor in Equation 1 improves recall while the second improves the precision of the word embedding [tokunaga1994text]. Although TF-IDF tries to overcome the problem of common terms in the document, it still suffers from other descriptive limitations: TF-IDF cannot account for the similarity between words in the document, since each word is independently represented by its own index. However, with the development of more complex models in recent years, new methods such as word embedding can incorporate concepts such as similarity of words and part-of-speech tagging.
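Equation 1 can be computed directly; the sketch below uses a three-document toy corpus and no smoothing.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute W(d, t) = TF(d, t) * log(N / df(t)) per document,
    following Equation 1. A minimal sketch without smoothing."""
    n_docs = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    return [{t: tf * math.log(n_docs / df[t])
             for t, tf in Counter(doc).items()}
            for doc in docs]

docs = [["sleep", "four", "hours"], ["sleep", "well"], ["four", "more"]]
weights = tf_idf(docs)
print(weights[0]["hours"])  # rare term -> log(3/1), higher weight
print(weights[0]["sleep"])  # common term -> log(3/2), lower weight
```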

2.2.2 Global Vectors for Word Representation (GloVe)

Another powerful word embedding technique that has been used for text classification is Global Vectors (GloVe) [pennington2014glove]. The approach is very similar to the Word2Vec method: each word is represented by a high-dimensional vector and trained on the surrounding words over a huge corpus. The pre-trained embedding used in many works was trained on 2 billion tweets comprising 27 billion tokens (uncased, available as 25-, 50-, 100-, and 200-dimensional vectors); GloVe embeddings are also trained over even bigger corpora, including Wikipedia and Common Crawl content. The objective function is as follows:

F(w_i − w_j, w~_k) = P_ik / P_jk     (2)

where w_i refers to the word vector of word i, and P_ik denotes the probability that word k occurs in the context of word i.
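Pre-trained GloVe vectors ship as plain text, one word per line; the sketch below parses that format and compares two vectors by cosine similarity. The two 4-dimensional vectors are made up for illustration (real GloVe files are 25- to 300-dimensional).

```python
import numpy as np

# Two made-up vectors in GloVe's plain-text format: word then components.
glove_txt = """good 0.1 0.2 0.3 0.4
nice 0.1 0.2 0.25 0.45
"""

embeddings = {}
for line in glove_txt.strip().split("\n"):
    word, *values = line.split()
    embeddings[word] = np.array(values, dtype=float)

def cosine(u, v):
    # cosine similarity: 1.0 for identical directions
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings["good"], embeddings["nice"]))  # close to 1
```

Loading a real file such as `glove.twitter.27B.25d.txt` follows the same loop, reading lines from the file instead of a string.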

Figure 2: Overview of RMDL: Random Multimodel Deep Learning for classification. The ensemble contains n random models: d random DNN classifiers and c random CNN classifiers, where n = d + c.

3 Methods

Our method is based on Random Multimodel Deep Learning (RMDL) for text classification [Kowsari2018RMDL, Heidarysafa2018RMDL], in which we use two different feature extraction techniques and ensemble deep learning to train the model. Random Multimodel Deep Learning is a novel technique that can be applied to any kind of dataset for classification. An overview of this technique is shown in Figure 2; it contains multiple deep neural networks (DNN), deep convolutional neural networks (CNN), and deep recurrent neural networks (RNN). The number of layers and nodes for all of these deep learning models is generated randomly (e.g., 9 random models in RMDL constructed from c CNNs and d DNNs, each of them unique due to the random creation).


ŷ_i = floor( 1/2 + (Σ_{j=1}^{m} y_ij) / m )     (3)

where m is the number of random models and y_ij is the output prediction of model j for data point i (Equation 3 is used for binary classification, y_ij ∈ {0, 1}). More generally, the output space uses a majority vote for the final prediction ŷ_i. Therefore, ŷ_i is given as follows:

ŷ_i = mode(y_i1, y_i2, ..., y_im)     (4)

where y_ij is the predicted label of document (or data point) x_i for model j and is defined as follows:

y_ij = argmax_k [ softmax(y*_ij) ]     (5)

After all deep learning models in the RMDL ensemble are trained, the final prediction is calculated using a majority vote over these models.
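The majority-vote step can be sketched as follows; the three models' binary predictions below are hypothetical, not trained outputs.

```python
from collections import Counter

def majority_vote(predictions):
    """RMDL-style ensemble prediction: each inner list holds one model's
    label predictions; the final output is the most common label per
    data point across models."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions)]

# Three hypothetical models' binary predictions for four data points.
model_preds = [[1, 0, 1, 1],
               [1, 1, 0, 1],
               [0, 0, 1, 1]]
print(majority_vote(model_preds))  # [1, 0, 1, 1]
```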

Figure 3: Random Multimodel Deep Learning (RMDL)

3.1 Deep Neural Networks

Deep neural networks (DNN) are designed to learn through multiple layers of connections, where each layer receives connections only from the previous layer and provides connections only to the next layer in the hidden part [kowsari2017HDLTex]. The input consists of the connection of the input feature space (as discussed in Section 2) to the first hidden layer of the DNN. The input layer may be constructed via TF-IDF, word embedding, or some other feature extraction method. The output layer has one node per class for multi-class classification, or only one node for binary classification. In this technique, each learning model is generated randomly (the number of nodes in each layer and the number of layers themselves are randomly assigned). The DNN implementation is a discriminatively trained model that uses the standard back-propagation algorithm with sigmoid (Equation 6) or ReLU [nair2010rectified] (Equation 7) as the activation function:

f(x) = 1 / (1 + e^(−x))     (6)

f(x) = max(0, x)     (7)

The output layer for multi-class classification should be a softmax function (as shown in Equation 8):

softmax(z)_j = e^(z_j) / Σ_{k=1}^{K} e^(z_k)     (8)

Given a set of example pairs (x, y), x ∈ X, y ∈ Y, the goal is to learn the relationship between the input and target spaces using hidden layers. In text classification applications, the input is a string that is vectorized from the raw text data.
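A forward pass of one such randomly sized DNN can be sketched in NumPy; the weights are untrained random values, so this only illustrates the random layer structure and the activations of Equations 6-8, not a fitted model.

```python
import numpy as np

np.random.seed(0)

def relu(x):
    return np.maximum(0, x)                      # Equation 7

def softmax(z):
    # numerically stable softmax over the last axis (Equation 8)
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n_features, n_classes = 20, 2
# Layer count and widths drawn at random, as in RMDL's random models.
widths = np.random.randint(32, 128, size=np.random.randint(2, 5))
dims = [n_features, *widths, n_classes]

x = np.random.randn(1, n_features)               # one feature vector
h = x
for i, (d_in, d_out) in enumerate(zip(dims[:-1], dims[1:])):
    h = h @ (np.random.randn(d_in, d_out) * 0.1)  # untrained weights
    if i < len(dims) - 2:                         # ReLU on hidden layers
        h = relu(h)
probs = softmax(h)                                # class probabilities
print(probs.shape, probs.sum())
```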

3.2 Convolutional Neural Networks (CNN)

A convolutional neural network (CNN) is a deep learning architecture that is commonly used for hierarchical document classification [jaderberg2016reading]. Although originally built for image processing, CNNs have also been used effectively for text classification [lecun2015deep]. In a basic CNN for image processing, an image tensor is convolved with a set of kernels of size d × d. The resulting convolution layers, called feature maps, can be stacked to provide multiple filters on the input. To reduce computational complexity, CNNs use pooling to shrink the size of the output from one layer to the next in the network. Different pooling techniques are used to reduce outputs while preserving important features [scherer2010evaluation].

The most common pooling method is max pooling, where the maximum element in the pooling window is selected. In order to feed the pooled output from the stacked feature maps to the next layer, the maps are flattened into one column. The final layers in a CNN are typically fully connected. In general, during the back-propagation step of a convolutional neural network, both the weights and the feature detector filters are adjusted. A potential problem that arises when using CNN for text classification is the number of 'channels' (the size of the feature space). While image classification applications generally have few channels (e.g., only 3 channels of RGB), the number of channels may be very large (e.g., 50K) for text classification applications [johnson2014effective], thus resulting in very high dimensionality. The green box on the left side of Figure 3 illustrates the CNN architecture for text classification, which contains a word embedding input layer, 1D convolutional layers, 1D pooling layers, fully connected layers, and finally the output layer.
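The 1D convolution and max-pooling operations at the heart of this architecture can be sketched on a toy scalar sequence; real text models convolve over sequences of word-embedding vectors with many filters.

```python
import numpy as np

def conv1d(x, kernel):
    """Valid 1-D convolution (cross-correlation, as in deep learning
    frameworks) of a sequence with a single kernel."""
    k = len(kernel)
    return np.array([np.sum(x[i:i + k] * kernel)
                     for i in range(len(x) - k + 1)])

def max_pool1d(x, size):
    """Non-overlapping max pooling with the given window size."""
    return np.array([x[i:i + size].max()
                     for i in range(0, len(x) - size + 1, size)])

# Toy sequence of 8 scalar "embeddings" and one width-3 kernel.
seq = np.array([0.0, 1.0, 0.0, 2.0, 1.0, 0.0, 1.0, 3.0])
feature_map = conv1d(seq, np.array([1.0, -1.0, 1.0]))
pooled = max_pool1d(feature_map, 2)
print(feature_map)  # length 8 - 3 + 1 = 6
print(pooled)       # length 3
```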

Adam Optimizer

Adam is used as the stochastic gradient optimizer. It uses only the first two moments of the gradient (m and v, shown in Equations (9)-(12)) and calculates exponentially decaying averages over them. It can handle non-stationary objective functions, as in RMSProp, while overcoming the sparse gradient issue of RMSProp:

θ ← θ − (α / (sqrt(v̂_t) + ε)) m̂_t     (9)

g_t = ∇_θ J(θ_t, x_i, y_i)     (10)

m_t = β_1 m_{t−1} + (1 − β_1) g_t     (11)

v_t = β_2 v_{t−1} + (1 − β_2) g_t^2     (12)

where m_t is the estimated first moment and v_t the estimated second moment of the gradient, with bias-corrected versions m̂_t = m_t / (1 − β_1^t) and v̂_t = v_t / (1 − β_2^t).
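Equations (9)-(12) can be exercised on a one-dimensional toy problem; the quadratic objective below is illustrative only, not part of the paper's model.

```python
import numpy as np

def adam_minimize(grad, theta, alpha=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=300):
    """Adam updates per Equations (9)-(12): exponentially decaying
    first/second moment estimates with bias correction."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g        # first moment  (Eq. 11)
        v = beta2 * v + (1 - beta2) * g ** 2   # second moment (Eq. 12)
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # Eq. 9
    return theta

# Minimize f(theta) = (theta - 3)^2; its gradient is 2 * (theta - 3).
theta_star = adam_minimize(lambda th: 2 * (th - 3), theta=0.0)
print(theta_star)  # approaches 3
```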

Measure                            Value    Derivation
Sensitivity (recall)               0.8914   TP / (TP + FN)
Specificity                        0.8390   TN / (TN + FP)
Precision                          0.8272   TP / (TP + FP)
Negative Predictive Value          0.8992   TN / (TN + FN)
False Positive Rate                0.1610   FP / (FP + TN)
False Discovery Rate               0.1725   FP / (FP + TP)
False Negative Rate                0.1086   FN / (FN + TP)
Accuracy                           0.8633   (TP + TN) / (TP + TN + FP + FN)
F1 Score                           0.8583   2TP / (2TP + FP + FN)
Matthews Correlation Coefficient   0.7285   (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
Table 1: Method results

4 Results

4.1 Data

Gender detection in Twitter. Demographic traits such as gender and language variety have so far been investigated separately. PAN provides participants with a Twitter corpus annotated with authors' gender and the specific variation of their native language: English (Australia, Canada, Great Britain, Ireland, New Zealand, United States). In this research, we use a combination of Twitter gender-detection datasets [rangel2013overview, rangel2015overview] containing a large number of anonymous authors labeled with gender. The dataset is balanced by gender, with 3,600 people in the training set and 2,400 in the test set.

4.2 Evaluation and Experimental Results

Since the underlying mechanics of different evaluation metrics may vary, understanding what exactly each of these metrics represents and what kind of information they are trying to convey is crucial for comparability. Some examples of these metrics include recall, precision, accuracy, F-measure, micro-average, and macro average. These metrics are based on a “confusion matrix” that comprises true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN) 

[lever2016points]. The significance of these four elements may vary based on the classification application. The fraction of correct predictions over all predictions is called accuracy (Eq. 13). The fraction of known positives that are correctly predicted is called sensitivity, i.e., true positive rate or recall (Eq. 14). The fraction of correctly predicted negatives is called specificity (Eq. 15). The proportion of correctly predicted positives to all predicted positives is called precision, i.e., positive predictive value (Eq. 16):

Accuracy = (TP + TN) / (TP + TN + FP + FN)     (13)

Sensitivity = TP / (TP + FN)     (14)

Specificity = TN / (TN + FP)     (15)

Precision = TP / (TP + FP)     (16)


The F_β score is one of the most popular aggregated evaluation metrics for classifier evaluation [lever2016points]. The parameter β is used to balance recall and precision, and the score is defined as follows:

F_β = (1 + β^2) · (Precision · Recall) / (β^2 · Precision + Recall)     (17)


For the commonly used value of β, i.e., β = 1, recall and precision are given equal weights and Eq. 17 simplifies to:

F_1 = 2TP / (2TP + FP + FN)     (18)


Since F_1 is based only on recall and precision, it does not represent the confusion matrix fully. The Matthews correlation coefficient (MCC) [matthews1975comparison] captures all the data in a confusion matrix and measures the quality of binary classification methods. MCC can be used for problems with uneven class sizes and is still considered a balanced measure. MCC ranges from −1 to +1 (i.e., the classification is always wrong or always correct, respectively) and can be calculated as follows:

MCC = (TP · TN − FP · FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))     (19)


When comparing two classifiers, one may have a higher score using MCC while the other has a higher score using F_1; as a result, no single metric can capture all the strengths and weaknesses of a classifier [lever2016points].
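These metrics can be computed directly from the four confusion-matrix counts; the counts below are hypothetical, not the paper's results.

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Confusion-matrix metrics following Equations (13)-(19)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)              # sensitivity / TPR
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return accuracy, recall, specificity, precision, f1, mcc

# Hypothetical confusion matrix for illustration.
acc, rec, spec, prec, f1, mcc = classification_metrics(tp=90, fp=20,
                                                       fn=10, tn=80)
print(round(acc, 3), round(f1, 3), round(mcc, 3))
```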

Table 1 shows our results across these measures. The sensitivity of our model is 0.8914, and its specificity is 0.8390. The table also shows a negative predictive value (NPV) of 0.8992 and a false positive rate of 0.1610. Following the evaluation measures discussed in this section, the Matthews correlation coefficient (MCC) is 0.7285; finally, the accuracy and F1 score are 0.8633 and 0.8583, respectively.

4.3 Hardware and Framework

All of the results shown in this paper were produced on central processing units (CPU) and graphics processing units (GPU); the model can run on GPU only, CPU only, or both. The processor used in these experiments was an Intel Xeon E5-2640 (2.6 GHz) with 12 cores and 64 GB of DDR3 memory. The graphics cards on our machine are an Nvidia Quadro K620 and an Nvidia Tesla K20c.

This work is implemented in Python using Compute Unified Device Architecture (CUDA), a parallel computing platform and application programming interface (API) created by NVIDIA. We used the Keras and TensorFlow libraries for creating the neural networks [chollet2015keras].

5 Conclusion

Developing methods for reliable gender detection using text classification is increasingly important given the growing size of social text and other document sets. The techniques presented here demonstrate that semantic, syntactic, and word-frequency features can facilitate gender detection in social messages. This approach improves on existing practice by combining different feature extraction techniques and deep learning architectures trained as an ensemble. Additional training and testing with other structured document datasets will continue to identify the architectures that work best for these problems. It is also possible to extend this model by using additional deep learning architectures in the ensemble, instead of two models, to capture more of the complexity in text classification.