Web users actively generate content in various social media platforms. Modeling users by inferring their age and gender plays an important role in providing personalized services, viral marketing, recommender systems and tailored advertisements Nowson and Oberlander (2006). In addition to age and gender, previous work in the field of psychology has highlighted the value of identifying the personality traits of users as an aid in building adaptive and personalized systems to provide rich and improved user experiences Rodrigo de et al. (2011); Tkalcic and Chen (2015).
Various computational approaches for user profiling and for inferring age, gender and personality traits from user-generated content (UGC) have been proposed in recent years Rangel et al. (2015); Rothe et al. (2015); Farnadi et al. (2016); more details on related work are presented in Sections 4.2, 4.3, and 4.4. Many of these efforts aim at finding novel techniques to infer user profiles using only one type of information, such as the user’s textual posts. However, on many social media platforms, users can generate content in different modalities, such as textual content (e.g., status updates, blog posts, tweets, comments) and visual content (e.g., photos and videos), while connecting with each other, i.e., creating relational content. A framework that leverages all available information about users can learn more accurate user profiles. This is especially useful for platforms where not every user generates the same type of information, so that models trained on a single source of information fail to produce accurate user profiles. Examples include users who write status updates but never upload pictures, or users who join social media platforms only to consume knowledge and to relate with each other, rather than producing any textual or visual content themselves.
To address this, we propose a flexible user profiling framework that infers age, gender and Big Five personality traits using both UGC and social relational content. Our approach is based on a statistical relational framework with Hinge-loss Markov Random Fields (HL-MRFs) Bach et al. (2015). In particular, we use Probabilistic Soft Logic (PSL), a probabilistic programming language for defining HL-MRFs using weighted first-order logical rules, which makes them very expressive and suitable for modeling relational data such as social network graphs. Recently, PSL has been used successfully, with state-of-the-art results, for social network applications such as sentiment analysis in social networks West et al. (2014), social trust propagation Huang et al. (2013) and spam detection in social networks Fakhraei et al. (2015).
Related work on combining multiple sources of user data has focused either on the fusion of features Cui et al. (2010); You et al. (2016); Sakaki et al. (2014), known as early fusion, or on ensemble techniques that combine the results predicted from each source (i.e., majority voting or weighted majority voting) Sakaki et al. (2014), known as late fusion. Most of these efforts focus on combining various UGC sources and ignore social relational content when inferring user profiles. Existing approaches that incorporate social relational content with UGC mostly focus on finding informative features and combining them with relational content features such as communities Huang et al. (2015); Zhou et al. (2015). The graphical structure of social media platforms is a rich source of information on user behaviour which, when leveraged properly, is very valuable for user profiling. A few related works use social graphs to combine various sources of data for collective classification, but they focus on tasks other than user profiling, such as the hybrid recommender system of Kouki et al. (2015), which combines multiple sources of user rating activities. Our proposed user profiling framework not only incorporates social relational content to collectively predict user characteristics but also provides a mechanism to combine various other sources of user data for more accurate modeling of users.
In addition to incorporating multiple sources of information, we infer multiple characteristics of the users at the same time. Using PSL makes our framework interpretable, and it is easy to add rules to the PSL model to infer a new characteristic or, similarly, remove one from the model. Furthermore, our approach is flexible and lends itself easily to incorporating other sources of information beyond the ones we consider in this paper. Finally, our technique works well with missing data. In particular, it does not require the availability of user data for all the information sources that are considered in the model.
We evaluate our model on data from Facebook with more than 5,000 users. We use the users’ textual posts, profile pictures and the pages that they like to build their profile by inferring their age, gender and personality traits. For personality traits, we use a widely accepted model, the Big Five personality model, which consists of the following five traits: Openness to experience, Conscientiousness, Extraversion, Agreeableness, and Neuroticism Costa and McCrae (2008). Our experimental results show that our proposed HL-MRF model efficiently combines various sources of UGC to learn more accurate user profiles. To investigate whether the accuracy gain is due to the use of HL-MRFs or due to leveraging information from multiple sources simultaneously, we have trained a series of alternative models, including single-source logistic regression (LR) models, single-source HL-MRF models, majority voting ensembles of the single-source models, and multi-source LR models. Our proposed HL-MRF model outperforms them all. Our contributions include (1) a general and flexible framework to infer user characteristics from textual, visual, and relational social media content; (2) two interchangeable probabilistic graphical sub-models for user profiling based on user-item relations; (3) extensive experimental validation of the proposed models for predicting age, gender and personality traits of Facebook users based on their status updates, profile pictures and page likes.
The remainder of this paper is structured as follows: after reviewing the preliminaries of HL-MRFs and PSL in Section 2, in Section 3 we present our proposed model, including probabilistic graphical sub-models for inferring user characteristics from textual and visual content (Section 3.1) and sub-models for user profiling based on user-item relations (Section 3.2). An extensive experimental validation and empirical comparison of our proposed model with alternative models using Facebook data is presented in Section 4. Finally, we provide promising future directions of this work and conclude in Section 6.
2 HL-MRFs and PSL
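Hinge-loss Markov Random Fields (HL-MRFs) are a general class of conditional probabilistic models over continuous variables. HL-MRFs are log-linear probabilistic models that admit efficient, tractable inference. What makes them tractable is the use of hinge-loss potentials of the form:
\[ \phi(\mathbf{Y},\mathbf{X}) = \left(\max\{0,\ \ell(\mathbf{Y},\mathbf{X})\}\right)^{p} \tag{1} \]
where $\ell$ is a linear function of the sets of random variables $\mathbf{X}$ and $\mathbf{Y}$, and $p \in \{1,2\}$ allows defining linear or squared potentials. The variables in $\mathbf{X}$ and $\mathbf{Y}$ take on continuous values in the unit interval $[0,1]$. An HL-MRF defines a conditional probability density function over the random variables $\mathbf{Y}$, conditioned on the random variables $\mathbf{X}$, using a set of hinge-loss potentials $\phi_1,\dots,\phi_m$ as follows:
\[ P(\mathbf{Y}\mid\mathbf{X}) = \frac{1}{Z}\exp\left(-\sum_{r=1}^{m} w_r\,\phi_r(\mathbf{Y},\mathbf{X})\right) \tag{2} \]
where $Z = \int_{\mathbf{Y}} \exp\left(-\sum_{r=1}^{m} w_r\,\phi_r(\mathbf{Y},\mathbf{X})\right)$ is the normalization constant and the weights $w_r$ capture the importance of each potential in the model.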
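As a small illustration of these definitions, the sketch below evaluates a hinge-loss potential, i.e., the positive part of a linear function raised to the power 1 or 2, and the unnormalized density obtained by exponentiating the negative weighted sum of potentials. The function names are our own and the values of the linear functions are passed in directly; this is an assumption-laden sketch, not the PSL library implementation.

```python
import math

def hinge_loss_potential(ell, p=1):
    """Return max(0, ell)^p, with p = 1 (linear) or p = 2 (squared)."""
    if p not in (1, 2):
        raise ValueError("p must be 1 or 2")
    return max(0.0, ell) ** p

def unnormalized_density(weights, ells, p=1):
    """Return exp(-sum_r w_r * phi_r); dividing by Z would give P(Y | X)."""
    energy = sum(w * hinge_loss_potential(l, p) for w, l in zip(weights, ells))
    return math.exp(-energy)
```

A potential whose linear function is negative contributes nothing to the energy, so only violated constraints lower the density.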
Throughout this paper we use Probabilistic Soft Logic (PSL), a weighted first-order logical language for specifying HL-MRFs, which makes these powerful models interpretable, flexible and expressive. A PSL model consists of a set of PSL rules of the form:
\[ w_r : P_1 \wedge \dots \wedge P_i \rightarrow Q_1 \vee \dots \vee Q_j \tag{3} \]
where $P_1,\dots,P_i$ and $Q_1,\dots,Q_j$ are predicates or negated predicates. Each predicate is of the form $p(a_1,\dots,a_n)$, where $p$ is a predicate symbol and each argument $a_k$ is either a constant or a variable. By instantiating all the variables in rule $r$ with constants from their domains, we ground the rule. $w_r$ is the weight of the rule $r$.
An interpretation $I$ is a mapping that assigns a continuous value in $[0,1]$ to each ground predicate. The value of a ground rule in PSL is calculated based on Łukasiewicz logic Klir and Yuan (1995). Conjunction is interpreted by the Łukasiewicz t-norm ($\tilde{\wedge}$), disjunction by the Łukasiewicz t-conorm ($\tilde{\vee}$), and negation by the Łukasiewicz negator ($\tilde{\neg}$), which are defined as follows: for $m, n \in [0,1]$ we have $m \,\tilde{\wedge}\, n = \max(0, m+n-1)$, $m \,\tilde{\vee}\, n = \min(1, m+n)$ and $\tilde{\neg}\, m = 1-m$. The $\tilde{\ }$ indicates the relaxation over Boolean values. The distance to satisfaction of a ground PSL rule $r$ is defined as:
\[ d_r(I) = \max\{0,\ I(r_{body}) - I(r_{head})\} \tag{4} \]
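Consider the PSL rule $r$: $w_r : \mathit{Young}(U_1) \wedge \mathit{Friend}(U_1, U_2) \rightarrow \mathit{Young}(U_2)$. Given interpretation $I$, we instantiate $U_1 = \mathit{Alice}$ and $U_2 = \mathit{Bob}$. This instantiation results in the grounded PSL rule as follows:
\[ w_r : \mathit{Young}(\mathit{Alice}) \wedge \mathit{Friend}(\mathit{Alice}, \mathit{Bob}) \rightarrow \mathit{Young}(\mathit{Bob}) \]
Let $I(\mathit{Young}(\mathit{Alice})) = 1$, i.e., Alice is young to degree 1, and $I(\mathit{Friend}(\mathit{Alice}, \mathit{Bob})) = 0.7$, i.e., Alice and Bob are friends to degree 0.7. Then the body of the rule has value $\max(0, 1 + 0.7 - 1) = 0.7$, so to fully satisfy the ground PSL rule, $I(\mathit{Young}(\mathit{Bob}))$ should be at least $0.7$. If $I(\mathit{Young}(\mathit{Bob}))$ falls below this value, then the distance to satisfaction is the shortfall, $d_r(I) = 0.7 - I(\mathit{Young}(\mathit{Bob}))$.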
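The Łukasiewicz operators and the distance to satisfaction are easy to check numerically. The sketch below uses the standard definitions (t-norm $\max(0, m+n-1)$, t-conorm $\min(1, m+n)$, negator $1-m$) and replays the Alice and Bob example. The function and variable names are hypothetical, and the value 0.5 tried for Bob is an illustrative assumption rather than a value taken from the example.

```python
def l_and(m, n):
    """Lukasiewicz t-norm (soft conjunction)."""
    return max(0.0, m + n - 1.0)

def l_or(m, n):
    """Lukasiewicz t-conorm (soft disjunction)."""
    return min(1.0, m + n)

def l_not(m):
    """Lukasiewicz negator."""
    return 1.0 - m

def distance_to_satisfaction(body, head):
    """How far the head falls short of the body; 0 means the rule is satisfied."""
    return max(0.0, body - head)

# Alice is young to degree 1; Alice and Bob are friends to degree 0.7.
body = l_and(1.0, 0.7)
# Hypothetical value for "Bob is young", chosen only for illustration.
shortfall = distance_to_satisfaction(body, 0.5)
```

Any value of at least the body value for the head yields a distance of zero, which is why the head only needs to reach the body value for the rule to be fully satisfied.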
The goal of Maximum a Posteriori (MAP) inference in an HL-MRF (PSL model) is to find the most likely values for the variables in $\mathbf{Y}$, given values for $\mathbf{X}$:
\[ \arg\max_{\mathbf{Y}} P(\mathbf{Y}\mid\mathbf{X}) \tag{5} \]
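The grounded PSL rule in Example 1 results in the following hinge-loss potential function in the HL-MRF:
\[ \max\{0,\ I(\mathit{Young}(\mathit{Alice})) + I(\mathit{Friend}(\mathit{Alice}, \mathit{Bob})) - 1 - I(\mathit{Young}(\mathit{Bob}))\} \]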
Each of the rules in a PSL model induces one of the hinge-loss potentials in Equation (2), in which the loss function $\ell$ is defined through the distance to satisfaction of the rule as in Equation (4). By Equation (2) it follows that the goal of optimization is to minimize the weighted sum of the distances to satisfaction of all rules.
Since Equation (2) is log-concave in $\mathbf{Y}$, MAP inference in HL-MRFs is a convex optimization problem and can be solved exactly. In this paper, we follow the proposal of Bach et al. to use a method based on the alternating direction method of multipliers (ADMM) for MAP inference in HL-MRFs Bach et al. (2015). ADMM allows this optimization to be performed efficiently and in parallel, which makes inference scalable.
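To make the optimization view concrete, the sketch below minimizes a weighted sum of linear hinge-loss potentials over variables constrained to the unit interval. It deliberately uses plain projected subgradient descent instead of ADMM, so it only illustrates the convex objective and is not the inference procedure used in this paper. The encoding of a potential as a triple (weight, coefficients, constant) and all names are assumptions made for this example.

```python
def map_inference(potentials, n_vars, steps=2000, lr=0.1):
    """Minimize sum_r w_r * max(0, ell_r(y)) subject to y in [0, 1]^n.

    potentials: list of (weight, coeffs, const), where coeffs maps a variable
    index to its coefficient and ell(y) = const + sum_i coeffs[i] * y[i].
    Uses projected subgradient descent (not ADMM) and returns the best iterate.
    """
    def objective(v):
        return sum(w * max(0.0, c + sum(a * v[i] for i, a in coeffs.items()))
                   for w, coeffs, c in potentials)

    y = [0.5] * n_vars
    best, best_val = list(y), objective(y)
    for t in range(1, steps + 1):
        grad = [0.0] * n_vars
        for w, coeffs, c in potentials:
            ell = c + sum(a * y[i] for i, a in coeffs.items())
            if ell > 0:  # only violated potentials contribute a subgradient
                for i, a in coeffs.items():
                    grad[i] += w * a
        step = lr / t ** 0.5  # diminishing step size
        y = [min(1.0, max(0.0, yi - step * g)) for yi, g in zip(y, grad)]
        val = objective(y)
        if val < best_val:
            best, best_val = list(y), val
    return best

# One unknown variable: a rule with body value 0.7 (weight 1.0) pulls it up,
# while a weaker negative prior (weight 0.5) pulls it down.
solution = map_inference([(1.0, {0: -1.0}, 0.7), (0.5, {0: 1.0}, 0.0)], n_vars=1)
```

In this toy problem the stronger rule wins, so the minimizer sits where the rule becomes exactly satisfied, near 0.7.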
3 User Profiling Model
In this section, we present our HL-MRFs model for user profiling in social media. In Section 3.1 we first present a generic model, as well as two instantiations of it, namely PSL-TXT for inferences based on text, and PSL-IMG for inferences based on images. As PSL is ideally suited for modeling relational data, we dedicate a separate section, Section 3.2, to two models for inferences based on user-item relations, namely PSL-DIRECT and PSL-LATENT. All these individual models are seamlessly combined together into a user profiling model, called PSL-PROFILE, in Section 3.4.
3.1 Generic model
Our PSL models for inferring the value of a characteristic of a user from a given source rely on the following two rules:
Rules (8)-(9) are ground versions of these rules for one user and one characteristic: one ground predicate denotes that the user is an extrovert, and the other means that it is predicted from the user’s text that she is an extrovert. In general, rule (6) expresses that the predicted value of a user’s characteristic based on content from a source is indicative of the user’s true value for this characteristic. Vice versa, rule (7) says that the characteristics of a user should show through in the content they create.
For some users, the value of characteristic might be known, while for others it has to be inferred. Similarly, for some users, some content of source might be available, allowing the use of any of the known single-source techniques from Section 4.2 and Section 4.3 to predict a value of characteristic for the user. A PSL model consisting of rules (6)-(7) can consume all this existing evidence and infer values for the missing user characteristics. More formally, as explained in Section 2, let be all the evidence, i.e., is the set of ground predicates such that , is known, and let be the set of ground predicates such that , is unknown. Let be the set of all users, consisting of the set of users with known characteristic , and the set of all users with unknown characteristic (i.e., ). Therefore , and , . In addition, if there is content of source available for user , then we can use existing single-source techniques to infer a value for , i.e., . A PSL model consisting only of the rules (6)-(7) for a single source produces results that are close to the external single-source approach for inferring user characteristics. The true potential is reached when combining ground versions of rules (6)-(7) for multiple sources.
The domain of variable , consists of fem, yng, opn, con, ext, agr, and neu. We consider text and images as sources, i.e., . These domains can be straightforwardly extended to include other information sources and user characteristics as well.
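The way two rule templates expand over these domains can be sketched in a few lines. The predicate names `Pred_<source>` and `Is`, the source labels `txt` and `img`, and the textual rule syntax are hypothetical placeholders chosen for this illustration; only the list of characteristics and the two directions of implication (prediction implies characteristic, and characteristic implies prediction) come from the description above.

```python
CHARACTERISTICS = ["fem", "yng", "opn", "con", "ext", "agr", "neu"]
SOURCES = ["txt", "img"]  # assumed labels for the text and image sources

def rule_templates(source):
    """Instantiate both rule directions for every characteristic of one source."""
    rules = []
    for c in CHARACTERISTICS:
        # The prediction from the source is indicative of the true characteristic.
        rules.append(f"Pred_{source}(U, {c}) -> Is(U, {c})")
        # The true characteristic shows through in the content of the source.
        rules.append(f"Is(U, {c}) -> Pred_{source}(U, {c})")
    return rules

all_rules = [rule for s in SOURCES for rule in rule_templates(s)]
```

Adding a new source or characteristic only requires extending the corresponding list, which mirrors the extensibility of the domains.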
Our PSL model PSL-TXT consists of the 14 rules that are obtained when grounding the source variable in (6)-(7) with text, and grounding the characteristic variable with each of the values from its domain. Grounded versions of the prediction predicate assign values to users whose textual content is available. To compute these grounded predicates, we first extract textual features from the users’ textual content, such as unigram features, dictionary-based features, topics and writing style. Next, we classify them with a trained model (e.g., logistic regression). The model is trained over users in the dataset for which both textual content and labels for the user characteristic at hand are available. More details, including a motivation for the choice of logistic regression for the external single-source models, are given in Section 4. Examples of ground rules of the model PSL-TXT are:
Similarly, our PSL model PSL-IMG consists of the 14 rules that are obtained when grounding the source variable in (6)-(7) with images, and grounding the characteristic variable with each of the values from its domain. To assign values to the grounded prediction predicates for users whose profile picture is available, we first extract visual features, e.g., pixels, from the images and score them with an external single-source model trained on instances where both the profile picture and the value of the user characteristic at hand are available. We provide more details about the external single-source approach in Section 4. Examples of ground rules from PSL-IMG are:
3.2 Relational model
Next we introduce two PSL models that make use of user-item relational information. The first, which we refer to as the direct collaborative model (PSL-DIRECT) is based on the idea that if two users like the same item, then if one of them has a specific characteristic, the other has the same characteristic. Let model the user-item relation. Using this relation we define the PSL model consisting of the following rules:
The second, which we refer to as the latent model (PSL-LATENT), is based on the characteristics of the items. To infer the characteristics of users, first we infer the hidden (latent) characteristics of the items. In this model we use latent variables to define the item characteristics, i.e., indicates that item represents characteristic . The value of is unknown for all items and inferred with MAP inference. The usefulness of using hidden factors to define items has been studied in McAuley and Leskovec (2013). The corresponding PSL model consists of the rules (14), (15), (16), and (17):
We may have an item that has no page like relation from . We therefore add to our PSL-LATENT model the rules (18) and (19) that include prior knowledge from the evidence. To initiate values for the characteristics of all items, we assign the average score () of each characteristic using the characteristics of users in :
3.3 Prior model
To include prior knowledge from the evidence in our model for all users, we define a set of preliminary rules to assign the average score () of each characteristic using the characteristics of users in . This prior knowledge is modeled with the PSL rules (20)-(21). The output of using a PSL model with these rules is equivalent to an average baseline approach. We call this model PSL-PRIOR.
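The average baseline that PSL-PRIOR is equivalent to can be sketched as follows: the mean value of each characteristic over the users with known labels is used as the prior score for every other user. The data layout (a dictionary from user to characteristic values) and the function name are assumptions made for this example.

```python
def average_prior(known_scores):
    """known_scores: user -> {characteristic: value in [0, 1]} for labeled users.

    Returns the mean value per characteristic, used as the prior for all users.
    """
    totals, counts = {}, {}
    for traits in known_scores.values():
        for characteristic, value in traits.items():
            totals[characteristic] = totals.get(characteristic, 0.0) + value
            counts[characteristic] = counts.get(characteristic, 0) + 1
    return {c: totals[c] / counts[c] for c in totals}

prior = average_prior({
    "alice": {"ext": 1, "yng": 1},
    "bob": {"ext": 0, "yng": 1},
})
```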
3.4 Combined model
All the individual models that we have introduced can be combined to build our PSL-PROFILE model, which infers the unknown characteristics of users based on the known characteristics of other users using various sources of user data. Figure 1 presents the architecture of our PSL-PROFILE model, in which we combine the PSL-PRIOR, PSL-TXT, PSL-IMG and PSL-LATENT models to infer the unknown characteristics of Carol based on the known characteristics of Alice. We extract textual features from the status updates of both Alice and Carol and train seven logistic regression models, one for each of the seven characteristics, to predict the characteristics of Carol based on text. We import these predicted values for each characteristic into our PSL-PROFILE model with the rules of the PSL-TXT model. Similarly, we extract visual features from the profile pictures of both Alice and Carol and train seven logistic regression models to predict the unknown characteristics of Carol using the known characteristics of Alice. We import the predicted results for each characteristic into our PSL-PROFILE model with the rules of the PSL-IMG model. The pages that Alice and Carol like are modeled with the PSL-LATENT model to infer the unknown characteristics of Carol by using the hidden characteristics of the pages that Carol likes. More details, including a motivation for the choice of the PSL-LATENT model for modeling user-item relations, are given in Section 4.
4 Empirical Evaluation
In this section, we present an experimental evaluation of our HL-MRFs models for user profiling on Facebook data. In Section 4.1, we give details about the dataset from Facebook that we use in this study. To build a user profile, we incorporate three sources of user activities in Facebook, namely textual, visual and relational content. In Section 4.2 the details on inferring age, gender and Big Five personality traits from status updates (i.e., textual information) are presented. Similarly, Section 4.3 is dedicated to the details of predicting user characteristics from profile pictures (i.e., visual information). Next, in Section 4.4, we present the evaluation of our two user-item models from Section 3.2 by comparing them with competing methods that leverage user page like information (i.e., relational information). All these sections start with an overview of related works in user profiling using each of the respective sources, i.e. textual, visual, and relational. Finally, we present the complete evaluation results of applying our proposed HL-MRFs models on combined sources of information in Section 4.5.
4.1 Dataset and Evaluation Measures
To validate all methods we use a subset of the MyPersonality project dataset (http://mypersonality.org/). MyPersonality was a popular Facebook application introduced in 2007 in which users took a standard Big Five Factor Model psychometric questionnaire Goldberg et al. (2006) and gave consent to record their responses and Facebook profile. The dataset contains information about each user’s demographics, friendship links, Facebook activities (e.g., number of group affiliations, page likes, education and work history), status updates, profile picture and Big Five personality scores. However, not all of this information is available for all users. We selected users who mention English as their language, and who provide age, gender, personality, status updates, page likes and a profile picture. To increase the chance that our prediction from the image concerns the profile owner, we first selected profile pictures with only one face using OpenCV (http://opencv.org/) and a Haar cascade classifier Lienhart and Maydt (2002), and then re-selected the pictures with one face using the Project Oxford Face detector API (https://www.microsoft.com/cognitive-services/en-us/face-api).
By removing the Facebook pages with less than 3 likes by users in our dataset, our final dataset includes 49,372 pages, and 724,948 page like relations for 5,670 users. Personality traits are commonly described using five dimensions (known as the Big Five), i.e., Extraversion (ext), Agreeableness (agr), Conscientiousness (con), Neuroticism (neu), and Openness (opn). The range of the personality scores in our dataset is between . We use the median value to create binary classes for each characteristic, where median value for age = 23, opn = 4, con = 3.5, ext = 3.5, agr = 3.65, and neu = 2.75. We evaluate our user profiling model for the tasks of predicting age, gender and personality traits of Facebook users using their textual (status updates), visual (profile picture) and relational data (page likes).
We make two sub-samples from our dataset: the first is used for model tuning and the second for testing our HL-MRF model. The model tuning sub-sample is used in Section 4.2, Section 4.3 and Section 4.4 to select the single-source textual, visual and relational models, respectively. To measure the performance of the relational model in Section 4.4, we build the model tuning sub-sample from those users in our dataset who liked more than 100 pages. This first sub-sample thus consists of 1,725 users, 48,641 Facebook pages and 592,907 page likes.
The second sub-sample consists of the remaining 3,945 users in our dataset and serves as our test set. The 1,725 users of the first sub-sample are never used for testing the HL-MRF models. We collect all results with 10-fold cross-validation over the second sub-sample, where the users of the first sub-sample are added to the training set at each fold. The results on the second sub-sample are presented in Section 4.5 and Section 5.
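The evaluation protocol can be sketched as a 10-fold split of the test pool in which the model tuning users are appended to every training set and never appear in a test fold. The function name, the shuffling seed and the round-robin fold assignment are assumptions made for this example; the paper does not specify how the folds were drawn.

```python
import random

def cross_validation_splits(test_pool, tuning_users, k=10, seed=0):
    """Return k (train, test) pairs; tuning users are always in train only."""
    users = list(test_pool)
    random.Random(seed).shuffle(users)
    folds = [users[i::k] for i in range(k)]  # round-robin fold assignment
    splits = []
    for i, test in enumerate(folds):
        train = [u for j, fold in enumerate(folds) if j != i for u in fold]
        splits.append((train + list(tuning_users), test))
    return splits
```

With 3,945 pool users and 1,725 tuning users, every fold trains on roughly 5,275 users and tests on roughly 395.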
Since all the characteristics that we aim to predict are binary (i.e., positive class vs. negative class), we evaluate the results with the following metrics: Accuracy: the proportion of correct results among all the test instances; AUC: the area under the receiver operating characteristic curve; PR+: the area under the precision-recall curve for the positive class; and PR-: the area under the precision-recall curve for the negative class. The inferred scores are turned into binary labels by thresholding. The natural language processing and machine learning approaches in the following sections are implemented using the scikit-learn library in Python. We implemented our HL-MRF models using the publicly available PSL Java library with its Groovy interface. Source code is available (http://psl.umiacs.umd.edu).
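Two of these metrics can be sketched in plain Python. Accuracy is computed after thresholding the scores, and AUC is computed from its rank interpretation, i.e., the probability that a randomly chosen positive instance is scored above a randomly chosen negative one, with ties counted as one half. The threshold of 0.5 is an assumption made for this example, and the quadratic-time AUC is meant for illustration rather than for large test sets.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of instances whose thresholded score matches the 0/1 label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

AUC does not depend on the threshold, which makes it a useful complement to accuracy when the scores of a model are poorly calibrated.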
4.2 Textual Model Selection
There is a substantial body of existing work on automatically inferring a user’s characteristics from the user’s digital footprint in social media platforms. Existing single-source models usually leverage only one of text, image, or relational information. Machine learning models have been trained to infer the age, gender, and personality traits of users based on the textual content they produce, including blog posts and status updates Farnadi et al. (2016); Rangel et al. (2015); Schwartz et al. (2013). Author profiling has gained a lot of attention in the past few years, as shown by workshops and competitions such as PAN (http://pan.webis.de/), which focus on various features and techniques to predict the age and gender of authors in various languages, and by shared tasks such as WCPR (https://sites.google.com/site/wcprst/home/wcpr14) for personality prediction.
During pre-processing, we combine the status updates of each user in the dataset into one document per user. From these, we extract two sets of textual features: (1) Linguistic Inquiry and Word Count (LIWC) features Pennebaker and King (1999) and (2) n-gram features. LIWC features are known to perform well in personality prediction Farnadi et al. (2016), and n-gram features are very popular and well-known in author profiling Nowson and Oberlander (2006).
For each user, we extract 88 features using the LIWC tool, consisting of features related to (a) standard counts (e.g., word count), (b) psychological processes (e.g., the number of anger words such as hate, annoyed, … in the text), (c) relativity (e.g., the number of verbs in the future tense), (d) personal concerns (e.g., the number of words that refer to occupation such as job, majors, …), and (e) linguistic dimensions (e.g., the number of swear words). For a complete overview, we refer to Tausczik and Pennebaker (2010).
(2) n-grams: for each user, we extract n-gram features where from their status updates. As a weighting mechanism we use TF-IDF, and to select the top features we use Chi-square hypothesis testing where .
For each user characteristic, we train four models using the extracted LIWC features, namely a support vector machine classifier with a linear kernel, a decision tree classifier, a Naive Bayes classifier, and a logistic regression classifier. The logistic regression models outperform the other models for all characteristics in predicting the correct label.
We train similar models over the n-gram features, where logistic regression again outperforms the other models. We omit the results because of space constraints. We then compare the performance of the LIWC-based models and n-gram-based models. The models based on the extracted LIWC features outperform the n-gram models in general. The n-gram-based model works slightly better than the LIWC based model to predict age and Neuroticism. As the textual predictor in our PSL models, e.g., PSL-TXT, we use the LIWC model trained with logistic regression. Detailed results are presented in Table 2, Table 3 and Table 4 and discussed next. These include results about the performance of the LIWC based models with logistic regression as stand-alone single-source predictors in PSL-TXT, as well as when integrated into our PSL models including PSL-PROFILE.
4.3 Visual Model Selection
Recently, important progress has been made on age and gender identification from visual content et al. (2015). There are competitions concentrating on this task as well, such as the LAP Challenge 2016 on apparent age estimation and gender classification from images (http://gesture.chalearn.org/). Although much progress has been made in predicting age and gender from visual content, there is a limited focus on inferring personality traits. In Biel and Gatica-Perez (2013), Biel and Gatica-Perez focus on predicting the personality of vloggers (YouTube bloggers) based on their visual and audio content. Identifying personality traits from a static image such as a profile picture is mostly uncharted territory. Recently, in Liu et al. (2016), facial features (i.e., Face++ features) were extracted from Twitter profile pictures to predict personality; however, as the authors concluded, it is more challenging to predict a user’s personality from a static image because fewer behavioural cues can be extracted from an image than from other social media behaviours. In this paper we use a similar approach, based on Project Oxford features.
For each user we use his/her profile picture. We extract 64 facial features from each profile picture using Microsoft Cognitive Services’ Face API, also known as Project Oxford Face API Cao et al. (2010). The extracted features are face rectangle features to capture the location of the face in the image, face landmark features which include 27-point face landmarks pointing to the important positions of face components, face characteristics including age, gender, facial hair, smile, head position and glasses type. We refer to them as the “Oxford features” in the remainder of the paper.
Similar to the textual model, using the extracted Oxford features, we train a support vector machine with a linear kernel, a decision tree, a Naive Bayes model and a logistic regression model for each user characteristic. The logistic regression models have the best overall performance.
In addition to enabling the extraction of features from images, Project Oxford directly provides predictions for age and gender as well, making it a good alternative candidate for the external single-source predictor in PSL-IMG. On our validation set, the Oxford API is more accurate than a logistic regression model trained on the Oxford features: for gender, the Oxford API reaches an AUC of 0.934 compared with 0.834 for the logistic regression model, and for age, the Oxford API reaches an AUC of 0.583 compared with 0.523. However, for approx. 10% of the users in our dataset (i.e., 543 out of 5,670), Project Oxford’s native classifier does not produce a meaningful prediction. For this reason, in the remainder of this paper, we use our own logistic regression model trained over the Oxford facial features as the external single-source predictor from images in our PSL models.
It is important to note that all predictions in this paper are based on images that users have uploaded as their profile pictures. In the pre-processing step we filter single face pictures to enhance the chance of estimating the characteristics of the profile owner, but still this provides no guarantee that these images actually depict the profile owner. Many users upload pictures of their friends, family members or their child as their own profile picture, and therefore our predictions are an estimate of the characteristics of the face in the image and not necessarily the owner of the profile.
4.4 Relational Model Selection
Existing work on inferring user characteristics from relational content focuses typically either on using homophily or heterophily relations among friends Farnadi et al. (2015); McPherson et al. (2001), or indirect relations among users such as shared Facebook page likes Kosinski et al. (2013).
As HL-MRFs are ideally suited for modeling relational data, we defined two novel models, i.e., PSL-DIRECT and PSL-LATENT, for inferences based on user-item relations (see Section 3.2). One would intuitively expect the accuracy of these models to grow with the amount of available page like information. To verify this, we extract a sub-sample with those users from our dataset in Section 4.1 who liked more than 100 pages. To measure the impact of the number of Facebook pages that a user likes on the predictive accuracy of the HL-MRF models, we make five sub-samples by randomly selecting 20, 40, 60, 80 and 100 page likes per user in our original sample.
Our sub-sample consists of 1,725 users, 48,641 Facebook pages and 592,907 page likes. Since in our sub-sample, for each user we have more than 100 likes, we can test the performance of changing the number of page likes from 20, 40, 60, 80 to 100 for the same set of users.
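The construction of the five sub-samples can be sketched as drawing a fixed number of page likes per user uniformly at random. The function name, the data layout (user to set of pages) and the seed are assumptions made for this example, and the sketch draws each sub-sample independently; whether the sub-samples in the paper are nested is not stated.

```python
import random

def subsample_likes(likes_by_user, n_likes, seed=0):
    """Keep n_likes randomly chosen pages for every user.

    likes_by_user: user -> set of liked pages, each with at least n_likes pages.
    """
    rng = random.Random(seed)
    # random.sample needs a sequence, so the set is sorted first for determinism.
    return {user: set(rng.sample(sorted(pages), n_likes))
            for user, pages in likes_by_user.items()}

likes = {"u1": set(range(150)), "u2": set(range(100, 260))}
subsamples = {n: subsample_likes(likes, n) for n in (20, 40, 60, 80, 100)}
```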
We compare the performance of PSL-DIRECT and PSL-LATENT against the following approaches:
(1) Average baseline: We assign the average label from the training instances to the test instances. This baseline technique does not leverage page like information at all.
(2) Matrix-based (Ridge): We use a matrix representation model as presented in Kosinski et al. (2013). In this model each row represents a user in the dataset and each column represents a page. The value of a matrix entry is one if the user likes that page in the dataset, and zero otherwise. In Kosinski et al. (2013) Lasso is used to predict the Big Five personality traits; however, since our labels are binary, we use a linear least squares classifier with $\ell_2$ regularization (ridge regression). We set the parameter to .
(3) Matrix-based (logistic regression (LR)): Similar to the method mentioned above, we train a logistic regression classifier for each characteristic using the list of pages that each user likes.
(4) K-Nearest Neighbors (KNN): We find the nearest neighbors of a given user based on the common pages that they like, and aggregate the labels with a majority vote.
(5) User-Page-User (UPU): This approach is based on the graph structure and relies on a similar idea as our latent aggregate model (PSL-LATENT). In this approach, for each page we calculate the average score of the known characteristics of users who like that page. By aggregating the characteristics of users who like a page, we calculate the hidden characteristics of that page. Then, for a given user in our test set, we calculate the average score of the characteristics of the pages that a user likes using the pages’ characteristics.
(6) Non-negative Matrix Factorization (NMF): We use non-negative matrix factorization to transform the user-item matrix and then predict the characteristics using a logistic regression classifier. Due to the sparsity of the matrix, neither changing the number of dimensions (from 2 to 7) nor changing the predictor improves the performance of the NMF approach.
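To make two of the like-based baselines above concrete, the KNN and UPU approaches can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the implementation used in the study: the function names, the toy data, the choice of k = 3, the tie-breaking rules and the default score for users with no known pages are all hypothetical. Similarity is taken to be the number of pages two users have in common.

```python
from collections import Counter

def knn_predict(train_likes, train_labels, test_pages, k=3):
    """Majority vote over the k training users sharing the most pages."""
    ranked = sorted(
        train_likes,
        key=lambda u: (-len(train_likes[u] & test_pages), u),  # ties by user id
    )
    votes = Counter(train_labels[u] for u in ranked[:k])
    # Vote ties are broken toward the larger label purely for determinism.
    return max(votes, key=lambda label: (votes[label], label))

def upu_predict(train_likes, train_labels, test_pages, default=0.5):
    """User-Page-User: average the page scores of the pages a user likes,
    where a page score is the mean label of the training users who like it."""
    totals, counts = {}, {}
    for user, pages in train_likes.items():
        for page in pages:
            totals[page] = totals.get(page, 0.0) + train_labels[user]
            counts[page] = counts.get(page, 0) + 1
    scores = [totals[p] / counts[p] for p in test_pages if p in counts]
    return sum(scores) / len(scores) if scores else default

# Toy data (hypothetical): four training users with binary labels.
train_likes = {
    "a": {"p1", "p2", "p3"},
    "b": {"p1", "p2"},
    "c": {"p4", "p5"},
    "d": {"p5", "p6"},
}
train_labels = {"a": 1, "b": 1, "c": 0, "d": 0}
test_pages = {"p1", "p2", "p5"}

knn_label = knn_predict(train_likes, train_labels, test_pages, k=3)
upu_score = upu_predict(train_likes, train_labels, test_pages)
```

In this toy example the test user shares two pages with each of the two positively labeled users, so the KNN vote is positive, and the UPU score is the mean of the three page scores. The UPU score is a value in [0, 1] that would still need to be thresholded to obtain a binary label.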
Figure 2 presents the results. Both of our proposed HL-MRFs models outperform the other models in predicting users’ characteristics. Both of our proposed models incorporate social relational data to collectively infer users’ characteristics. Another interesting observation is that the PSL-LATENT model runs much faster than the PSL-DIRECT model. The number of grounded rules (i.e., potentials) for both the PSL-LATENT and PSL-DIRECT models depends on the number of page likes (i.e., items); however, if we have page likes, the maximum number of potentials for the PSL-LATENT model per characteristic is while for the PSL-DIRECT model it is .
The PSL-LATENT model is not only faster and more efficient than the direct model, but also predicts the characteristics more accurately than the PSL-DIRECT model, except for Openness, Extroversion and Agreeableness, where the differences between the two models are not significant.
In all the models that we implemented for this study, increasing the number of page likes from 20 to 40, 60, 80 and then to 100 pages per user boosts the performance. Note that we are not using any information about the items, such as the name of the pages or their content. These results are in line with the results presented in Kosinski et al. (2013). Based on the results presented in Figure 2, we select the latent model (i.e., PSL-LATENT) as the best relational model to capture the user-item relations in our combined user profiling models such as PSL-PROFILE discussed next.
4.5 User Profiling Results
To study the performance gains that can be achieved by using all sources of knowledge from text (status updates), image (profile pictures) and user-item relations (page likes), we systematically make 10 folds by randomly splitting the users in our sample dataset. To ensure a fair comparison among the approaches that we have used, all the results in Table 2, Table 3, and Table 4 are based on the same training and testing examples in each fold. The first line for each characteristic in all three tables (i.e., Table 2, Table 3, and Table 4) presents the average baseline (equivalent to PSL-PRIOR) results for that characteristic.
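One way to guarantee that every method sees the same training and testing users in each fold is to compute the split once with a fixed seed and reuse it. The sketch below illustrates this idea; the function name `make_folds`, the seed and the toy user identifiers are hypothetical and are not taken from the study.

```python
import random

def make_folds(user_ids, n_folds=10, seed=42):
    """Shuffle users once and deal them into n_folds disjoint test folds."""
    users = sorted(user_ids)  # sort so the result depends only on the seed
    random.Random(seed).shuffle(users)
    return [users[i::n_folds] for i in range(n_folds)]

# Toy data (hypothetical): 50 user identifiers.
all_users = [f"u{i}" for i in range(50)]
folds = make_folds(all_users)

# Each fold serves once as the test set; the remaining users form the training set.
splits = []
for test_users in folds:
    held_out = set(test_users)
    train_users = [u for u in all_users if u not in held_out]
    splits.append((train_users, test_users))
```

Storing `splits` and passing the same pairs to every model keeps the comparison across single-source, two-source and three-source models on identical examples.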
4.5.1 Predictions based on one source
Using a single source of information, we present the results of the best model for each source in Table 2. The second line for each characteristic presents the results of the best textual model using the users’ status updates, i.e., logistic regression trained on LIWC features (PSL-TXT). The third line presents the results of the best visual model using the users’ profile pictures, i.e., logistic regression trained on the Oxford features (PSL-IMG). Finally, the fourth line for each characteristic shows the results of the best relational model using HL-MRFs, which is the latent model (PSL-LATENT) introduced in Section 3.2. Interestingly, the relational models outperform the textual and visual models in predicting users’ age, gender and personality traits. Our textual models outperform our visual models for all characteristics except gender, where the visual model performs significantly better than the textual model. All models using a single source of user data outperform the average baseline in predicting all characteristics. As expected, the PSL-IMG model, which predicts personality traits from a single profile picture, performs worse than the textual and relational models.
4.5.2 Predictions based on two sources
Next, we study the performance of the following combinations of two sources of information:
(1) (Joint model) Textual+Visual: To combine the textual and visual data, we extend the 88 LIWC features extracted from the status updates with the 64 Oxford features extracted from the profile picture. We then train seven logistic regression models over these combined feature vectors. The second line for each characteristic in Table 3 shows the results of this approach. This approach does not involve HL-MRFs models. This method of fusing textual and visual features has been widely used in the literature as an early multi-modal fusion technique.
Another popular early technique for fusing features that has been proposed in the literature is canonical correlation analysis (CCA) Correa et al. (2010). We investigated the use of CCA in user profiling by extracting the canonical correlation features for all pairs of textual, visual and relational contents (i.e., textual and visual, textual and relational, and visual and relational). Using the set of CCA features, we trained logistic regression models for all the traits. We observed that the results of using CCA are worse than those of the joint model, which simply combines the features. The main reason for the poor results of CCA in user profiling is that CCA does not utilize the users’ traits in fusing the features from various sources; it only finds a linear combination of the existing variables for both knowledge sources such that the correlation is maximized for each source. Due to space limitations, the results of this CCA analysis are omitted from the paper.
(2) (HL-MRFs) Textual+Relational: Since our relational model is not a matrix-based model, we use HL-MRFs to combine the relational model with the textual model. We implement this model in PSL using the rules 6, 7, 14, 15, 16, 17, 18 and 19. The results of applying this approach for each characteristic are shown in the third line (PSL-TXT+LATENT) in Table 3.
(3) (HL-MRFs) Visual+Relational: Similar to the above combination, we use HL-MRFs to combine the visual and relational data. We implement this model in PSL using the same set of meta rules; however, we change the source so that the rules use the predicted results from the profile pictures. The last line for each characteristic in Table 3 presents the results of this model (PSL-IMG+LATENT).
All models using two sources of data outperform the average baseline in predicting all characteristics. Beyond this, there are two key observations. The first is that none of the approaches that combine textual and visual information performs particularly well. The joint LR approach does not outperform the single-source textual and visual predictors in predicting any of the traits. Our textual predictor outperforms our visual predictor in predicting age and personality traits, and also outperforms our joint technique. Even for gender prediction, where both textual and visual content perform reasonably well, the joint technique is no better than the visual predictor. The second observation is that, in contrast to the above, combinations with relational data turn out to be successful. In particular, the HL-MRFs model that combines relational data with textual data not only outperforms the other techniques but also outperforms all single-source predictors in Table 2. The results of the HL-MRFs model that combines relational data with visual data are either better than or as good as those of the single-source predictors in Table 2.
4.5.3 Predictions based on three sources
Finally, we investigate and compare the performance of the following models using all the three sources of users’ data:
(1) Ensemble (linear model): We predict the final characteristics of a user by taking the best prediction results from each source (i.e., the single-source models) and applying majority voting. The second line for each characteristic in Table 4 presents the results of this ensemble method, also known as a late fusion approach. Most of the related works on combining various sources of UGC, which mostly focus on text and images and not on relations, are based on a linear combination of the predictions. Some papers, such as Sakaki et al. (2014), propose a weighted average instead of a plain average. There is no existing work on combining textual, visual and relational content.
(2) Joint model: We extend our “Textual+Visual” model with the relational features, i.e., we extend the feature space of 88 LIWC features and 64 Oxford features with the pages that each user likes. Since we have 49,372 pages in our sample, our page like matrix is very sparse. Therefore, we first apply a truncated singular value decomposition (SVD) to reduce the dimensions and then extend it with our textual and visual features. We use logistic regression to train the models. The results of using this model are presented in the third line for each characteristic in Table 4. This technique is known as an early fusion approach.
(3) PSL-PROFILE model: We use our HL-MRFs fusion model to combine the predicted results of the visual and textual model with our latent model that leverages the page likes. The last line for each characteristic in Table 4 contains the results of this approach. The architecture of this model is presented in Figure 1.
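For reference, the late fusion baseline in item (1) reduces to a per-user majority vote over the three single-source predictions. The following sketch illustrates this with toy binary predictions; the source names, user names and predicted labels are hypothetical and do not come from our experiments.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by most of the single-source models."""
    label, _ = Counter(predictions).most_common(1)[0]
    return label

# Toy binary predictions (hypothetical) from the three single-source models.
single_source = {
    "text": {"alice": 1, "bob": 0},
    "image": {"alice": 0, "bob": 0},
    "likes": {"alice": 1, "bob": 1},
}
sources = ("text", "image", "likes")

ensemble = {
    user: majority_vote([single_source[s][user] for s in sources])
    for user in ("alice", "bob")
}
```

With three sources and binary labels a vote can never tie. Unlike the HL-MRFs fusion model, this vote treats every source as equally reliable and ignores the relations between users and pages.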
The results presented in Table 4 indicate that our HL-MRFs based approach for combining the predictions from various sources not only allows the development of expressive and flexible models, but also outperforms the competing techniques. For all characteristics except Extroversion and Conscientiousness, our fusion model PSL-PROFILE, which is based on all three sources of information, significantly outperforms the other combinations, including the models using a single source and two sources of knowledge. For Extroversion and Conscientiousness, the best-performing model is the combination of the textual and relational models in HL-MRFs. For both characteristics, the predictive model based on the combination of all three sources works better for the positive class, and the combination of textual and relational models works better for the negative class. One important advantage of our framework PSL-PROFILE is its ability to work with missing data, where we do not have all the information for all users. In this study, to ensure a fair comparison among the methods, we selected users who have all three knowledge sources; however, our framework is directly applicable in situations where not all users share similar content. In addition, our proposed framework can be used to create a more comprehensive user profile by gathering user data from different social media platforms (e.g., Facebook and Twitter). It is important to note that combining only non-relational sources, such as textual and visual content, in HL-MRFs will produce results similar to an ensemble model. HL-MRFs models are ideally suited for modeling relational data and therefore gain their power by combining non-relational sources with relational content.
5 User Profiling Results With Missing Data
A framework that leverages all available information about users can learn more accurate user profiles. This is especially useful for platforms where not every user generates the same type of information, and models trained based on one source of information fail to produce accurate user profiles. Examples include users who write status updates but never upload pictures, or users who join social media platforms only to consume knowledge and to relate with each other, rather than producing any textual or visual content themselves. In real-world social media platforms, users may not have a complete user profile. To mimic this behavior, in this section we measure the performance of our framework PSL-PROFILE when a fraction of users do not have all sources of knowledge in their profile. To study the effects of null values, we design three sets of experiments; in each, we remove one source of data from a fraction of users and evaluate the performance of PSL-PROFILE. We randomly removed , , , and of textual, visual and relational content from the profile of users in our test set. For the case of missing values, the results are equivalent to the results presented in Table 3, where we present the results of combining two sources of knowledge for each user in our dataset.
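The masking procedure used in these experiments can be sketched as follows. This is an illustrative sketch only: the function name `drop_source`, the profile layout, the toy scores and the 25% fraction are hypothetical choices made for the example and are not the settings used in the study.

```python
import random

def drop_source(profiles, source, fraction, seed=0):
    """Return a copy of profiles with `source` removed for a random fraction of users."""
    users = sorted(profiles)
    n_drop = round(fraction * len(users))
    dropped = set(random.Random(seed).sample(users, n_drop))
    return {
        u: {s: v for s, v in profiles[u].items() if not (u in dropped and s == source)}
        for u in users
    }

# Toy test set (hypothetical): 20 users with one prediction per source.
profiles = {
    f"u{i}": {"text": 0.6, "image": 0.4, "likes": 0.7} for i in range(20)
}

# Remove the visual source for a quarter of the test users.
masked = drop_source(profiles, "image", fraction=0.25)
n_missing = sum("image" not in p for p in masked.values())
```

Repeating this for each source and each fraction, and then evaluating PSL-PROFILE on the masked profiles, yields one curve per source and characteristic.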
The results presented in Figure 3 indicate that, except for gender prediction, where visual content is the best option, the most predictive data source for all the other characteristics is the relational content. Visual content performs poorly for personality trait and age prediction, so missing it has little influence on the overall results; for Neuroticism, however, textual content performs worse than the visual content. In general, missing data affects the results and having more data improves the predictions; the exception is age prediction, where missing the visual content has little influence on the overall results. Given the results presented in Figure 3, one can select the data sources suitable for predicting each characteristic.
6 Conclusion and Future Work
In various social media platforms, users have the freedom to generate content in different modalities. Social media users can generate textual content in the form of blog posts, status updates, tweets, comments, etc. Similarly, they can upload or share photos and images, such as their profile picture. Although many research works treat the various sources independently to infer user characteristics, there is not much related work that combines these predictive models effectively.
In this paper, to model social media users, we combined various sources of user-generated and social relational content. We built a flexible and expandable user profiling model using a probabilistic graphical model, Hinge-loss Markov Random Fields, and used a probabilistic relational framework, Probabilistic Soft Logic (PSL), to implement it. We provided extensive experimental validation of the proposed model for predicting the age, gender and personality of Facebook users based on their status updates, profile picture and page likes. Our experimental results show that, compared to competing methods that use only one source of information, jointly learn the characteristics, or apply an ensemble method across the different sources, the multimodal user profiling model that we proposed provides significantly more accurate user profiles.
As part of our future work, we plan to extend our user profiling model to not only incorporate user-item relations but also integrate other social relational content such as direct user-user relations: friendship and follower links. We will evaluate our user profiling framework with homophily relations with PSL rules such as and . Another promising future direction is to incorporate a weight learning mechanism. By using weight learning, we can not only determine the relative importance of each information source for predicting each trait, but also capture the quality of the content that each user produces in inferring their characteristics. For instance, we may have an accurate predictive model based on the textual content; however, for Alice, who has only one status update but many page likes, the prediction from her textual content should get a lower weight than the predictions from her page like relations. Tailoring the weight of each source to the quality of the content that each user produces is an open path to explore in the future.
References
-  (2015) Hinge-Loss Markov Random Fields and Probabilistic Soft Logic. arXiv:1505.04406 [cs.LG]. Cited by: §1, §2.
-  (2013) The youtube lens: crowdsourced personality impressions and audiovisual analysis of vlogs. Proc. of IEEE Transactions on Multimedia 15 (1), pp. 41–55. Cited by: §4.3.
-  (2010) Face recognition with learning-based descriptor. In , pp. 2707–2714. Cited by: §4.3.
-  (2010) Canonical correlation analysis for data fusion and group inferences. IEEE signal processing magazine 27 (4), pp. 39–50. Cited by: §4.5.2.
-  (2008) The revised NEO personality inventory (NEO-PI-R). The SAGE Handbook of Personality Theory and Assessment 2, pp. 179–198. Cited by: §1.
-  (2010) Multiple feature fusion for social media applications. In Proc. of the ACM SIGMOD International Conference on Management of data, pp. 435–446. Cited by: §1.
-  (2015) Collective spammer detection in evolving multi-relational social networks. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Cited by: §1.
-  (2015) Scalable adaptive label propagation in Grappa. In Proc. of IEEE International Conference on Big Data, pp. 1485–1491. Cited by: §4.4.
-  (2016) Computational personality recognition in social media. User Modeling and User Adapted Interaction, pp. 1–34. Cited by: §1, §4.2, §4.2.
-  (2006) The international personality item pool and the future of public-domain personality measures. Journal of Research in Personality 40 (1), pp. 84–96. Cited by: §4.1.
-  (2013) A flexible framework for probabilistic models of social trust. Social Computing, Behavioral-Cultural Modeling and Prediction, pp. 265–273. Cited by: §1.
-  (2015) A multi-source integration framework for user occupation inference in social media systems. World Wide Web 18 (5), pp. 1247–1267. Cited by: §1.
-  (1995) Fuzzy sets and fuzzy logic. Prentice Hall New Jersey. Cited by: §2.
-  (2013) Private traits and attributes are predictable from digital records of human behavior. Vol. 110, pp. 5802–5805. Cited by: §4.4, §4.4, §4.4.
-  (2015) HyPER: a flexible and extensible probabilistic framework for hybrid recommender systems. In Proc. of ACM Conference on Recommender Systems, Cited by: §1.
-  (2002) An extended set of haar-like features for rapid object detection. In Proc. of International Conference on Image Processing, Vol. 1, pp. I–900. Cited by: §4.1.
-  (2016) Analyzing personality through social media profile picture choice. In Proc. of the International AAAI Conference on Web and Social Media, Cited by: §4.3.
-  (2013) Hidden factors and hidden topics: understanding rating dimensions with review text. In Proc. of the 7th ACM conference on Recommender Systems, pp. 165–172. Cited by: §3.2.
-  (2001) Birds of a feather: homophily in social networks. Annual Review of Sociology, pp. 415–444. Cited by: §4.4.
-  (2006) The identity of bloggers: openness and gender in personal weblogs.. In Proc. of AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, pp. 163–167. Cited by: §1, §4.2.
-  (1999) Linguistic styles: language use as an individual difference. Journal of Personality and Social Psychology 77, pp. 1296–1312. Cited by: §4.2.
-  (2015) Overview of the 3rd Author Profiling Task at PAN 2015. In Proc. of CLEF, Cited by: §1, §4.2.
-  (2011) Towards a psychographic user model from mobile phone usage. In Proc. of the International Conference on Human Factors in Computing Systems, pp. 2191–2196. Cited by: §1.
-  (2015) DEX: deep expectation of apparent age from a single image. In Proc. of ICCV, ChaLearn Looking at People workshop, Cited by: §1, §4.3.
-  (2014) Twitter user gender inference using combined analysis of text and image processing. V&L Net 2014, pp. 54. Cited by: §1, §4.5.3.
-  (2013) Personality, gender, and age in the language of social media: the open-vocabulary approach. PloS one 8 (9), pp. e73791. Cited by: §4.2.
-  (2010) The Psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology 29, pp. 24–54. Cited by: §4.2.
-  (2015) Personality and recommender systems. In Recommender Systems Handbook, pp. 715–739. Cited by: §1.
-  (2014) Exploiting social network structure for person-to-person sentiment analysis. Transactions of the Association for Computational Linguistics 2, pp. 297–310. Cited by: §1.
-  (2016) Cross-modality consistent regression for joint visual-textual sentiment analysis of social multimedia. In Proc. of the Ninth ACM International Conference on Web Search and Data Mining, pp. 13–22. Cited by: §1.
-  (2015) Multi-dimensional attributes and measures for dynamical user profiling in social networking environments. Multimedia Tools and Applications 74 (14), pp. 5015–5028. Cited by: §1.