Online social media has become a ubiquitous part of daily life: it allows us to easily share ideas and content with other users, discuss social events and activities, and connect with friends. Tumblr, with over 280 million blogs and 130 billion blog posts, is one of the most popular social media apps. (A Tumblr user typically corresponds to one primary blog, so in this paper users and blogs are used interchangeably.) Its rich content, including text, images, and videos, provides great opportunities for advertisers to promote their products to specific groups. In particular, Tumblr offers “native advertisement”, which allows advertisers to present their sponsored posts on the users’ interface. Native advertising has gained over 3 billion paid ad impressions in 2015 since it was started in 2012. However, unlike other social networks, Tumblr does not ask for some user demographic information, such as gender, during registration. Even though age is a required input during registration, it is difficult to verify the accuracy of this self-declared information. This makes it challenging to precisely target ads at groups of users defined by age and gender. To improve the performance of user-specific ad targeting, it is important for Tumblr to infer users’ age and gender from the rich content generated by users.
Several attempts have been made at this task. For example, Grbovic et al. [7] proposed a gender and interest targeting framework that leverages user-generated data. Their model lifted user engagement with ads; however, they did not consider age prediction, which is one of the key factors for targeted ads. Furthermore, their gender prediction models only use features from blog contents and user activities (the so-called “cumulative features”). We believe that, in addition to the cumulative features, user interactions, represented by the Tumblr following graph, can also be used for age and gender prediction, as users who follow each other tend to have similar interests or backgrounds. It is worth mentioning that the model in [7] does use the blogs a user follows as a categorical feature for a linear classifier. However, it only exploits 1-hop neighbor information of the Tumblr following graph, whereas the rich network structure provides further indications of how users interact. Furthermore, with the development of deep learning, advanced models like the convolutional neural network (CNN) and multilayer perceptron (MLP) can typically produce better performance, which past work such as [7] does not explore.
Hence, to achieve better performance for age and gender prediction, in this paper, we propose a graph based approach with deep learning techniques that leverages the multitude of information encoded by the Tumblr following graph.
The main contributions of our paper include:
Data: We leverage rich cumulative features including user activities and post contents. Furthermore, we construct a large Tumblr following graph with hundreds of millions of vertices and billions of edges.
Graph-based methods: We apply graph embedding and label propagation techniques to (1) generate rich features from the Tumblr following graph through node embedding, which is shown to be useful for improving user demography prediction; and (2) directly leverage the label propagation algorithm to boost the prediction performance.
Deep learning: We apply deep learning models including CNN and MLP to further improve the performance of age and gender prediction.
Evaluation: We conduct empirical studies to show that the graph-based and deep learning approaches improve both AUC and accuracy for gender prediction, and accuracy for age prediction, compared to the baseline model [7] and classifiers such as GBDT and XGBoost.
The rest of this paper is organized as follows. Section II describes the Tumblr data, and Section III discusses our proposed methods, including network embedding, label propagation and deep learning. Empirical studies are shown in Section IV. Finally we summarize the related work in Section V, and conclude our work with future directions in Section VI.
In this section, we discuss the rich Tumblr data we use, and the labels we create.
II-A Tumblr Data
We use both following graph and cumulative features.
Following Graph. In Tumblr, users can follow other users, which forms the Tumblr following graph. Formally, we define $G = (V, E)$ as the following graph, where $V$ is the set of users and $E$ is the set of following relations. Note that even though the following graph is directed in Tumblr, we treat $G$ as an unweighted undirected graph, as our study focuses on age/gender prediction, where we assume that two users with a following relation tend to share interests and background. In this paper, $G$ consists of hundreds of millions of nodes and billions of edges.
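The step of treating the directed following relations as an undirected graph can be sketched as follows. This is a minimal in-memory illustration only: the function name `build_undirected_graph`, the toy edge list, and the decision to skip self-follows are assumptions of the sketch, and the actual graph is far too large for a single-machine dictionary.

```python
from collections import defaultdict


def build_undirected_graph(follow_edges):
    """Turn directed (follower, followee) pairs into an undirected adjacency map."""
    adjacency = defaultdict(set)
    for follower, followee in follow_edges:
        if follower == followee:
            continue  # skipping self-follows is an assumption of this sketch
        # Record the relation in both directions so edge direction is ignored.
        adjacency[follower].add(followee)
        adjacency[followee].add(follower)
    return dict(adjacency)


# Toy example: "a" and "b" follow each other, "c" follows "a".
edges = [("a", "b"), ("b", "a"), ("c", "a")]
graph = build_undirected_graph(edges)
```

Because neighbors are stored in sets, a mutual follow collapses into a single undirected edge, which matches the unweighted assumption stated above.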
Cumulative Features. We utilize the “cumulative features” of [7], which cover blog content and activities, including music (artist), follow (blogs followed), likes (of posts), photo captions, post tags, blog titles, and blog descriptions.
Note that the baseline model [7] only used cumulative features with LogisticRegression for Tumblr age/gender prediction. It is therefore interesting to study how the following graph can boost the performance.
II-B Label Construction
Tumblr does not ask for gender information when a user signs up. Furthermore, although it does ask for user age, no independent verification is done, and Tumblr considers the age information unreliable. Therefore, it is a challenge to create ground truth labels. Grbovic et al. [7] leveraged US census data that associates people’s names with gender to create gender labels. However, their approach suffers from the natural issue that some names are gender-neutral (e.g., “Avery”). Instead, we use age and gender ground truth data provided internally at Yahoo, which is large-scale and more reliable. This golden set provides good-quality ground truth labels, numbering in the millions for age and gender.
III Proposed Methods
In this section, we will discuss our approaches for age and gender prediction, including network embedding, label propagation and deep learning. We first enrich the cumulative features with graph features generated by network embedding and label propagation, and then apply deep learning techniques for the prediction. In addition, we also use label propagation as a direct tool for the prediction, which is shown to be efficient in the experiments.
III-A Network Embedding
When applying LogisticRegression on the cumulative features for the baseline model, we found that follow and like are the two most important features, which suggests that the blogs a user follows or reads are a good indicator of the user’s demographics. Intuitively, users tend to follow the activities of others with similar age and gender. Using follow features directly is akin to using words as unigram features in natural language processing. It works to some extent, but fails to uncover underlying connections among blogs. For example, if two blogs have a large overlap of followers, they are somewhat related even though they do not follow each other directly. Network embedding (mapping blogs to a high dimensional space based on their follow relations) can capture such relationships.
In this paper, we apply word2vec [16, 18] to generate rich network features. Word2vec is one of the most popular embedding techniques for capturing syntactic and semantic relationships of entities. It leverages a shallow neural network with two architectures, continuous bag-of-words (CBOW) and skip-gram. Levy and Goldberg [13] found that the skip-gram model with negative sampling, which uses a word to predict its neighboring words, implicitly factorizes a matrix whose entries are the (shifted) point-wise mutual information of word-context pairs. With this broad understanding of word2vec in mind, the technique is applicable to tasks beyond traditional natural language processing: it can be deployed in any application with entities and co-occurrence patterns among them. Since blogs and their followers have such a pattern, we leverage word2vec for our task.
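As a small worked illustration of the point-wise mutual information (PMI) statistic behind the matrix factorization view, the following sketch computes PMI from a toy list of co-occurrence pairs. The function name `pmi_table` and the toy pairs are hypothetical; the sketch shows only the statistic itself, not how word2vec is trained.

```python
import math
from collections import Counter


def pmi_table(pairs):
    """Compute PMI(w, c) = log( #(w, c) * N / (#(w) * #(c)) ) for observed pairs."""
    pair_counts = Counter(pairs)
    word_counts = Counter(word for word, _ in pairs)
    context_counts = Counter(context for _, context in pairs)
    total = len(pairs)  # N: total number of observed (word, context) pairs
    return {
        (word, context): math.log(
            count * total / (word_counts[word] * context_counts[context])
        )
        for (word, context), count in pair_counts.items()
    }


# Toy co-occurrences: each pair is (blog, blog seen in the same "sentence").
toy_pairs = [("a", "b"), ("a", "b"), ("c", "d"), ("c", "b")]
pmi = pmi_table(toy_pairs)
```

A positive PMI means the pair co-occurs more often than chance would predict from the individual frequencies, and a negative PMI means less often.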
Implementation details. Network embedding has been well studied recently. For example, DeepWalk [19] constructs “sentences” of vertices by random walks, and then applies a word2vec model to the sentences to get vector representations of vertices. Vertices that frequently appear together in sentences tend to be similar. Our implementation is similar to LINE [23], where we create a sentence from a vertex and its neighbors. Specifically, each blog is combined with all the blogs it follows; this set of blogs is randomly permuted and treated as one sentence. Altogether, the following graph produces hundreds of millions of “sentences” with billions of words, all of which are embedded into 50 dimensions using word2vec in skip-gram mode.
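The sentence construction described above can be sketched as follows. The function name `build_sentences`, the fixed random seed, and the toy follow map are assumptions made for illustration; the resulting lists of blog identifiers could then be passed to any word2vec implementation.

```python
import random


def build_sentences(follows, seed=0):
    """Create one shuffled 'sentence' per blog from the blog and the blogs it follows."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    sentences = []
    for blog, followed in follows.items():
        sentence = [blog] + sorted(followed)
        rng.shuffle(sentence)  # random permutation, so word order carries no meaning
        sentences.append(sentence)
    return sentences


# Toy follow map: blog -> set of blogs it follows.
follows = {"blog_a": {"blog_b", "blog_c"}, "blog_d": {"blog_a"}}
sentences = build_sentences(follows)
```

Shuffling matters because word2vec uses a context window: without a permutation, blogs that happen to be adjacent in storage order would look more related than others in the same sentence.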
We use a word2vec implementation with a minimum word count of 5, which results in a dictionary size of 46 million. In order to use blog embeddings successfully as features, we embed a majority of the blogs. For a blog without an embedding, we construct one by averaging the embeddings of its neighbors that have an embedding. We denote the embedding algorithm as Emb.
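The fallback for blogs that fall below the minimum word count can be sketched as a neighbor average. The function name `fill_missing_embedding` and the choice to return `None` when no neighbor has an embedding are assumptions of this sketch; the text does not say how that case is handled.

```python
def fill_missing_embedding(blog, neighbors, embeddings):
    """Return the blog's embedding, or the mean of its neighbors' embeddings."""
    if blog in embeddings:
        return embeddings[blog]
    known = [embeddings[n] for n in neighbors.get(blog, ()) if n in embeddings]
    if not known:
        return None  # assumption: no estimate when no neighbor is embedded
    dim = len(known[0])
    # Average the known neighbor vectors coordinate by coordinate.
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy data: "rare" has no embedding but two of its three neighbors do.
embeddings = {"a": [1.0, 3.0], "b": [3.0, 5.0]}
neighbors = {"rare": {"a", "b", "also_rare"}}
estimate = fill_missing_embedding("rare", neighbors, embeddings)
```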
III-B Label Propagation
Emb is an efficient method to generate graph features. However, it is an unsupervised approach that does not take label information into account during training. We therefore develop a semi-supervised learning algorithm based on label propagation (called LabelProp), which takes into account both labels and multi-hop neighborhood information. LabelProp serves as an approach to directly predict age/gender, as well as to generate additional graph features. In addition, compared to the time-consuming training of Emb on large graphs, LabelProp is a faster algorithm that can be naturally implemented in parallel in a distributed environment using Spark on Hadoop.
LabelProp spreads existing age/gender labels over the following graph. The intuition is that users tend to follow blogs with a similar background (e.g., ladies’ fashion). The age/gender of such a popular blog, defined by its followers, in turn defines the age/gender of a follower whose demographic information may be unknown to us. Next we introduce the LabelProp algorithm and its variants, and then discuss how to leverage it for feature generation.
The LabelProp Algorithm. Given the Tumblr following graph $G = (V, E)$, the users in $V$ are divided into a set of labeled users and a set of unlabeled users, and each labeled user carries a known label. Label propagation is an iterative process: at each iteration, a node’s label propagates to its 1-hop neighbors. The process starts with the labeled users, and once a node receives labels from its neighboring nodes, it propagates labels in subsequent iterations. The process ends when it converges or reaches a predefined number of iterations. Note that during the process we fix the labels of the initially labeled nodes, i.e., only nodes without labels at the beginning iteratively update their labels.
Algorithm 1 shows the Label Propagation (LabelProp) algorithm. At each iteration, we update the label of every initially unlabeled node (Line 4) based on the set of its neighboring nodes that received labels in previous iterations. The algorithm ends when it reaches the predefined number of iterations. Note that we also add a hyperparameter to control the contribution of neighboring labels: if the hyperparameter is low, we focus more on the neighboring information, and vice versa.
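The iterative process described above can be sketched in a few lines of single-machine Python. The exact update rule of Algorithm 1 is not reproduced in this text, so the rule below is an assumption: a node's new label is `alpha` times its previous label plus `(1 - alpha)` times the mean label of its labeled neighbors, which is one rule consistent with a low hyperparameter putting more weight on neighbors. The names `label_prop`, `alpha`, and `num_iters` are hypothetical.

```python
def label_prop(adjacency, seed_labels, alpha=0.5, num_iters=3):
    """Propagate numeric labels over a graph; seed labels stay fixed throughout."""
    labels = dict(seed_labels)
    for _ in range(num_iters):
        updated = dict(labels)
        for node, neighbors in adjacency.items():
            if node in seed_labels:
                continue  # initially labeled nodes are never overwritten
            incoming = [labels[n] for n in neighbors if n in labels]
            if not incoming:
                continue  # no labeled neighbor has been reached yet
            neighbor_mean = sum(incoming) / len(incoming)
            if node in labels:
                # Assumed rule: blend the previous label with the neighbor mean.
                updated[node] = alpha * labels[node] + (1 - alpha) * neighbor_mean
            else:
                updated[node] = neighbor_mean
        labels = updated
    return labels


# Toy path graph a - b - c - d with labeled endpoints (1.0 and 0.0).
adjacency = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
seeds = {"a": 1.0, "d": 0.0}
after_one = label_prop(adjacency, seeds, num_iters=1)
after_two = label_prop(adjacency, seeds, num_iters=2)
```

Each node reads only the labels from the previous iteration, which is why the update maps naturally onto a message-passing implementation where every vertex is processed independently.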
Implementation details. The label here can be either gender or age. For gender, we assign one numeric value to female and another to male. For age, we first split ages into buckets, and use one-hot encoding to create a label vector with one entry per bucket. We run LabelProp for each entry. It is natural to run LabelProp in parallel, as each node updates its label by independently querying its neighbors at each iteration. We implement LabelProp under the Pregel framework [15] on Spark, which is a state-of-the-art message-passing parallel model for large distributed graphs.
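The one-hot encoding of age buckets can be sketched as follows. The bucket boundaries in `BOUNDS` are purely hypothetical, since the actual boundaries are not given in this text; they were chosen only so that the sketch yields 7 buckets, matching the 7 age classes reported in the experiments.

```python
import bisect

# Hypothetical bucket boundaries (ages at which a new bucket starts).
BOUNDS = [18, 25, 35, 45, 55, 65]  # 6 boundaries give 7 buckets


def age_to_one_hot(age, bounds=BOUNDS):
    """Map an age to a one-hot vector with one entry per age bucket."""
    index = bisect.bisect_right(bounds, age)  # bucket index in [0, len(bounds)]
    vector = [0] * (len(bounds) + 1)
    vector[index] = 1
    return vector


example = age_to_one_hot(30)
```

Running label propagation once per entry then amounts to propagating each coordinate of this vector as a separate numeric label.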
Note that we do not run LabelProp until it converges, but stop at a predefined maximum number of iterations for scalability reasons (a large number of iterations may be needed to converge). Furthermore, as mentioned above, the motivation of LabelProp is that users with following relations have similar demographics. Hence, labels that reach a node only after a large number of iterations may “contaminate” its prediction, as labels propagated from that far away can differ from the node’s true label.
Variants of label propagation. We propose two variants of LabelProp for gender prediction in order to investigate how labels from nodes at different distances affect the label prediction. Note that it is straightforward to generalize the analysis to multiclass age prediction. The main idea is that, instead of using a constant hyperparameter, we change the parameter adaptively: as the iteration number increases, we decrease the contribution of incoming labels from neighboring nodes. Hence, labels that are far away become less important in Algorithm 1 (Line 4). If they do not affect the performance, then the prediction results with many iterations will remain competitive compared to those with few iterations. We propose two alternative propagation strategies.
In the first strategy, the parameter depends on the iteration number, so that the impact of labels from neighboring nodes decreases exponentially as the iteration number increases. In the second strategy, the weight on neighboring labels also decreases with the iteration number, but according to a different schedule. We propagate male and female labels separately for each node, and at the end normalize them to obtain the final label for each node. As shown in the experiments, these two strategies empirically demonstrate that, for a given node, labels from its close neighbors are more important than long-distance labels for gender prediction.
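To make the idea of an exponentially decreasing neighbor contribution concrete, here is a minimal sketch. The exact schedules of the two strategies are not reproduced in this text, so the schedule below (`decay ** iteration`) and the names `neighbor_weight`, `decayed_update`, and `decay` are assumptions chosen only to illustrate the behavior.

```python
def neighbor_weight(iteration, decay=0.5):
    """Assumed schedule: the weight on neighboring labels shrinks exponentially."""
    return decay ** iteration


def decayed_update(current_label, neighbor_mean, iteration, decay=0.5):
    """Blend a node's current label with its neighbors' mean label."""
    weight = neighbor_weight(iteration, decay)
    return (1 - weight) * current_label + weight * neighbor_mean


# Weights for iterations 1..4 under the assumed schedule.
weights = [neighbor_weight(t) for t in range(1, 5)]
```

Under such a schedule, labels that arrive in late iterations (and therefore from distant nodes) barely move a node's value, which is the effect the variants are designed to probe.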
Label Propagation as Features. With LabelProp, we can directly learn labels through the iteration rule (Algorithm 1, Line 4). As shown in the experiments, it provides high-quality predictions compared to the baselines. However, it takes into account neither the cumulative features nor the features from Emb. Conversely, Emb generates graph features in an unsupervised manner without considering existing label information. To leverage the label information to improve the prediction performance, we use LabelProp to generate features as well.
The idea of using LabelProp as a feature generator comes from ensemble methods: we divide the labeled data uniformly at random into a fixed number of partitions. For each partition, we run LabelProp on the whole graph using only that partition’s labeled data as the starting labels, obtain the learned labels, and concatenate the learned labels across partitions. Each node’s label propagation features are thus represented as a vector with one dimension per partition. For a labeled node that belongs to a given partition (meaning its true label was used in that run), we do not use its true label as the feature at the corresponding dimension and set that dimension to a different value instead, because directly using true labels as features would cause overfitting.
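The partition-based feature generation can be sketched as follows. Several details are assumptions of the sketch rather than the method as implemented: the simplified `propagate` helper (unlabeled nodes take the mean of their labeled neighbors), the neutral `fill` value of 0.5 used both for unreached nodes and for a node's own seed dimension, and the names `label_prop_features` and `num_partitions`.

```python
import random


def propagate(adjacency, seeds, num_iters=2):
    """Simplified propagation: unlabeled nodes take the mean of labeled neighbors."""
    labels = dict(seeds)
    for _ in range(num_iters):
        updated = dict(labels)
        for node, neighbors in adjacency.items():
            if node in seeds:
                continue  # seed labels stay fixed
            incoming = [labels[n] for n in neighbors if n in labels]
            if incoming:
                updated[node] = sum(incoming) / len(incoming)
        labels = updated
    return labels


def label_prop_features(adjacency, labeled, num_partitions=2, seed=0, fill=0.5):
    """Build one feature per partition of the labeled data for every node."""
    nodes = sorted(labeled)
    random.Random(seed).shuffle(nodes)
    partitions = [nodes[i::num_partitions] for i in range(num_partitions)]
    features = {node: [fill] * num_partitions for node in adjacency}
    for dim, part in enumerate(partitions):
        seeds = {node: labeled[node] for node in part}
        learned = propagate(adjacency, seeds)
        for node, value in learned.items():
            if node in seeds:
                continue  # never expose a node's own true label as its feature
            features[node][dim] = value
    return features, partitions


# Toy path graph a - b - c - d with labeled endpoints.
adjacency = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
labeled = {"a": 1.0, "d": 0.0}
features, partitions = label_prop_features(adjacency, labeled)
```

The key design point is the `continue` branch: a labeled node's feature in its own partition's dimension is never its true label, so a downstream classifier cannot simply read the answer off that dimension.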
III-C Deep Learning Models
Deep learning models have been extensively studied due to their extraordinary performance in a variety of areas, including computer vision, natural language processing, and robotics. The baseline model used the LogisticRegression method for age/gender prediction. In this paper, we aim to leverage deep learning models to improve user demography prediction. To start with, we use a multilayer perceptron (MLP), a class of feed-forward neural network with at least three layers, with embeddings as features. In addition, we experiment with a convolutional neural network (CNN) [11], which has shown promising and reliable performance across a range of text classification tasks by leveraging embedding results. This is a word-embedding-based CNN that uses the text features.
As a next step we also tried ResNet [10], whose architecture has more layers (18 in our case) and adds skip connections that let a layer’s input bypass the following two layers. These skip connections make deeper models easier to train. The accuracy of this model is higher than that of an MLP with 3 hidden layers. However, ResNet took considerably longer to train, so we decided not to pursue this direction.
Implementation details. MLP and CNN are two advanced models that typically take a long time to train on large datasets, especially on the Tumblr data with hundreds of millions of users. To speed up the training process, we sample the same number of female and male examples. Empirical studies over the whole dataset show that the model trained on the sampled users still provides convincing performance gains over the baseline model (LogisticRegression). For both MLP and CNN, we use cross entropy as the objective and stochastic gradient descent (SGD) as the optimizer.
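Sampling the same number of female and male examples can be sketched as a simple downsampling step. The function name `balanced_sample`, the `(features, label)` tuple layout, and the string labels are assumptions of this sketch; the text does not state how many examples were sampled or how the sampling was implemented.

```python
import random


def balanced_sample(examples, seed=0):
    """Downsample so that both gender labels appear equally often."""
    rng = random.Random(seed)
    by_label = {"female": [], "male": []}
    for example in examples:
        by_label[example[1]].append(example)
    size = min(len(group) for group in by_label.values())
    sample = []
    for group in by_label.values():
        sample.extend(rng.sample(group, size))  # draw without replacement
    rng.shuffle(sample)  # mix the classes so SGD batches are not label-sorted
    return sample


# Toy data: 2 female and 5 male examples, each as (features, label).
examples = [([0.1], "female"), ([0.2], "female")] + [([0.3], "male")] * 5
sample = balanced_sample(examples)
```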
We conduct our experiments using the Spark framework on a Hadoop cluster with multiple executors. LabelProp is implemented using the Spark GraphX library, while Emb is implemented using a multi-threaded version of word2vec.
We use the labeled data described in Section II-B. To evaluate the performance, we randomly sample a fixed fraction of the labeled data as the training set, and use the rest as the testing set. We fit our model on the training set, and compute accuracy on the testing set. For gender prediction we also report AUC, as it is a binary classification task.
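For reference, the two evaluation metrics can be computed as in the following sketch. The function names and toy data are hypothetical, and the pairwise AUC shown here is quadratic in the number of examples, so it is suitable only for illustration and not for data at the scale used in the experiments.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """AUC as the probability that a positive example outscores a negative one."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0  # ties count as half a win
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
auc_value = auc(y_true, scores)
acc_value = accuracy(y_true, [1, 0, 0, 0])
```

Accuracy depends on a decision threshold, whereas AUC depends only on how the scores rank positives against negatives, which is why both are informative for the binary gender task.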
In short, we demonstrate that CNN and MLP outperform the baseline model in accuracy for the multi-class (7 classes) age prediction. In addition, we show that adding features from Emb and LabelProp greatly improves AUC for gender prediction compared to the baseline model. Finally, we explore how the performance of LabelProp changes with the iteration number and hyperparameter, as well as the variants of LabelProp.
Age Prediction. As discussed above, for age prediction we split ages into 7 buckets as labels, and use CumF and Emb as features. The baseline is a generalized multiclass LogisticRegression on CumF, which uses the softmax function instead of the logistic function. In addition to CNN and MLP, we also combine them (CNN+MLP) to boost the performance, by concatenating the last hidden layers of both and feeding them to the cross-entropy output layer. Table I shows the results: combining CNN and MLP produces the best accuracy, improving over the baseline LogisticRegression model. In addition, MLP outperforms CNN.
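The combination of the two networks' last hidden layers followed by a softmax output with cross-entropy loss can be sketched as below. The hidden activations, the weight matrix, and the function names are all hypothetical toy values; only the structure (concatenate, apply a linear output layer over the 7 age classes, apply softmax, score with cross entropy) follows the description above.

```python
import math


def softmax(logits):
    """Convert logits to probabilities; subtracting the max keeps exp() stable."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the true class."""
    return -math.log(probs[target_index])


def combined_logits(cnn_hidden, mlp_hidden, weights, bias):
    """Concatenate both hidden layers and apply one linear output layer."""
    joint = cnn_hidden + mlp_hidden  # list concatenation
    return [sum(w * x for w, x in zip(row, joint)) + b for row, b in zip(weights, bias)]


# Toy activations and a 7-class output layer where only class 2 has nonzero weights.
cnn_hidden, mlp_hidden = [1.0, 0.0], [0.5]
weights = [[0.0, 0.0, 0.0] for _ in range(7)]
weights[2] = [1.0, 0.0, 2.0]
bias = [0.0] * 7
probs = softmax(combined_logits(cnn_hidden, mlp_hidden, weights, bias))
```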
Gender Prediction. Table II shows the results of using different features on the Hadoop system. CumF is the result of using only the cumulative features [7], which are used in the baseline. Emb and LabelProp are the results of using the embedding and label propagation features, respectively, and CumF+LP is the result of combining the cumulative and label propagation features. All is the result of putting the cumulative, node embedding, and label propagation features together. First, All produces the best AUC. Second, it is interesting that CumF+LP also gives competitive results, with only a small performance loss compared to All. Finally, when the feature sets are used separately, LabelProp outperforms both CumF and Emb.
In addition to testing our model on Hadoop, we also compare against other classifiers as baselines for gender prediction on a single machine. Table III compares the different feature integration approaches in terms of AUC and accuracy. For Emb, we use the user embeddings described in Section III-A, while for LabelProp we use the label propagation features. First, our MLP model outperforms the other classifiers in both AUC and accuracy. Second, stacking Emb and LabelProp features together improves both AUC and accuracy. Compared to the baseline LogisticRegression model with cumulative features, MLP with Emb and LabelProp features improves the AUC by 5%.
Sensitivity of LabelProp. We also evaluate the sensitivity of LabelProp to its hyperparameter and to the number of iterations. Table IV shows the results. First, with a single iteration, LabelProp just queries labels from 1-hop neighbors; hence, the performance is not as good as when examining multi-hop neighbors. Second, as the number of iterations goes up, the AUC first increases and then slightly decreases. This empirically confirms our intuition that labels close to a node contribute more to its labeling result. In addition, we notice that the convergence rate differs across hyperparameter values, and full convergence can take many iterations. However, in practice, since we are interested in the prediction performance, we only need to run a few iterations to obtain the best result. With regard to the hyperparameter, we observe that the smaller it is, the fewer iterations are needed to achieve the best AUC.
Variants of LabelProp. The variants of LabelProp proposed in Section III are used to examine how the distance of neighbors affects the results, and how many iterations are sufficient to get good predictions. Unlike LabelProp with a constant hyperparameter, both strategies decrease the weight on neighboring labels as the iteration number grows. As shown in Table V, both strategies reach their best AUC within a few iterations, which suggests that labels within a few hops are very important for LabelProp to obtain good results. As the iteration number increases further, the AUC remains almost the same, especially for one of the two strategies, which suggests that labels at a large distance have little impact on the prediction.
V Related Work
Demographic Ad Targeting. Personalized advertising is a common ad targeting strategy, which has been studied broadly [6, 4, 14]. It tries to display the most relevant ads to each individual. Demographic ad targeting is one of the personalization tactics, which targets users based on their age, gender, etc. Grbovic et al. [7] first studied gender-based ad targeting in Tumblr. However, their model is based on learned gender labels, which can sometimes be unreliable. To address this issue, we provide high-accuracy labels to improve gender prediction. In addition to demographic ad targeting, another type of personalized advertising used in Tumblr is interest targeting, which recommends ads based on their categories [7, 22, 2].
Network Embedding. The graph embedding problem is to generate vector representations of nodes. Previous work, such as locally linear embedding [21], IsoMap [25], and spectral techniques [5, 1, 3], treats networks as matrices. These methods tend to be slow and do not scale to large networks. Recently, leveraging deep learning techniques, several novel algorithms have been proposed to learn feature representations of nodes [19, 24, 27, 8]. DeepWalk [19] and Node2Vec [8] extend the word2vec model [17] to networks by leveraging random walks to generate “word context”. SDNE [27] and LINE [24] learn embeddings on directed graphs that maintain the first- and second-order proximity of nodes. In this paper, our Emb implementation leverages LINE to learn graph features of users.
Label Propagation. The idea of label propagation has been widely studied in the machine learning literature. Zhu et al. [29] first leveraged label propagation for a graph-based semi-supervised learning algorithm, and Zhou et al. [28] further proposed an iterative algorithm derived from a closed-form solution for label propagation. Since then, label propagation has been applied to many domains, such as multimedia, including image and video data [12], and information retrieval, such as relevance and keyword search [9, 26]. In addition, Rao and Yarowsky [20] proposed a parallel label propagation algorithm under the MapReduce framework. However, none of the above work studied the problem with large-scale data like ours or implemented the label propagation algorithm on large distributed systems in practice, nor was it applied to demography classification.
VI Conclusion and Future Work
In this paper, we study the important and challenging problem of age and gender prediction on large-scale Tumblr data, which can be used to better target sponsored ads at specific demographic audiences. We propose to add the following graph information to the existing cumulative features in Tumblr to enhance the age and gender prediction performance. In particular, we leverage graph embedding and label propagation techniques to generate informative user features, and apply deep learning models including CNN and MLP to utilize these features. Experimental results demonstrate that our approaches outperform the baseline models, improving accuracy for age prediction and both accuracy and AUC for gender prediction.
As future work, we would like to incorporate Tumblr’s raw age signals either as a feature, or as a way to gain more trustworthy labels in the age prediction model training. In addition, the Tumblr data consists of many images. If we could leverage the images and their annotations as extra features, then this could possibly boost the performance further.
[1] Learning spectral clustering. In NIPS, Vol. 16.
[2] (2014) Who to follow and why: link prediction with explanations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1266–1275.
[3] (2001) Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, Vol. 14, pp. 585–591.
[4] (2008) Computational advertising and recommender systems. In Proceedings of the 2008 ACM conference on Recommender systems, pp. 1–2.
[5] (1997) Spectral graph theory. Vol. 92, American Mathematical Soc.
[6] (2009) Matchmaker, matchmaker. Communications of the ACM 52 (5), pp. 16–17.
[7] (2015) Gender and interest targeting for sponsored post advertising at Tumblr. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, New York, NY, USA, pp. 1819–1828.
[8] (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864.
[9] Manifold-ranking based image retrieval. In Proceedings of the 12th annual ACM international conference on Multimedia, pp. 9–16.
[10] Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[11] (2014) Convolutional neural networks for sentence classification.
[12] (2013) Graph-based semi-supervised learning with multi-modality propagation for large-scale image datasets. Journal of Visual Communication and Image Representation 24 (3), pp. 295–302.
[13] (2014) Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 2177–2185.
[14] (2013) Know your personalization: learning topic level personalization in online services. In Proceedings of the 22nd international conference on World Wide Web, pp. 873–884.
[15] (2010) Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pp. 135–146.
[16] Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR.
[17] (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119.
[18] (2014) GloVe: global vectors for word representation. Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014) 12.
[19] (2014) DeepWalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710.
[20] (2009) Ranking and semi-supervised classification on large scale graphs using map-reduce. In Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing, pp. 58–65.
[21] (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290 (5500), pp. 2323–2326.
[22] (2014) Recommending Tumblr blogs to follow with inductive matrix completion. In RecSys Posters.
[23] (2015) LINE: large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, WWW ’15, Republic and Canton of Geneva, Switzerland, pp. 1067–1077.
[24] (2015) LINE: large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077.
[25] (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290 (5500), pp. 2319–2323.
[26] (2005) Graph based multi-modality learning. In Proceedings of the 13th annual ACM international conference on Multimedia, pp. 862–871.
[27] (2016) Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1225–1234.
[28] (2004) Learning with local and global consistency. In Advances in neural information processing systems, pp. 321–328.
[29] (2003) Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International conference on Machine learning (ICML-03), pp. 912–919.