Microblog services like Twitter have become important social platforms for users to share their media content. The retweet function is usually considered the key mechanism that enables users to repost someone else's tweets. In social media sites, users who follow other users are termed "followers" and users who are followed are termed "followees". The central problem of retweet prediction is to model tweet-sharing behavior, in which users repost tweets along followee-follower links so that more users are informed in SMS; this problem has attracted considerable attention recently [2, 3, 1, 4, 5].
Existing approaches for retweet prediction [3, 1, 4, 5, 6] learn a user preference model from users' past retweeted textual tweets and predict users' tweet-sharing behavior in SMS. With the popularity of mobile devices, the amount of user-generated image tweets has grown tremendously. For example, about 17.2% of tweets on Twitter are associated with images. It is therefore important to study the problem of image retweet prediction in SMS. We give a simple example of image retweet prediction in Figure 1. Because there is no discriminative feature representation for tweets with images and SMS data is sparse, existing retweet prediction methods are ineffective for the image retweet prediction problem.
Currently, most existing retweet prediction methods [3, 1, 4, 5, 6] learn the semantic representation of a tweet from hand-crafted features (e.g., bag-of-words). Recently, high-level visual features extracted by pre-trained CNNs have shown success in various visual recognition tasks [7, 8]. Since image tweets are visual data, it is natural to employ deep convolutional neural networks to learn their visual representation. On the other hand, image tweets are often associated with textual context information such as users' comments and captions. This contextual information usually conveys important messages and gives a better understanding of the tweets. Since textual contextual information is sequential data of variable length, we employ deep recurrent neural networks to learn its semantic representation. We employ a multi-modal neural network learning method to learn the joint image tweet representation from the multi-modal contents, which provides complementary information across modalities.
The sparsity of SMS data is also a challenging issue for image retweet prediction. In SMS sites, the network between image tweets and users is constructed through users' retweet relations on image tweets. Usually, each user retweets only a few image tweets, so the SMS network is sparse. Inspired by the homophily hypothesis, it is reasonable to assume that collective information from users' followees and users' retweeted tweets can be considered jointly to tackle the sparsity problem of image retweet prediction. It is observed that the social impact on retweet behavior varies between a user and his/her different followees. We thus employ an attention mechanism to adaptively incorporate users' followee preferences when predicting the targeted user's image retweet behavior.
In this paper, we study the image retweet prediction problem from the viewpoint of attentional multi-faceted ranking network learning. We first propose a heterogeneous image retweet modeling (IRM) network that exploits multi-modal image tweets, users' retweet behaviors and their following relations for image retweet prediction. We introduce textually guided multi-modal neural networks with two sub-networks, where recurrent neural networks learn semantic representations of the image tweets' contextual information and convolutional neural networks learn visual representations. A multi-modal fusion layer is added to learn the joint image tweet representation from the textually guided multi-modal neural networks. We develop an attentional multi-faceted ranking method with the introduced multi-modal neural networks, such that the multi-faceted ranking metric is implicitly embedded in the user preference representation for image retweet prediction. The main contributions of this paper are summarized as follows:
Unlike previous studies, we present the image retweet prediction problem from the viewpoint of attentional multi-faceted ranking network learning. We propose a heterogeneous IRM network to model the problem, which exploits multi-modal image tweets, users' retweet behaviors and their following relations.
We develop an attentional multi-faceted ranking method with textually guided multi-modal neural networks to learn the user preference representation based on retweeted tweets and following relations for image retweet prediction.
We evaluate the performance of our method on a dataset collected from Twitter. Extensive experiments show that our method outperforms several state-of-the-art solutions to the problem.
The rest of this paper is organized as follows. In Section II, we present the problem of image retweet prediction from the viewpoint of attentional multi-faceted ranking network learning. Experimental results are presented in Section III. We provide a brief review of the related work on retweet prediction in Section IV. Finally, we provide some concluding remarks in Section V.
II Image Retweet Prediction via Attentional Ranking Network Learning
In this section, we first present the problem of image retweet prediction from the viewpoint of heterogeneous image retweet modeling network learning. We then propose the attentional multi-faceted ranking method based on social impact of the relative followee preference. We devise the textually guided multi-modal network to guide the image region through the user’s contextual attention, thus jointly representing the image tweet and its captions or comments.
II-A The Problem
Before presenting the problem, we first introduce some basic notions and terminologies. Since image tweets are visual data, it is natural to employ deep convolutional neural networks to learn their visual representation. Given a set of image tweets, we first extract each image tweet's convolutional feature from the last convolutional layer of a pretrained CNN; this is a 3-dimensional feature containing both the location and the visual information of the image. We also obtain the image's visual embedding from the last fully connected layer of the same convolutional neural network. In the next section we introduce how to use the visual embedding to guide the location within the image's convolutional feature. On the other hand, the textual context information of image tweets, such as users' comments and captions, also gives a better understanding of image tweets. We thus employ deep recurrent neural networks to learn its semantic representation. Given a set of textual contexts, we take the last hidden layer of the recurrent neural network as the semantic embedding of each textual context, so that every image tweet has one semantic embedding per caption or comment. Each image tweet then has a joint representation built from its visual representation and its contextual semantic representation. Finally, each user has a preference representation embedding, and the set of these embeddings forms the set of ranking models for user preference representation.
Existing approaches for retweet prediction [3, 1, 4, 5, 6] learn a user preference model from users' past retweeted textual tweets and then predict users' tweet-sharing behavior. Unlike previous studies, we propose an attentional multi-faceted ranking metric over a heterogeneous IRM (i.e., image retweet modeling) network that exploits multi-modal image tweets, users' past retweet behaviors and their following relations for image retweet prediction. The proposed heterogeneous IRM network is a graph whose nodes are the joint image tweet representations and the user preference representations, and whose edges are users' past retweet behaviors and their following relations. We record the retweet behaviors between image tweets and users in a binary matrix whose entry for an image tweet and a user is 1 if that image tweet is retweeted by that user and 0 otherwise. We likewise record the following relations between users in a binary matrix whose entry for a pair of users is 1 if the first user follows the second. For each user we also define the set of his/her followees, and we collect these sets over all users. We illustrate a simple example of the heterogeneous IRM network in Figure 2(a).
We then derive heterogeneous triplet constraints from the IRM network as the users' relative preferences for training the attentional multi-faceted ranking networks. We consider that a user expresses explicit positive interest in an image tweet when he/she retweets it in the IRM network. On the other hand, following existing Twitter analysis works, we consider that a user may show implicit negative interest in the non-retweeted image tweets of his/her followees. This is because the image tweets of followees that the user did not retweet are more likely to have been seen but disliked by the user.
Given a retweet behavior between an image tweet and a user (i.e., the user retweeted that image tweet), we sample a non-retweeted image tweet from the user's followees as the negative example. Following the popular homophily hypothesis, we also incorporate users' followee preferences for image tweet modeling. We then model a user's relative preference by an ordered tuple, meaning that "the user prefers the retweeted image tweet to the sampled non-retweeted one". The set of ordered tuples is obtained from the IRM network for a set of image tweets and users. We consider these ordered heterogeneous tuples as the constraints for learning user preference representations. More formally, we aim to learn the multi-faceted ranking metric function for image retweet prediction such that, for every ordered tuple, the user's multi-faceted ranking model scores the retweeted image tweet higher than the non-retweeted one.
The multi-faceted ranking model of a user combines two functions: the personalized ranking model of that user, and a function that models the social impact of the relative followee preference on that user. The personalized ranking function is defined on the relative preference of the user and the joint representation of the image tweet. We will present the details of the social impact function in the next section.
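The tuple construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_ordered_tuples`, the dictionary-based inputs, and the choice to treat "image tweets of a followee" as the tweets that followee retweeted are all assumptions made for the example.

```python
import random


def build_ordered_tuples(retweets, followees, negatives_per_positive=1, seed=0):
    """Return (user, retweeted_tweet, non_retweeted_tweet) tuples.

    retweets:  dict mapping a user id to the set of tweet ids the user retweeted.
    followees: dict mapping a user id to the list of user ids the user follows.
    """
    rng = random.Random(seed)
    tuples = []
    for user in sorted(retweets):
        own = retweets[user]
        # Candidate negatives: tweets reachable via followees (assumption: the
        # tweets they retweeted) that the user did not retweet.
        candidates = set()
        for followee in followees.get(user, []):
            candidates |= retweets.get(followee, set())
        candidates = sorted(candidates - own)
        if not candidates:
            continue  # no followee tweets to contrast against
        for positive in sorted(own):
            for _ in range(negatives_per_positive):
                tuples.append((user, positive, rng.choice(candidates)))
    return tuples
```

Each returned tuple reads "the user prefers the second element to the third", which is the form of constraint used to train the ranking model.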
Using the notations above, we define the problem of image retweet prediction from the viewpoint of attentional multi-faceted ranking network learning as follows. Given the input image tweets with their associated contexts, the set of ordered tuples for users' relative preferences, and the heterogeneous IRM network, our goal is to learn the multi-faceted ranking metric representations for all user preferences and the multi-modal image tweet contents, and then rank the image tweets for the targeted users. The image tweets are ranked for each user according to that user's multi-faceted preference function.
II-B Attentional Textually Guided Ranking Network Learning
In this section, we propose the attentional multi-faceted ranking network with the textually guided multi-modal layer for image retweet prediction. We present the learning process in Figures 2(a), 2(b) and 2(c).
We first choose proper multi-modal neural networks for image tweet representation in IRM networks. They consist of two sub-networks: a deep convolutional neural network for the visual representation of image data, and a deep recurrent neural network for the semantic representation of textual contextual data. These two sub-networks interact with each other in a multi-modal fusion layer to form the joint representation, as illustrated in Figures 2(b) and 2(c). For the visual representation of the image data, we use the activations of the last convolutional layer and the last fully connected layer of the Inception Net convolutional neural network, which has been widely used in many visual representation tasks [16, 17, 18]. Meanwhile, we train LSTM networks on the associated contexts of the image tweets and take the output of the last LSTM cell as the semantic representation. Because the associated context of an image tweet may be a paragraph of several sentences containing user comments and captions, we split it into sentences and learn the semantic representation of each sentence with the LSTM networks.
In order to learn the joint representation of image tweets with different modalities, a simple way is to set up a linear-sum multi-modal layer that connects the textual representation from the recurrent neural network part and the visual representation from the convolutional neural network part. The different textual representations of one image tweet are first fused by an additional max-pooling layer. We then map the activations of the two layers (i.e., the visual representation of the image tweet and the semantic representation of its textual contexts) into the same multi-modal feature fusion space, each through its own weight matrix, and add them element-wise to obtain the activation of the multi-modal fusion layer. The sum is passed through the element-wise scaled hyperbolic tangent function, which forces the gradients into the most non-linear value range and leads to a faster training process.
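The linear-sum fusion described above can be sketched in a few lines. This is an illustrative sketch under assumptions: the constants 1.7159 and 2/3 are the commonly used scaled-tanh constants and are not confirmed by the text, and the names `linear_fusion`, `W_text` and `W_visual` are hypothetical.

```python
import math


def scaled_tanh(x):
    # Assumed constants for the scaled hyperbolic tangent (not given in the text).
    return 1.7159 * math.tanh(2.0 * x / 3.0)


def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def max_pool(vectors):
    # Element-wise maximum over the sentence embeddings of one image tweet.
    return [max(column) for column in zip(*vectors)]


def linear_fusion(text_embeddings, visual_embedding, W_text, W_visual):
    """Map both modalities into one space, add them, apply scaled tanh."""
    pooled = max_pool(text_embeddings)
    text_part = matvec(W_text, pooled)
    visual_part = matvec(W_visual, visual_embedding)
    return [scaled_tanh(t + v) for t, v in zip(text_part, visual_part)]
```

Both weight matrices must have the same number of rows, since that row count is the dimension of the shared fusion space.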
However, such a simple method does not take advantage of the contextual relation between different comments and their matched image tweets. In order to get a more relevant representation of image tweets and textual comments, we set up a textually guided multi-modal fusion layer that connects the textual representation from the recurrent neural network part and the visual representation from the convolutional neural network part, as illustrated in Figure 3. Because each image tweet has many captions and comments from its publisher and subscribers, we suppose that different comments express both associated and extended information about the image. Therefore, instead of using the visual feature from the last fully connected layer of the pretrained CNN, we use the image's convolutional feature, which contains both the location and the visual features of the image, to generate an appropriate representation of the users' focus on the image tweet.
In order to locate the proper image region for the user's focus, we introduce a location mapping vector whose two components are the x-axis and y-axis coordinates in the image convolutional feature. Given the convolutional feature and the location mapping vector, the conv-locating step in Figure 3 extracts a multi-dimensional feature from the convolutional feature, centered at that location. We then fuse the textual embedding with the extracted convolutional feature using the attention mechanism. Given the semantic representation of a comment of an image and the extracted multi-dimensional feature, the textual attention score for that comment and that convolutional feature is computed with two parameter matrices, a bias vector, and a parameter vector. Each comment's score is then converted into a score activation, and these activations determine the textual impact on the image convolutional feature.
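As a rough illustration of this attention step, the sketch below scores each comment against an extracted region feature, normalizes the scores, and aggregates the comments. Several choices are assumptions rather than the paper's definitions: the additive form with a tanh non-linearity, the softmax normalization as the "score activation", and the weighted sum of comment embeddings as the "textual impact". All names are hypothetical.

```python
import math


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_scores(comments, region, W_c, W_r, b, p):
    """One score per comment: p . tanh(W_c h + W_r c + b) (assumed form)."""
    projected_region = matvec(W_r, region)
    scores = []
    for h in comments:
        projected_comment = matvec(W_c, h)
        hidden = [math.tanh(x + y + z)
                  for x, y, z in zip(projected_comment, projected_region, b)]
        scores.append(sum(pi * hi for pi, hi in zip(p, hidden)))
    return scores


def textual_impact(comments, region, W_c, W_r, b, p):
    """Return (attention-weighted sum of comment embeddings, weights)."""
    weights = softmax(attention_scores(comments, region, W_c, W_r, b, p))
    dim = len(comments[0])
    impact = [sum(w * h[d] for w, h in zip(weights, comments)) for d in range(dim)]
    return impact, weights
```

The same additive-attention pattern applies whenever one query (here the region feature) is compared against several candidates (here the comments).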
In order to get a high-level representation of the attentional image feature combined with the textual information, we use another recurrent neural network to infer the location of the next image region. At each time step, the RNN takes its input and produces a hidden state and an output. The visual feature from the last fully connected layer of the pretrained CNN is taken as the image's global information to facilitate the locating process. Given the image's visual embedding and the RNN's output at the current step, the next location mapping vector is obtained by mapping each of them through its own weight matrix, adding the results element-wise across the two modalities, and applying the element-wise scaled hyperbolic tangent function.
We define the procedure described above as the textually guided process. By stacking our model with the recurrent neural network, we obtain the next location mapping vector and the RNN's hidden state at each iteration, and a transformation matrix is used to compute the joint representation of each image tweet. The initial value is set randomly, and the output of the last iteration is taken as the joint representation of the image tweet.
We then present the attentional multi-faceted ranking function learning for image retweet prediction. Inspired by the attention mechanism [13, 20], we design the social impact function based on the ordered tuple constraints as follows. Given the user preference representations, the social preference attention score for a user and one of his/her followees is computed with two parameter matrices, which model the preference correlation between the user and the followee, together with a bias vector and a parameter vector. Each followee's score is then converted into a preference activation, and these activations determine the social impact of the relative followee preference on the user.
Given the formulation of the personalized ranking function and the social impact function, we now design the attentional multi-faceted ranking loss function over the ordered tuples. The loss compares the ranking function evaluated on the positive preference (the retweeted image tweet) with the ranking function evaluated on the negative preference (the sampled non-retweeted image tweet). A hyper-parameter controls the margin in the loss function.
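To make the roles of the personalized score, the social impact term, and the margin concrete, here is a small sketch. It rests on assumptions that the text does not confirm: the personalized score is taken to be an inner product, the two terms are simply added, the social term is an attention-weighted sum of followee scores, and the loss is a hinge loss. The function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def multi_faceted_score(user_vec, followee_vecs, attention_weights, tweet_vec):
    """Personalized score plus attention-weighted followee scores (assumed form)."""
    personalized = dot(user_vec, tweet_vec)
    social = sum(w * dot(f, tweet_vec)
                 for w, f in zip(attention_weights, followee_vecs))
    return personalized + social


def margin_ranking_loss(score_positive, score_negative, margin):
    # Zero once the positive tweet outscores the negative one by at least `margin`.
    return max(0.0, margin - (score_positive - score_negative))
```

For example, with a user vector `[1.0, 0.0]`, no followees, a retweeted tweet `[2.0, 0.0]` and a non-retweeted tweet `[1.0, 0.0]`, the scores differ by 1.0, so a margin of 0.6 gives zero loss while a margin of 1.5 gives a loss of 0.5.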
We next introduce the details of the proposed attentional multi-faceted ranking network learning. We collect all model coefficients, including the neural network parameters, the joint image tweet representations and the user preference representations, into a single parameter set. The objective function of the learning process is the training loss plus a regularization term on these coefficients, with a trade-off parameter balancing the two. To optimize the objective, we employ stochastic gradient descent (SGD) with the diagonal variant of AdaGrad.
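A minimal sketch of one diagonal AdaGrad update on a regularized objective is given below. It assumes a squared L2 regularizer (the text only says "regularization term"), uses flat Python lists for the parameters, and borrows the learning rate of 0.01 reported in the experiments as a default; the function names are hypothetical.

```python
import math


def regularized_gradient(loss_grads, params, trade_off):
    # Gradient of: loss + trade_off * sum(p ** 2), assuming an L2 regularizer.
    return [g + 2.0 * trade_off * p for g, p in zip(loss_grads, params)]


def adagrad_step(params, grads, accum, learning_rate=0.01, eps=1e-8):
    """Update params in place with per-coordinate (diagonal) AdaGrad.

    accum holds the running sum of squared gradients for each coordinate.
    """
    for i, g in enumerate(grads):
        accum[i] += g * g
        params[i] -= learning_rate * g / (math.sqrt(accum[i]) + eps)
    return params, accum
```

Because each coordinate is scaled by its own gradient history, rarely updated parameters (for example, embeddings of users with few retweets) keep a larger effective step size than frequently updated ones.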
III-A Data Preparation
III-A1 Information of Dataset
We collect data from Twitter, a popular microblog service for Web users to share their media content. Users usually show their positive preference for image tweets by retweeting them in social media sites. We crawl the profiles of the users, including their past retweeted image tweets and their following relations. In total, we collect 9,900 users, 7,193 image tweets and 29,501 following relations. The average number of times an image tweet is retweeted by the collected users is 12.2, and the average number of image tweets that a collected user retweets is 9.1. The average number of followees among the collected users is 6.2, and the maximum number of followees is 162. The average number of words in the context of an image tweet is 9.1, with a standard deviation of 5.4. For each retweet behavior of a user, we sample two negative image tweets from his/her followees. We sort users' retweet behaviors by timestamp and use the first 60%, 70% or 80% of the data as the training set and the remainder for testing, so the training and testing data do not overlap. The validation data is obtained separately from the training and testing data. The dataset will be released later for further study.
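The chronological split described above can be sketched as follows. This is an illustrative helper, not the authors' code: the `(timestamp, user, tweet)` event layout and the function name are assumptions, and the ratio is passed as an integer percentage to avoid floating-point rounding at the cut point.

```python
def chronological_split(events, train_percent):
    """Split retweet events by time: earliest events train, the rest test.

    events: iterable of (timestamp, user_id, tweet_id) tuples.
    train_percent: integer percentage of events used for training, e.g. 60.
    """
    ordered = sorted(events, key=lambda event: event[0])
    cut = len(ordered) * train_percent // 100
    return ordered[:cut], ordered[cut:]
```

Calling the helper with 60, 70 and 80 yields the three training proportions used in the experiments; every training event precedes (or ties with) every test event in time.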
Figure 4 shows the distribution of image retweets in our dataset. The number of retweets for each image is mostly between 1 and 10. The distributions of all users' followees and followers are also shown in Figure 4, which indicates that the number of followees/followers for most users is between 3 and 7. The figure also shows that the follower and followee counts of the users are similarly distributed.
III-A2 Image Feature Extraction
We pre-process the collected image tweets as follows. We extract the global feature from the last fully connected layer of the pretrained Inception-V4 network as the image's feature embedding, which is a 1536-dimensional vector. To meet the needs of our textually guided multi-modal network, we also extract the image feature from the last convolutional layer of the same pretrained network, obtaining an 8x8x1536 feature map for each image.
III-A3 Text Feature Extraction
We first filter out all emoji and interjections from the captions and comments. Then, for each word in a sentence, we employ the pretrained GloVe model to extract its semantic representation. The dimension of the word vectors is 300. Specifically, we use four sentences for each image tweet, and the length of each sentence is 12. For image tweets with fewer than 4 captions or comments, we duplicate the last comment for padding. The vocabulary size is set to 12,500 for our dataset. We use the token unk for out-of-vocabulary words and eos to mark the end of a caption or comment.
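The sentence preparation described above can be sketched as follows. This is a simplified illustration under assumptions: tokenization and emoji filtering are taken as already done, the end marker is counted inside the 12-token limit, and padding of short sentences up to 12 tokens is omitted because the text does not say how it is done. The function name is hypothetical.

```python
def prepare_sentences(sentences, vocabulary, num_sentences=4, max_len=12,
                      unk="unk", eos="eos"):
    """Map tokens to the vocabulary, append eos, and fix the sentence count.

    sentences:  list of token lists (captions or comments of one image tweet).
    vocabulary: set of known tokens; anything else becomes `unk`.
    """
    processed = []
    for tokens in sentences[:num_sentences]:
        # Keep room for the end marker (assumption: eos counts toward max_len).
        mapped = [t if t in vocabulary else unk for t in tokens][:max_len - 1]
        mapped.append(eos)
        processed.append(mapped)
    # Duplicate the last comment until there are `num_sentences` of them.
    while processed and len(processed) < num_sentences:
        processed.append(list(processed[-1]))
    return processed
```

An image tweet with a single caption therefore yields four identical token sequences, matching the duplication rule above.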
III-B Evaluation Criteria
In most online media services, the retweet prediction task aims at providing the top-K image tweets to a user. To evaluate the effectiveness of our method in terms of the top-K ranked image tweets, we adopt two ranking-based evaluation criteria, Precision@K and AUC [24, 25, 26], to evaluate the performance of image retweet prediction. Given the test set of users and image tweets, we consider, for a given user, the predicted ranking of the top image tweets from the test set, where the size of the ranking list is K.
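Since the formal definitions of the two criteria are not reproduced here, the sketch below uses their standard per-user forms, which is an assumption about what the paper computes: Precision@K as the fraction of the top-K ranked tweets that the user actually retweeted, and AUC as the fraction of (retweeted, non-retweeted) pairs that are ranked in the correct order. The function names are hypothetical.

```python
def precision_at_k(ranked_tweets, relevant, k):
    """Fraction of the top-k ranked tweets that the user actually retweeted."""
    top = ranked_tweets[:k]
    return sum(1 for tweet in top if tweet in relevant) / k


def auc(scores, relevant):
    """Fraction of (relevant, irrelevant) pairs ranked correctly; ties count half.

    scores: dict mapping tweet id to predicted score for one user.
    """
    positives = [s for t, s in scores.items() if t in relevant]
    negatives = [s for t, s in scores.items() if t not in relevant]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one relevant and one irrelevant tweet")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

Both functions score a single user; reported figures would then be averages over all test users.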
III-C Performance Comparison
We compare the performance of our methods AMNL (which uses only the linear fusion method) and AMNL+ (which uses the textually guided multi-modal network) against five other state-of-the-art solutions to the problem of image retweet prediction, as follows:
The CITING method is a context-aware image tweet modelling framework, which explores both the image's intrinsic context and its extrinsic context, such as Web URLs, for the learning of image tweets.
The VBPR method is a scalable factorization model, which encodes the visual signal of a product with a deep network to predict users' feedback.
The FAMF method is an optimization of Bayesian analysis for item recommendation, where a personalized ranking criterion and a generic algorithm are designed for the item prediction task.
The ADABPR method is an improvement of the pairwise algorithm for recommendation systems, where a non-uniform item sampler is used to accelerate the convergence of the learning network.
The RRFM method is a relaxed ranking-based factor model, which builds a two-level optimization for pairwise ranking.
Figure 8: Objective value and running time versus the number of epochs.
Existing retweet prediction methods are mainly based on low-rank factorized ranking models. The methods FAMF, ADABPR and RRFM learn a factorized ranking metric based on pairwise preference constraints. The methods CITING and VBPR are feature-aware factorized ranking algorithms based on pairwise preference constraints and features of the item contents.
We extract the features of the item contents as follows. The input words of all textual information are initialized with pre-calculated word embeddings, and the input visual representations of image tweets are initialized with Inception-Net. The parameters of the neural networks used to obtain the representations of the visual content and the textual context are updated during the training process. The weights of the deep neural networks are randomly initialized from a Gaussian distribution with zero mean in our experiments. Following the experimental settings in [2, 24], we use the associated textual contexts as the side information of the method CITING and the visual representation of image tweets as the side information of the method VBPR. The hyper-parameters and parameters that achieve the best performance on the validation set are chosen for the testing evaluation. We set the learning rate to 0.01 for the gradient method. We consider that the top 3 tweets a user wants to retweet reveal the discriminative characteristics of the tweets that the user wants to retweet, so we evaluate the ranking performance of all methods on the quality of the top 3 ranked image tweets. In order to show the effectiveness of our textually guided multi-modal fusion, we also evaluate the ranking performance of AMNL with the simple fusion method described above. To examine how the visual representation of image tweets and the semantic representation of the associated contexts affect the performance of our method, we also evaluate three single-modality variants: AMNL with the visual representation of image tweets only, AMNL with the semantic representation of the associated contexts only, and AMNL+ with the visual representation of image tweets only.
The tables show the evaluation results of all methods on the ranking criteria Precision@1, Precision@3 and AUC, respectively. Evaluations were conducted with different ratios of the data as the training set, from 60% and 70% to 80%. We report the results of all methods on the three ranking evaluation criteria. We then report the performance of our model with different modalities, where the dimension of the user preference representation is set to 400 and 80% of the data is used for training. All other parameters and hyper-parameters are also chosen to give the best performance on the validation set. We evaluate the average value of all three criteria for the six methods. These experimental results reveal a number of interesting points:
The methods that use content features as side information for learning the ranking metric, CITING and VBPR, outperform the low-rank factorized ranking metric methods FAMF, ADABPR and RRFM, which suggests that deep neural networks using both the image tweets and the associated context information are critical for the problem of image retweet prediction.
Compared with the other ranking methods that use side information, the visual-only variant of our method AMNL achieves better performance than the method VBPR, and the context-only variant of AMNL achieves better performance than the method CITING. This suggests that the multi-faceted ranking metric is important for the problem.
Compared with our method AMNL, our method AMNL+ achieves better performance. This suggests that, through the textually guided multi-modal fusion method, image tweets can be better jointly represented with the different captions or comments that contain the associated semantic information, thus obtaining better performance in image retweet prediction.
In all cases, our AMNL+ method achieves the best performance. This shows that the attentional multi-faceted ranking network learning framework, which exploits both the joint representation of multi-modal image tweets with their associated contexts and the multi-faceted ranking metric, can further improve the performance of image retweet prediction.
We also illustrate the experimental results of our AMNL+ on some users' image retweet prediction in Figures 9(a) and 9(b). Figure 9(a) shows the user and the images published by the user's followees. Their low ranking scores indicate that the non-retweeted image tweets published by followees are more likely to have been seen but disliked by the user. Figure 9(b) shows the predicted image and its comments, which has a high score. This suggests that the image predicted by our method is preferable for the user in Figure 9(a). It is also worth mentioning that some specific words are matched with objects marked by the same color in the image, which shows the effectiveness of the guidance provided by comments and captions.
III-D Hyper-Parameter Analysis
In our approach, there are three essential hyper-parameters: the dimension of the user preference representation, the dimension of the recurrent neural network units and the margin in the loss function. In order to study their effect, we vary the dimension of the user preference representation from 100 to 500, the dimension of the recurrent neural network units from 200 to 1200 and the margin in the loss function from 0.1 to 0.9. We show the effect of these hyper-parameters on Precision@1, Precision@3 and AUC, using 60% of the data for training, in Figures 5(a), 5(b) and 5(c). As shown in the figures, changing the hyper-parameters has a relatively stable effect on the performance of the model, and the variation tendency is the same across criteria. We also find that, as the dimension of the user preference representation changes, all three criteria change over a larger range than for the other two hyper-parameters, which indicates that the dimension of the user preference representation is essential for users' image retweet prediction. Our method achieves its best performance when the dimension of the user preference representation is set to 400, the dimension of the recurrent neural network units is set to 1000 and the margin in the loss function is set to 0.6, across the different proportions of training data.
The updating rule for training our proposed attentional multi-faceted ranking network learning method is essentially iterative. Here we investigate how our AMNL method converges. Figures 8(a) and 8(b) show the convergence and running time curves of the AMNL method, respectively. The x-axis denotes the iteration number in both figures. The y-axis in Figure 8(a) denotes the objective value and the y-axis in Figure 8(b) shows the running time of our proposed method. Each epoch contains 22,881 iterative updates. We set the dimension of the user preference representation to 400 and use 80% of the data for training. Our method converges after the 9th epoch, and the computation cost is less than 50 minutes. This study validates the efficiency of our method.
III-E Ablation Study
In this part, we evaluate the contribution of our technical components: the textually guided multi-modal fusion network and the social impact function. We also evaluate the effect of visual representation of image tweets, semantic representation of the associated contexts and the joint image tweet representation to our model.
To understand the contribution of the components and the effect of the different media on our model, we conduct an ablation study and report the results in Table IV. We explore our model in the following ways. The visual-only variant of AMNL uses the visual representation of image tweets only. The context-only variant of AMNL uses only the semantic representation of the associated contexts. The variant of AMNL+ without attention feeds the average pooling of the convolutional feature of image tweets directly into the recurrent neural networks in the textually guided multi-modal fusion network, instead of using the attention mechanism with the textual representation. The variants of AMNL and AMNL+ without social impact calculate the ranking function directly for the two models without using the social impact function. As shown in Table IV, we find some interesting results.
Compared with the visual-only and context-only variants of AMNL, our full method AMNL achieves better performance. This suggests that the attentional multi-faceted ranking network learning framework that exploits the joint representation of multi-modal image tweets and their associated contexts performs better than the same framework exploiting only the representation of the tweets' images or only the representation of the tweets' contexts.
Compared with the AMNL+ variant without the social impact function, the full AMNL+ gets better scores on all three criteria. This suggests that the social impact function helps improve the performance of our method. The results of AMNL with and without the social impact function further show that this finding is consistent across different components.
IV Related Work
Retweet prediction has been studied deeply and extensively in recent years. Retweeting is a means of information dissemination in today's social media. We divide the current research on modeling users' retweet behavior into three aspects: feature selection for user retweet behavior, representation for retweet modeling, and user retweet ranking. In this section, we briefly review some related work in all three aspects.
IV-A Feature Selection of User Retweet Behavior
How to choose the relevant factors that affect users' retweet behavior has been well studied. One study examines four types of features related to the retweetability of each tweet by training a prediction model. Another collects both content and contextual features from a Twitter dataset and evaluates their effect on retweet behavior; the experiment indicates that contextual features contribute greatly to the retweet rate, while the distribution of past tweets does not influence the user's retweetability. A third integrates social role recognition and information diffusion into a single framework, modeling the interplay of users' social roles. Another work examines a number of semantic features to learn the sentiment representation of tweets. A further study explains that user retweet behavior in unfamiliar areas can be better understood by assessing different predictive models and features. Another studies the factors of user posting behavior, which consist of breaking news, posts from the user's social friends and the user's intrinsic interest; the authors also present a latent model to further demonstrate the effectiveness of these factors. Finally, one work models both the user's social relations and other factors to perform retweet prediction; in addition, the authors take the varying strength of social correlation into consideration by dividing relations into different categories, such as friends or co-workers. Different from existing methods, our method gathers image tweets together with their captions or comments. We suppose that different captions or comments not only represent extensive semantic information about the image, but also correlate with each other because of the users' social interaction.
IV-B Representation for Retweet Modeling
A number of studies aim at modeling users’ retweet representation. One predicts human retweet behavior with a machine learning approach based on the passive-aggressive algorithm. Another develops a learning-to-rank framework to explore various retweet features. A third considers the task from the perspective of temporal information diffusion: the model learns a diffusion kernel in which the infection time in cascades is represented by the distance between nodes in the projection space. Another proposes a factorization machine with a ranking-based function, extended from a recommendation model, to integrate various aspects of a Twitter dataset. One study treats retweeting as a conversational practice in which authorship and communicative fidelity are negotiated. Another treats retweet behavior as a three-dimensional tensor of tweets, tweet authors, and their followers, and represents them simultaneously by tensor factorization. A further work collects the interplay of users and contextual information, using a support vector data description to predict future interplay. Yet another deploys a matrix completion approach to optimize the factorization of users’ retweet representation. Although previous studies have explored a wide range of representation learning methods for retweet modeling, most of them do not specifically consider the joint representation of image retweets and their captions or comments. For this purpose, we propose the textually guided multi-modal network and evaluate its effectiveness on a Twitter dataset.
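To make the passive-aggressive approach mentioned above more concrete, the following is a minimal sketch of the standard passive-aggressive update for binary classification (retweeted versus not retweeted). It is not the implementation of any cited work: the feature vectors, the function names `pa_update` and `train`, and the label encoding in {-1, +1} are assumptions introduced here for illustration only.

```python
from typing import Iterable, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def pa_update(w: List[float], x: Sequence[float], y: int) -> List[float]:
    """One passive-aggressive step for a label y in {-1, +1} (hypothetical encoding)."""
    loss = max(0.0, 1.0 - y * dot(w, x))  # hinge loss on this example
    norm_sq = dot(x, x)
    if loss == 0.0 or norm_sq == 0.0:
        return list(w)  # passive: the margin is already satisfied
    tau = loss / norm_sq  # aggressive: smallest step that restores the margin
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


def train(stream: Iterable[Tuple[Sequence[float], int]], dim: int) -> List[float]:
    """Process (features, label) pairs one at a time, as in online learning."""
    w = [0.0] * dim
    for x, y in stream:
        w = pa_update(w, x, y)
    return w
```

The update leaves the weights unchanged when an example is already classified with sufficient margin, and otherwise moves them just far enough to satisfy the margin on that example, which makes the method suitable for streaming tweet data.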
IV-C User Retweet Ranking
The central problem of retweet prediction is to model the tweet sharing behavior by which users repost tweets along followee-follower links, and to rank the tweets that emerge in social media so that more users are informed in SMS; this problem has attracted considerable attention recently [2, 3, 1, 4, 43, 5]. Chen et al. exploit various contexts for image understanding and retweet prediction. Firdaus et al. propose a retweet prediction model that considers a user’s behavior both as an author and as a retweeter. Zhang et al. propose non-parametric models that combine structural, textual, and temporal information to predict retweet behavior. Zhang et al. propose deep neural networks that incorporate contextual and social information. Wang et al. present a recommendation model to solve the problem of whom to mention in a tweet. Feng et al. propose a feature-aware factorization model to re-rank tweets, which unifies the linear discriminative model and the low-rank factorization model. Peng et al. model retweet behavior and rank tweets using conditional random fields. Zhang et al. employ social influence locality to rank users’ retweet rates. Unlike previous studies, we formulate the problem of image retweet prediction from the viewpoint of attentional multi-faceted ranking network learning, which can be solved by negative-sample-based ranking metric learning with multi-modal neural networks.
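As an illustration of ranking with negative samples, the sketch below computes a margin-based pairwise loss between a retweeted (positive) tweet and a sampled non-retweeted (negative) tweet. It is only a sketch under stated assumptions: the dot-product scoring function, the margin value, the uniform sampling strategy, and all names (`score`, `sample_negative`, `ranking_loss`) are hypothetical and are not taken from the method described in this paper.

```python
import random
from typing import Sequence, Set


def score(user_vec: Sequence[float], tweet_vec: Sequence[float]) -> float:
    """Assumed relevance score: inner product of user and tweet representations."""
    return sum(u * t for u, t in zip(user_vec, tweet_vec))


def sample_negative(all_tweets: Sequence[str], retweeted: Set[str],
                    rng: random.Random) -> str:
    """Uniformly sample a tweet id that the user has not retweeted."""
    candidates = [t for t in all_tweets if t not in retweeted]
    if not candidates:
        raise ValueError("no negative candidates available")
    return rng.choice(candidates)


def ranking_loss(user_vec: Sequence[float], pos_vec: Sequence[float],
                 neg_vec: Sequence[float], margin: float = 1.0) -> float:
    """Hinge loss that is zero once the positive outscores the negative by `margin`."""
    return max(0.0, margin + score(user_vec, neg_vec) - score(user_vec, pos_vec))
```

Minimizing this loss over many (user, positive, negative) triples pushes retweeted tweets above non-retweeted ones in each user's ranking, which is the general idea behind pairwise ranking objectives.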
In this paper, we introduced the problem of image retweet prediction from the viewpoint of attentional multi-faceted ranking network learning. We proposed a heterogeneous IRM network that exploits users’ past retweeted image tweets, the associated textual context, and users’ following relations. We presented a novel attentional multi-faceted ranking network learning method with textually guided multi-modal neural networks to learn joint image tweet representations and user preference representations, such that the multi-faceted ranking metric is embedded in the representations used for prediction. We evaluated the performance of our method on a dataset from Twitter. Extensive experiments demonstrate that our method achieves better performance than several state-of-the-art solutions.
-  Q. Zhang, Y. Gong, Y. Guo, and X. Huang, “Retweet behavior prediction using hierarchical dirichlet process.” in AAAI, 2015, pp. 403–409.
-  T. Chen, X. He, and M.-Y. Kan, “Context-aware image tweet modelling and recommendation,” in ACM Multimedia. ACM, 2016, pp. 1018–1027.
-  S. N. Firdaus, C. Ding, and A. Sadeghian, “Retweet prediction considering user’s difference as an author and retweeter,” in ASONAM. IEEE, 2016, pp. 852–859.
-  Q. Zhang, Y. Gong, J. Wu, H. Huang, and X. Huang, “Retweet prediction with attention-based deep neural network,” in CIKM. ACM, 2016, pp. 75–84.
-  W. Feng and J. Wang, “Retweet or not?: personalized tweet re-ranking,” in CIKM. ACM, 2013, pp. 577–586.
-  J. Zhang, J. Tang, J. Li, Y. Liu, and C. Xing, “Who influenced you? predicting retweet via social influence locality,” TKDD, vol. 9, no. 3, p. 25, 2015.
-  C. Szegedy, A. Toshev, and D. Erhan, “Deep neural networks for object detection,” in NIPS, 2013, pp. 2553–2561.
-  Z. Zhao, Q. Yang, H. Lu, T. Weninger, D. Cai, X. He, and Y. Zhuang, “Social-aware movie recommendation via multimodal network learning,” IEEE Transactions on Multimedia, vol. 20, no. 2, pp. 430–440, 2018.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli, “Multimodal fusion for multimedia analysis: a survey,” Multimedia systems, vol. 16, no. 6, pp. 345–379, 2010.
-  Z. Yuan, J. Sang, C. Xu, and Y. Liu, “A unified framework of latent feature learning in social media,” TMM, vol. 16, no. 6, pp. 1624–1635, 2014.
-  M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” arXiv:1508.04025, 2015.
-  K. Chen, T. Chen, G. Zheng, O. Jin, E. Yao, and Y. Yu, “Collaborative personalized tweet recommendation,” in SIGIR. ACM, 2012, pp. 661–670.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,”
-  H. Zhang, Z. Kyaw, S.-F. Chang, and T.-S. Chua, “Visual translation embedding network for visual relation detection,” in CVPR, 2017.
-  W. Zhao, Z. Guan, H. Luo, J. Peng, and J. Fan, “Deep multiple instance hashing for object-based image retrieval,” in Proceedings of the 26th International Joint Conference on Artificial Intelligence. AAAI Press, 2017, pp. 3504–3510.
-  Z. Zhao, J. Lin, X. Jiang, D. Cai, X. He, and Y. Zhuang, “Video question answering via hierarchical dual-level attention network learning,” in Proceedings of the 2017 ACM on Multimedia Conference. ACM, 2017, pp. 1050–1058.
-  Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, “Efficient backprop,” in Neural networks: Tricks of the trade. Springer, 2012, pp. 9–48.
-  Z. Zhao, Q. Yang, D. Cai, X. He, and Y. Zhuang, “Video question answering via hierarchical spatio-temporal attention networks,” in International Joint Conference on Artificial Intelligence (IJCAI), vol. 2, 2017.
-  D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  A. Java, X. Song, T. Finin, and B. Tseng, “Why we twitter: understanding microblogging usage and communities,” in SNA-KDD. ACM, 2007, pp. 56–65.
-  J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in EMNLP, 2014.
-  R. He and J. McAuley, “Vbpr: visual bayesian personalized ranking from implicit feedback,” arXiv:1510.01784, 2015.
-  S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, “Bpr: Bayesian personalized ranking from implicit feedback,” in UAI. AUAI Press, 2009, pp. 452–461.
-  H. Li, R. Hong, D. Lian, Z. Wu, M. Wang, and Y. Ge, “A relaxed ranking-based factor model for recommender system from implicit feedback,” 2016.
-  S. Rendle and C. Freudenthaler, “Improving pairwise learning for item recommendation from implicit feedback,” in WSDM. ACM, 2014, pp. 273–282.
-  Z. Xu and Q. Yang, “Analyzing user retweet behavior on twitter,” 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 46–50, 2012.
-  B. Suh, L. Hong, P. Pirolli, and E. H. hsin Chi, “Want to be retweeted? large scale analytics on factors impacting retweet in twitter network,” 2010 IEEE Second International Conference on Social Computing, pp. 177–184, 2010.
-  Y. Yang, J. Tang, C. W. ki Leung, Y. Sun, Q. Chen, J.-Z. Li, and Q. Yang, “Rain: Social role-aware information diffusion,” in AAAI, 2015.
-  E. Kouloumpis, T. Wilson, and J. D. Moore, “Twitter sentiment analysis: The good the bad and the omg!” in ICWSM, 2011.
-  S. A. Macskassy and M. Michelson, “Why do people retweet? anti-homophily wins the day!” in ICWSM, 2011.
-  Z. Xu, Y. Zhang, Y. Wu, and Q. Yang, “Modeling user posting behavior on social media,” in SIGIR, 2012.
-  B. Hu, M. Jamali, and M. Ester, “Learning the strength of the factors influencing user behavior in online social networks,” 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 368–375, 2012.
-  S. Petrovic, M. Osborne, and V. Lavrenko, “Rt to win! predicting message propagation in twitter,” in ICWSM, 2011.
-  Z. Luo, M. Osborne, J. Tang, and T. Wang, “Who will retweet me?: finding retweeters in twitter,” in SIGIR, 2013.
-  S. Bourigault, C. Lagnier, S. Lamprier, L. Denoyer, and P. Gallinari, “Learning social network embeddings for predicting information diffusion,” in WSDM, 2014.
-  L. Hong, A. S. Doumith, and B. D. Davison, “Co-factorization machines: modeling user interests and predicting individual decisions in twitter,” in WSDM. ACM, 2013, pp. 557–566.
-  D. Boyd, S. Golder, and G. Lotan, “Tweet, tweet, retweet: Conversational aspects of retweeting on twitter,” 2010 43rd Hawaii International Conference on System Sciences, pp. 1–10, 2010.
-  E.-P. Lim and T.-A. Hoang, “Retweeting: An act of viral users, susceptible users, or viral topics?” in SDM, 2013.
-  Y. Matsubara, Y. Sakurai, C. Faloutsos, T. Iwata, and M. Yoshikawa, “Fast mining and forecasting of complex time-stamped events,” in KDD, 2012.
-  B. Jiang, J. Liang, Y. Sha, and L. Wang, “Message clustering based matrix factorization model for retweeting behavior prediction,” in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, ser. CIKM ’15. New York, NY, USA: ACM, 2015, pp. 1843–1846. [Online]. Available: http://doi.acm.org/10.1145/2806416.2806650
-  B. Wang, C. Wang, J. Bu, C. Chen, W. V. Zhang, D. Cai, and X. He, “Whom to mention: expand the diffusion of tweets by@ recommendation on micro-blogging systems,” in Proceedings of the 22nd international conference on World Wide Web. ACM, 2013, pp. 1331–1340.
-  H.-K. Peng, J. Zhu, D. Piao, R. Yan, and Y. Zhang, “Retweet modeling using conditional random fields,” in ICDMW. IEEE, 2011, pp. 336–343.