1 Introduction
In recent years, as more and more people enjoy the services provided by Facebook, Twitter, Weibo, and similar platforms, information cascades have become ubiquitous in online social networks, which has motivated a large body of research Cheng et al. (2014); Li et al. (2017); Sun et al. (2017); Li et al. (2018); Liu et al. (2018). An important research topic is information cascade prediction, whose purpose is to predict who will be infected by a piece of information in the future Saito et al. (2008); Guille and Hacid (2012); Wang et al. (2017); Gao et al. (2017), where infection refers to the action of a user resharing (retweeting) or commenting on a tweet, a photo, or another piece of information Bourigault et al. (2014).
While many methods have been proposed for information cascade prediction Saito et al. (2008); Gomez-Rodriguez et al. (2011); Bourigault et al. (2016); Zhang et al. (2019); Varshney et al. (2017), the existing works often suffer from three defects. First, they often focus on predicting whether a node will be infected in the future given the nodes infected in the past, but ignore the prediction of infection order, i.e., which nodes will be infected earlier or later than others. However, predicting the infection order is important in many scenarios. For example, knowing who will be the next infected node is helpful for blocking rumor spread
Guille et al. (2013); Yang et al. (2020). Second, the existing methods often assume that information diffusion follows a parametric model such as the Independent Cascade (IC) model
Goldenberg et al. (2001) or the Susceptible-Infected (SI) model Radcliffe (1977). In the real world, however, information diffusion processes are so complicated that we seldom know exactly the underlying mechanisms of how information diffuses Steeg and Galstyan (2013). Finally, the existing works often assume that the explicit paths along which information propagates between nodes are observable. Yet in many scenarios we can only observe that nodes get infected but cannot know who infects them Bourigault et al. (2016). For example, in viral marketing, one can track whether a customer buys a product, but it is difficult to determine exactly who influences her/him. In this paper, we aim at the problem of information cascade prediction without requiring knowledge about the underlying diffusion mechanism or the diffusion network. This is not easy due to the following two major challenges:

Cascading Characteristics The probability that a node is infected by a cascade and the relative infection order mainly depend on its cascading characteristics, which reveal its relation to other nodes in that cascade. The existing methods often only take into consideration the static structural properties of nodes, for example, the node neighborship in a static social network. However, the cascading characteristics of a node intuitively vary across cascades, and different cascades can contain totally different infection ranges or orders of nodes. For example, in some cascades, one node may often get infected by certain nodes, but in other cascades, it may be more susceptible to different nodes, even though the node structural properties remain the same. Intuitively, different contents often lead to different cascading characteristics of a node and result in different underlying mechanisms in different cascades. However, in many situations it is not easy to recognize the content (i.e., what is diffused) and its underlying diffusion mechanism (i.e., why and how it is diffused). For example, we often do not know what virus is being propagated in a plague, but when and which nodes are infected can be observed. To make predictions for cascades in such situations, we have to explicitly model the observable cascading characteristics, which arguably also implicitly capture the effect of the unobservable content and underlying mechanism. Therefore, what cascading characteristics of nodes should be captured and how to capture them are crucial to our purpose.

Cascading Nonlinearity Information cascades are often nonlinear. The nonlinearity comes from two perspectives. One is the nonlinearity of the dynamics of the information cascades, and the other is the nonlinearity of the structure of the social networks on which cascades exist. The nonlinearity causes a problem: when nodes spread the content of a cascade, they exhibit nonlinear cascading patterns (e.g., emergence patterns) that the existing shallow models cannot effectively recognize. How to capture the nonlinear features of nodes in information cascades is also a critical challenge for our problem.
Inspired by the impressive network representation learning ability of deep learning demonstrated by recent works
Wang et al. (2016); Liao et al. (2018); Chang et al. (2015), we propose a novel model called Deep Collaborative Embedding (DCE) for the prediction of infection and infection order in cascades, which can learn the embeddings without assumptions about the underlying diffusion model and diffusion networks. The main idea of DCE is to collaboratively embed the nodes with a deep architecture into a latent space where the closer the embeddings of two nodes are, the more likely the two nodes will be infected in the same cascade and the closer their infection times will be. Different from the traditional network embedding methods Wang et al. (2016); Tang et al. (2015); Xie et al. (2019); Perozzi et al. (2014), which mainly focus on preserving the static structural properties of nodes in a network, DCE can capture not only the node structural property but also two kinds of node cascading characteristics that are important for the prediction of node infection and infection order. One is the cascading context, which reveals the temporal relation of nodes in a cascade. The cascading context of one node consists of two aspects: the potential influence it receives from earlier infected nodes, and their temporal relative positions in a cascade. The other kind of cascading characteristic captured by DCE is the cascading affinity, which reveals the co-occurrence relation of nodes in cascades. Cascading affinity essentially reflects the probability that two nodes will be infected by the same cascade. A higher cascading affinity between two nodes indicates that they are more likely to co-occur in a cascade. Intuitively, the cascading characteristics of nodes reflect the effect of the unobservable underlying diffusion mechanisms and diffusion networks.
Therefore, by explicitly preserving the node cascading characteristics, the learned embeddings also implicitly capture the effect of unobservable underlying diffusion mechanisms and diffusion network, which makes it feasible to make cascade predictions in terms of the similarity between embeddings in the latent space. As we will see later in the experiments, due to the ability to capture the cascading characteristics, the embeddings learned by DCE show a better performance in the task of infection prediction.
To effectively capture the nonlinearity of information cascades, we introduce an autoencoder-based collaborative embedding architecture for DCE. DCE consists of multilayer nonlinear transformations by which the nonlinear cascading patterns of nodes can be effectively encoded into the embeddings. DCE learns embeddings for nodes in a collaborative way, with two kinds of collaborations, i.e.,
cascade collaboration and node collaboration. At first, in light of the observation that a node often participates in more than one cascade of different contents, DCE collaboratively encodes a node's cascading context features in each cascade into its embedding. In other words, the embedding of a node is learned with the collaboration of the cascades the node participates in, which we call the cascade collaboration. At the same time, DCE can concurrently embed the nodes, during which the embedding of a node is generated under the constraints of its relation to other nodes, i.e., its cascading affinity to other nodes and its neighborship in social networks. In other words, the embeddings of nodes are learned with the collaboration of each other, which we call the node collaboration. The major contributions of this paper can be summarized as follows:

We propose a novel model called Deep Collaborative Embedding (DCE) for information cascade prediction without requiring knowledge of the underlying diffusion mechanism and the diffusion network. The node embeddings learned by DCE are beneficial not only to infection prediction but also to the prediction of the infection order of nodes in a cascade.

We propose an autoencoder-based collaborative embedding framework for DCE, which can collaboratively learn the node embeddings, preserving the node cascading characteristics, including cascading context and cascading affinity, as well as the structural property.

The extensive experiments conducted on real datasets verify the effectiveness of our proposed model.
The rest of this paper is organized as follows. We give the preliminaries in Section 2. The cascading context is defined and modeled in Section 3. In Section 4 we present our proposed model, and in Section 5 we analyze the experimental results. Finally, we briefly review the related work in Section 6 and conclude in Section 7.
2 Preliminaries and Problem Definition
Symbol  Description 

the number of nodes  
the number of cascades  
network  
the set of nodes  
the set of edges  
the set of cascades  
the cascading context matrix of cascade ,  
the cascading affinity matrix, 

the structural proximity matrix,  
the infection time of node in cascade  
the row vector of node in 

the learned embedding vector of node 
2.1 Basic Definitions
We denote a social network as , where is the set of nodes and is the set of edges. Let be the set of information cascades. An information cascade () observed on a social network is defined as a set of timestamped infections, i.e., , where represents that node is infected by cascade at time . We also say if node participates in cascade . Additionally, we use to denote the set of nodes infected by cascade before time , and the set of nodes which have not been infected before . Note that the nodes in might or might not be infected by after .
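To make the preliminaries concrete, a cascade can be represented as a mapping from infected nodes to their infection timestamps. The following sketch is illustrative only; the helper names and toy data are not from the paper:

```python
def infected_before(cascade, t):
    """Nodes of `cascade` infected strictly before time t."""
    return {v for v, tv in cascade.items() if tv < t}

def uninfected_before(nodes, cascade, t):
    """Nodes not yet infected before time t; they may still be infected after t."""
    return set(nodes) - infected_before(cascade, t)

# Hypothetical cascade: node -> infection timestamp
cascade = {"u1": 0.0, "u2": 1.5, "u3": 3.0}
nodes = {"u1", "u2", "u3", "u4"}
```

Note that `uninfected_before` deliberately includes nodes such as "u3" that are infected later, matching the definition above.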
2.2 Problem Definition
The target problem of this paper can be formulated as follows: given a set of information cascades observed on a given social network , we want to learn embeddings for the nodes in , where the learned embeddings preserve the cascading characteristics and structural properties of nodes, so that closer embeddings indicate that the corresponding nodes are more likely to be infected by the same cascade with closer infection times.
3 Modeling Cascading Characteristics
Cascading characteristics of a node reveal its relation to other nodes in information cascades and are crucial to the prediction of node infection and infection order. In this section, we define two kinds of cascading characteristics, the cascading context and the cascading affinity, which will be encoded into the learned embeddings.
3.1 Cascading Context
As mentioned before, the cascading context of a node in a cascade is supposed to capture its temporal relation to other nodes in that cascade, which includes the potential influence imposed by other nodes and their temporal infection order. There are three factors we have to consider in the definition of cascading context. First, the infection of a node is intuitively caused by the potential influence of all the nodes infected before it, and this influence declines over time. Second, the cascading context should be specific to a cascade, as one node might have different cascading contexts in different cascades. Finally, in the same cascade, the infection of one node can be influenced neither by the nodes that are infected after it, nor by the nodes that are not infected at all. Based on these ideas, we define the cascading context as follows:
Definition 1
(Cascading Context): Given the set of cascades on a social network of nodes, , the cascading context of the nodes involved in cascade () is defined as a matrix . The entry at the th row and the th column of represents the potential influence from node to , which is defined as
(1) 
where is the infection time of in cascade and is the decaying factor. The cascading context of node in cascade is defined as the row vector .
As we will see later, will be fed into our model as it quantitatively captures ’s temporal relation (including the influence and the relative infection position) to the other nodes in a cascade .
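The three factors above can be sketched in code. Since the exact decay form of Eq. (1) is not recoverable from this text, the snippet assumes an exponential decay of influence over the time gap, which is consistent with "influence declines over time"; the function name and toy parameters are hypothetical:

```python
import math

def cascading_context(cascade, nodes, decay=1.0):
    """Cascading context matrix X for one cascade.

    X[i][j] holds the assumed influence from an earlier-infected node j on
    node i: exp(-decay * (t_i - t_j)) if j was infected before i in this
    cascade, and 0 otherwise (later-infected and uninfected nodes exert
    no influence).  Exponential decay is an assumption for illustration.
    """
    idx = {v: k for k, v in enumerate(nodes)}
    n = len(nodes)
    X = [[0.0] * n for _ in range(n)]
    for i, ti in cascade.items():
        for j, tj in cascade.items():
            if tj < ti:  # only nodes infected strictly earlier influence i
                X[idx[i]][idx[j]] = math.exp(-decay * (ti - tj))
    return X
```

The row of X for a node is then its cascading context vector in that cascade, as in Definition 1.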
3.2 Cascading Affinity
As mentioned before, the cascading affinity of two nodes measures their similarity with respect to the cascades, which can be defined in terms of their co-occurrences in historical cascades as follows:
Definition 2
(Cascading Affinity): Given the set of cascades on a social network of nodes, i.e., , the cascading affinity of two nodes and is represented by the entry at the th row and the th column of the cascading affinity matrix , which is defined as the ratio of the cascades involving both and , i.e.,
(2) 
Definition 2 tells us that for two given nodes, the more cascades involve both of them, the higher their cascading affinity, and intuitively the more similar their preferences for the contents of cascades. In this sense, the cascading affinity of two nodes implies how close their embeddings should be in the latent space.
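Definition 2 can be sketched as follows. The text calls the affinity "the ratio of the cascades involving both" nodes; normalizing the co-occurrence count by the total number of cascades M is an assumption here, and the function name is hypothetical:

```python
def cascading_affinity(cascades, nodes):
    """Affinity matrix A with A[i][j] = (#cascades containing both i and j) / M.

    `cascades` is a list of node sets; normalizing by the total number of
    cascades M is an assumed reading of "ratio" in Definition 2.
    """
    M = len(cascades)
    idx = {v: k for k, v in enumerate(nodes)}
    n = len(nodes)
    A = [[0.0] * n for _ in range(n)]
    for c in cascades:
        members = [v for v in c if v in idx]
        for i in members:
            for j in members:
                if i != j:  # affinity is defined between distinct nodes
                    A[idx[i]][idx[j]] += 1.0 / M
    return A
```

The resulting matrix is symmetric, which is what the Laplacian-based loss in Section 4 expects.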
4 Deep Collaborative Embedding
In this paper, we propose an autoencoder-based Deep Collaborative Embedding (DCE) model, which can learn embeddings for the nodes in a given social network, based on the cascades observed on the network, so that the learned embeddings can be used for cascade prediction without knowing the underlying diffusion mechanisms and the explicit diffusion networks. In this section, we first present the architecture of the Deep Collaborative Embedding (DCE) model in detail, and then we describe the objective function and the learning of DCE.
4.1 Architecture of DCE
The architecture of DCE is shown in Fig. 1. As we can see from Fig. 1, DCE learns the embeddings through two collaborations, the cascade collaboration and the node collaboration. With the cascade collaboration, DCE generates the resulting embedding for a node by collaboratively encoding its cascading contexts, (). At first, DCE learns intermediate embeddings for by autoencoders, each of which corresponds to a cascade. The autoencoder for cascade () takes 's cascading context in that cascade as input, and then generates the intermediate embedding of in cascade , , through its encoder part consisting of nonlinear hidden layers defined by the following equations:
(3)  
where is the output vector of the th hidden layer of the th autoencoder taking as input, is the parameter matrix of that layer, and is the corresponding bias.
Finally, the resulting embedding is generated by fusing the intermediate embeddings () through the following nonlinear mappings:
(4)  
Symmetrically, the decoder part of the autoencoder for cascade is defined by the following equations:
(5)  
In the above Equations (3), (4), and (5), the parameter matrices and , and the bias vectors and are the parameters that will be learned from training data. At the same time, with the node collaboration, DCE can concurrently embed the nodes into the latent space, by which the similarity between nodes in the social network can be captured in the learned embeddings. Particularly, to regulate the closeness between any two embeddings and , DCE imposes the constraints of the cascading affinity and structural proximity between and via Laplacian Eigenmaps, which will be described in detail in the next subsection.
4.2 Optimization Objective of DCE
4.2.1 Loss Function for Cascade Collaboration
At first, as described in the last subsection, the autoencoders defined by Equations (3), (4), and (5) fulfill the cascade collaboration for embedding by reconstructing its cascading contexts . The optimization objective for this part is to minimize the reconstruction error between and , for which the loss function is defined as follows:
(6)  
where and are the original cascading context matrix and the reconstructed cascading context matrix of cascade , respectively, which are defined in Definition 1.
The cascading context vectors are often sparse, which may lead to undesired vectors in the embeddings and the reconstructed if the sparse vectors are straightforwardly fed into DCE. To overcome this issue, inspired by the idea used in existing works Wang et al. (2016); Zhang et al. (2017), which assign more penalty (corresponding to a larger weight) to the loss incurred by nonzero elements than to that incurred by zero elements, the loss can be redefined as
(7)  
where denotes the Hadamard product, and the th column vector of the matrix is the weight vector assigned to cascading context . An entry if , otherwise .
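The weighted reconstruction loss can be sketched as follows. The weight value b for nonzero entries is a hyperparameter; b = 5.0 below is an arbitrary illustrative choice, and the function name is hypothetical:

```python
import numpy as np

def weighted_recon_loss(X, X_hat, b=5.0):
    """Eq. (7)-style loss: squared error with weight b (> 1) on nonzero entries.

    B has entry b where X is nonzero and 1 elsewhere, so reconstruction
    errors on observed (nonzero) context entries are penalized more heavily
    than errors on zeros; b = 5.0 is an assumed value for illustration.
    """
    B = np.where(X != 0, b, 1.0)           # penalty-weight matrix
    return np.sum(((X_hat - X) * B) ** 2)  # Hadamard product, then squared norm
```

This is the standard remedy for sparse reconstruction targets: without the weights, a trivial all-zero reconstruction would already achieve a small loss.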
4.2.2 Loss Functions for Node Collaboration
Next we introduce the loss functions for node collaboration. As mentioned in the last subsection, through the node collaboration the embeddings will preserve the cascading affinity of nodes in cascades and the structural proximity of nodes in the social network. Following the idea of Laplacian Eigenmaps, we weight the similarity between two embeddings with the cascading affinity of their corresponding nodes, which leads to the following loss function:
(8) 
where is the cascading affinity between and defined in Equation (2). The insight of Equation (8) is that a penalty is imposed when two nodes with high cascading affinity are located far apart in the latent space.
Similarly, we also weight the similarity between two embeddings with the structural proximity of their corresponding nodes, which leads to the following loss function:
(9) 
where is the structural proximity between and in the social network. Note that the definition of is flexible; theoretically, node structural proximity of any order can be used for . In this paper, we employ the first-order proximity Tang et al. (2015) to define . To be more specific, if and are connected by a link in the network, and otherwise.
Let be the Laplacian matrix of the cascading affinity matrix , i.e., , where is diagonal and . Let be the structural proximity matrix whose entry at the th row and th column is , and similarly, let be its Laplacian matrix, i.e., , where is also diagonal and . Then we can rewrite Equations (8) and (9) in their matrix forms:
(10) 
and
(11) 
where is the embedding matrix whose th column is .
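The equivalence between the pairwise sums of Equations (8) and (9) and the trace forms of Equations (10) and (11) can be checked numerically. Depending on whether the pairwise loss sums over ordered or unordered node pairs, it equals twice or exactly the trace; the sketch below (toy data, numpy) verifies the ordered-pair version:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 3
# Symmetric toy affinity matrix with zero diagonal.
A = rng.random((n, n)); A = (A + A.T) / 2; np.fill_diagonal(A, 0)
H = rng.normal(size=(d, n))            # embedding matrix; column i is node i's embedding
L = np.diag(A.sum(axis=1)) - A         # graph Laplacian L = D - A

# Pairwise form: affinity-weighted squared distances between embeddings.
pairwise = sum(A[i, j] * np.sum((H[:, i] - H[:, j]) ** 2)
               for i in range(n) for j in range(n))
# Matrix (trace) form as in Eq. (10).
trace_form = np.trace(H @ L @ H.T)
```

The identity pairwise = 2 tr(H L H^T) is the standard Laplacian Eigenmaps fact that makes the matrix form convenient for gradient computation.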
4.2.3 The Complete Loss Function
By combining , , and , we can define the complete loss function of DCE as follows:
(12) 
where is an L2-norm regularization term to avoid overfitting, and , , and are nonnegative parameters used to control the contributions of the terms.
4.3 Learning of DCE
The DCE model can be learned using Stochastic Gradient Descent (SGD), with gradients given by the following equations:
(13) 
(14) 
(15) 
(16) 
where the partial derivatives on the right side of the equations can be computed using backpropagation.
The learning process is given in Algorithm 1. Note that in each iteration, the parameters are updated (Line 5) once the embeddings , are concurrently generated (Line 3). Such a concurrent embedding scheme ensures that the cascading context is encoded into the embeddings while the cascading affinity and the structural proximity of nodes are preserved at the same time.
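A minimal sketch of the node-collaboration part of the gradient updates: plain gradient descent on the two Laplacian penalty terms alone, using the closed-form gradient of the trace. The reconstruction term and backpropagation through the autoencoders are omitted, and the toy matrices, weights, and learning rate are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 2
# Toy cascading affinity (symmetric, zero diagonal) ...
A = rng.random((n, n)); A = (A + A.T) / 2; np.fill_diagonal(A, 0)
# ... and toy first-order structural proximity (symmetric 0/1, zero diagonal).
S = (rng.random((n, n)) > 0.5).astype(float); S = np.maximum(S, S.T)
np.fill_diagonal(S, 0)

def laplacian(W):
    return np.diag(W.sum(axis=1)) - W

L_a, L_s = laplacian(A), laplacian(S)
alpha, beta, lr = 0.5, 0.5, 0.01       # assumed trade-off weights and step size
H = rng.normal(size=(d, n))            # columns are node embeddings

def loss(H):
    return alpha * np.trace(H @ L_a @ H.T) + beta * np.trace(H @ L_s @ H.T)

losses = [loss(H)]
for _ in range(50):
    # d/dH tr(H L H^T) = 2 H L for symmetric L.
    grad = 2 * H @ (alpha * L_a + beta * L_s)
    H = H - lr * grad
    losses.append(loss(H))
```

In the full algorithm these updates run jointly with backpropagation through the autoencoders, so the embeddings are pulled toward reconstructing the cascading contexts and toward the Laplacian constraints at the same time.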
5 Experiments
In this section, we present the details of experiments conducted on real-world datasets. The experiments include two parts, the tuning of the hyperparameters and the verification of DCE. Particularly, to verify the effectiveness of DCE, we check whether the embeddings learned by DCE improve the performance of information cascade prediction on the real-world datasets.
5.1 Settings
5.1.1 Datasets
We verify the effectiveness of our method through experiments conducted on three real datasets, Digg, Twitter, and Weibo, which are described as follows:
Digg is a website where users can submit stories and vote for the stories they like Hogg and Lerman (2012). The dataset extracted from Digg contains 3,553 stories, 139,409 users, and 3,018,197 votes with timestamps. A vote for a story is treated as an infection of that story, and the votes for the same story constitute a cascade. In addition, a social link exists between two users if one of them is watching or is a fan of the other one.
Twitter is a social media network which offers microblog service Weng et al. (2013). The dataset extracted from Twitter comprises 510,795 users and 12,054,205 tweets with timestamps, where each tweet is associated with a hashtag. If the hashtag is adopted in one user’s tweet, we consider it infects that user. The tweets sharing the same hashtag are treated as a cascade, and 1,345,913 cascades are contained in the dataset. In addition, the users are linked by their following relationships.
Weibo is a Twitter-like social network Zhang et al. (2013). The dataset extracted from Weibo contains 1,340,816 users and their 31,444,325 tweets with timestamps. A retweeting action of a user is viewed as an infection of the retweeted tweet to that user. The retweetings of the same tweet constitute a cascade, and the dataset contains 232,978 cascades of different tweets. The users in the Weibo network are also connected by following relationships.
The statistics of the datasets are summarized in Table 2. On each dataset, we randomly select 60% of the total cascades as training set, 20% as validating set, and the remaining 20% as testing set.
Dataset  #Nodes  #Links  Avg. Degree  #Cascades  #Infections  Avg. Cascade Length

Digg  139,409  1,731,658  12.4  3,553  3,018,197  849.5
Twitter  510,795  14,273,311  27.9  1,345,913  12,054,205  9.0
Weibo  1,340,816  308,489,739  230.1  232,978  31,444,325  135.0
5.1.2 Baselines
In order to demonstrate the effectiveness of DCE, we compare it with the following baseline methods:
NetRate NetRate is a generative cascade model that exploits the infection times of nodes without assumptions on the network structure Rodriguez et al. (2011). It models the information diffusion process as discrete networks of continuous temporal processes occurring at different rates, and then infers the edges of the global diffusion network and estimates the transmission rate of each edge that best explains the observed data.
CDK CDK maps nodes participating in information cascades to a latent representation space using a heat diffusion process Bourigault et al. (2014). It treats learning diffusion as a ranking problem and learns heat diffusion kernels that define, for each node of the network, its likelihood of being reached by the diffusing content, given the initial source of diffusion. Here we adopt the without-content version of CDK, considering that the other baselines and our approach are not designed to deal with diffusion content.
TopoLSTM TopoLSTM uses a directed acyclic graph (DAG) as the diffusion topology to explore the diffusion structure of cascades, rather than regarding a cascade merely as a sequence of nodes ordered by their infection timestamps Wang et al. (2017). It then feeds the dynamic DAGs into an LSTM-based model to generate topology-aware embeddings for nodes. The infection probability at each time step is computed from the embeddings.
EmbeddedIC EmbeddedIC is a representation learning technique for inference of Independent Cascade (IC) model Bourigault et al. (2016). EmbeddedIC can embed users in cascades into a latent space and infer the diffusion probability between users based on the relative positions of the users in the latent space.
DCEC DCEC is a special version of the proposed DCE, where the node collaborations of cascading affinity and structural proximity are removed while only the cascade collaboration of cascading contexts is kept.
5.2 Cascade Prediction
In this paper, we evaluate the learned embeddings by applying them to the task of information cascade prediction, the details of which are described as follows.
For a testing cascade , given a set of seed nodes which are infected before, we predict the infection probabilities of the remaining nodes and their infection order. To be more specific, the size of the seed set will be of the total number of the nodes. Let be the set of nodes that are predicted before time step , and then the probability that one node will be infected at is
(17) 
where is the probability that is infected by . Our idea of computing is based on the similarity between the embeddings, which is defined as
(18) 
where and are the embedding vectors of nodes and , respectively, and the similarity is measured by Euclidean distance. For each uninfected node , its infection probability can be computed according to Equation (17), and we can obtain a list of the nodes in descending order of their infection probabilities. Comparing with the ground truth , we can evaluate the performance of the prediction with two metrics, Mean Average Precision (MAP) and orderPrecision.
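The prediction step can be sketched as follows. Since the exact forms of Equations (17) and (18) are not recoverable from this text, the snippet assumes exp(-distance) as the pairwise embedding similarity and a mean over the infected seeds as the aggregation; both choices and all names are hypothetical:

```python
import numpy as np

def infection_scores(embeddings, seeds, candidates):
    """Rank uninfected candidate nodes by embedding similarity to infected seeds.

    Assumptions for illustration: pairwise similarity exp(-||u - v||) as a
    decreasing function of Euclidean distance, and the mean over seeds as
    the per-node aggregation.  Returns candidates in predicted infection order.
    """
    scores = {}
    for v in candidates:
        sims = [np.exp(-np.linalg.norm(embeddings[v] - embeddings[u]))
                for u in seeds]
        scores[v] = float(np.mean(sims))
    return sorted(scores, key=scores.get, reverse=True)
```

Whatever the exact form of Eq. (17), the essential property used for evaluation is only the induced ranking: nodes whose embeddings lie closer to the already-infected set are predicted to be infected earlier.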
As a metric originating from information retrieval, MAP can evaluate the prediction of information cascades by taking positions of nodes in the predicting list into consideration. We first define the top precision of as the hit rate of the first nodes of over the ground truth, i.e.,
(19) 
where is the set of first nodes of . Then based on , we can define the average precision of as
(20) 
where denotes the rank of node in and is the top precision of . From Equations (19) and (20) we can see that will be low if too many nodes that occur in rank low in . Moreover, we set the size of the predicted list in {100, 300, 500, 700, 900} to compute among the first nodes. Finally, can be defined as the average of over the testing set , i.e.,
(21) 
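The Prec@k and AP computations described above can be sketched with the standard information-retrieval formulation, which the text's description matches; the paper's exact normalization of AP is assumed to coincide with it:

```python
def precision_at_k(pred, truth, k):
    """Hit rate of the first k predicted nodes over the ground truth (Eq. (19))."""
    return len(set(pred[:k]) & set(truth)) / k

def average_precision(pred, truth, K):
    """AP over the top-K predicted list (Eq. (20)): the average of
    precision_at_k at the ranks where ground-truth nodes appear.

    Normalizing by the number of ground-truth nodes retrieved in the top K
    is an assumption; MAP is then the mean of AP over all testing cascades.
    """
    hits = [precision_at_k(pred, truth, r + 1)
            for r, v in enumerate(pred[:K]) if v in truth]
    return sum(hits) / len(hits) if hits else 0.0
```

As the text notes, a ground-truth node buried deep in the predicted list drags down every precision term at and beyond its rank, so AP rewards ranking true infections early.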
To evaluate the prediction of infection order, we propose a new metric, orderPrecision, which is defined as
(22) 
where is the true infection time of and is the predicted one, and and denote the sets of nodes infected before node in the ground-truth list and the predicted list, respectively. The idea of Equation (22) is that the more similar the relative orders of nodes in and are, the higher the orderPrecision of . First, to evaluate the similarity of node 's relative orders in and , we consider a heuristic indicator: the number of nodes that are infected before node and shared by and , i.e., ; the larger this number is, the more similar the relative orders are. Then we obtain the relative order similarity for one single testing cascade by taking the average over all nodes shared by and . Finally, the overall orderPrecision is the average of the relative order similarities over all testing cascades in .
5.3 Hyperparameter Tuning
In this subsection, we investigate the hyperparameters and in Equation (12) on the validation set, which control the influence of the cascading affinity and the structural proximity on the embedding learning, respectively.
For simplicity, we fix and adopt a grid search in the range of with a step size of 0.2 to determine the optimal values of and . Fig. 4, Fig. 4, and Fig. 4 show the results of MAP and orderPrecision over different combinations of and on the three datasets. Through a comprehensive comparison, we can find that, in most cases, the MAPs and orderPrecisions at nonzero and are better than those at zero and . Taking Fig. 4 (a) as an instance, the MAP value at (0.6, 0.8) is 0.8835, which is higher than 0.8703 at (0.0, 0.0). This verifies that appropriately applying cascading affinity and structural proximity as constraints can improve the learned embeddings for information cascade prediction. The combination of and at which the sum of MAP and orderPrecision achieves the highest value is chosen for the remaining experiments. Based on this criterion, we set (, ) as (0.1, 0.9) for Digg, (0.6, 0.8) for Twitter, and (0.8, 0.2) for Weibo.
5.4 Effectiveness
In this section, we analyze the experimental results for the tasks of infection prediction and infection order prediction, which are presented in Table 3 and Figure 5, respectively.
5.4.1 Infection Prediction
Dataset  Method  MAP@ (%)  

@100  @300  @500  @700  @900  
Digg  NetRate  1.108  5.749  10.933  16.618  24.043 
CDK  27.951  39.766  52.032  65.220  80.408  
EmbeddedIC  2.084  9.073  23.314  47.249  78.066  
TopoLSTM  2.444  17.535  25.812  42.779  69.534  
DCEC  32.356  55.308  63.546  66.823  86.879  
DCE  47.497  72.952  76.694  84.250  91.362  
NetRate  0.140  2.550  6.724  15.058  30.572  
CDK  9.512  22.724  34.701  48.162  63.315  
EmbeddedIC  0.751  4.740  12.568  24.985  43.347  
TopoLSTM  0.665  5.084  13.681  26.083  42.050  
DCEC  15.983  27.846  37.427  53.617  65.858  
DCE  16.376  29.773  40.690  56.301  69.863  
NetRate  0.469  2.696  7.724  15.280  25.583  
CDK  1.124  11.510  25.348  41.810  54.429  
EmbeddedIC  0.185  3.988  9.706  18.965  30.738  
TopoLSTM  0.005  0.268  2.204  7.084  19.774  
DCEC  3.466  28.526  52.084  62.684  71.339  
DCE  10.506  30.986  53.555  64.533  72.746 
Table 3 gives the MAPs of different methods on the infection prediction task, with the best result in each case boldfaced. From Table 3 we can make the following observations:

The proposed DCEC and DCE always outperform all baselines, giving relative improvements over the best baselines from (Twitter, MAP@500) to (Digg, MAP@300) across all datasets. We can also find that DCE achieves better results than DCEC in every case, which shows that by using the node collaborations as constraints, DCE can better characterize the relations between nodes, which are important in information cascades.

The results show that, by collaboratively mapping the nodes into a latent space with a deep architecture, DCE can better capture deep and nonlinear features of nodes in information cascades than NetRate, which estimates infection probabilities directly with a shallow probabilistic model.

In contrast with the embedding baselines CDK, EmbeddedIC, and TopoLSTM, DCE's deep collaborative embedding architecture can better preserve the cascading characteristics and structural properties of nodes, which are crucial for infection prediction. Unlike CDK, which unrealistically assumes that information diffusion is driven by the relations between the source node and the others, in DCE all infected nodes are considered to have potential influence on the not-yet-infected ones, and the cascading context is employed to model their temporal relations. And since DCE makes no assumption about the underlying diffusion mechanism, it can better utilize the cascading contexts of nodes than EmbeddedIC, which is based on the IC model. Compared with TopoLSTM, which also adopts a deep model, DCE does not rely on knowledge of the underlying diffusion network, which is usually difficult to obtain.
5.4.2 Infection Order Prediction
Figure 5 presents the orderPrecisions of different methods for the infection order prediction task, from which we can make the following observations:

We can see that the proposed DCEC and DCE achieve the best performance on all three datasets. The reason is that, with the proposed cascading context, DCE not only better preserves the temporal relations but also better captures the infection order characteristics in information cascades than the baselines. DCE's superior results over DCEC reveal that, even though cascading affinity and structural property do not explicitly indicate nodes' infection orders, they lead to further improvements when used as constraints in DCE.

To be more specific, NetRate is incapable of capturing infection order features with its shallow probabilistic model. While CDK exploits a heat diffusion kernel to formulate a ranking problem, in which infection orders are partially modeled, it cannot fully characterize node infection order features like the proposed cascading context in DCE. As for EmbeddedIC, nodes' infection orders receive no attention in this IC-based model and certainly cannot be captured, which results in its poor performance. Although TopoLSTM's adoption of diffusion topology can encode nodes' infection orders to some extent, it still depends on the underlying diffusion network, which is not always available.
6 Related Work
In this section, we briefly review two lines of work related to our research: network embedding and information cascade prediction.
6.1 Network Embedding
With the wide employment of embedding methods in various machine learning tasks
Mikolov et al. (2013a, b); Pota et al. (2019); Esposito et al. (2020); Deng et al. (2020), network embedding has also gained increasing attention and application Cui et al. (2017); Cai et al. (2018). Network embedding refers to assigning the nodes of a network low-dimensional representations that effectively preserve the network structure Cui et al. (2017). Intuitively, nodes can be represented by their corresponding row or column feature vectors in the adjacency matrix of the network. However, these vectors are often sparse and high-dimensional, which brings challenges to machine learning tasks. As a result, a set of traditional network embedding methods Tenenbaum et al. (2000); Roweis and Saul (2000); Belkin and Niyogi (2001); Cox and Cox (2001) were proposed, mainly for dimension reduction. Nevertheless, these methods only work well on networks of relatively small size and suffer from high computation costs when coping with online social networks with huge numbers of nodes.

Recent works like DeepWalk Perozzi et al. (2014) and LINE Tang et al. (2015) instead learn low-dimensional node representations through an optimization process rather than by directly transforming the original feature vectors, which also handles the scaling problem well. Inspired by word2vec Mikolov et al. (2013a, b), DeepWalk treats the nodes of a network as the words of a natural language and utilizes random walks to generate node sequences, from which node representations are learned following the procedure of word2vec. As a more general version of DeepWalk, node2vec Grover and Leskovec (2016) uses biased random walks to control the generation of node contexts more flexibly. LINE produces node embeddings that are expected to preserve both the first-order and second-order proximities of the network neighborhood structure. Influenced by these researches, a collection of network embedding methods have been proposed for different scenarios.
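The context-generation step that DeepWalk borrows from language modelling can be sketched as follows. This is a toy stdlib-only illustration on a hypothetical adjacency dict; in DeepWalk the resulting walks would then be fed to a skip-gram learner such as word2vec.

```python
import random

def random_walks(adj, num_passes=2, walk_len=5, seed=0):
    """DeepWalk-style truncated random walks: each walk is a 'sentence' of
    node ids; adj maps each node to the list of its neighbours."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_passes):
        starts = list(adj)
        rng.shuffle(starts)        # DeepWalk shuffles start nodes each pass
        for start in starts:
            walk = [start]
            while len(walk) < walk_len:
                nbrs = adj[walk[-1]]
                if not nbrs:
                    break          # dead end: truncate the walk
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks

# toy undirected graph as an adjacency dict
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks = random_walks(adj)
```

Each pass starts one walk per node, so the corpus of "sentences" covers the whole graph while remaining cheap to generate for large networks.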
For instance, Swami et al. (2017) modifies DeepWalk for heterogeneous networks by introducing meta-path based random walks, and Xu et al. (2017) incorporates a harmonious embedding matrix to further embed the embeddings that only encode intra-network edges. As deep neural networks have shown remarkable effectiveness in many machine learning tasks, a series of works has also emerged that perform network embedding with a deep model. For example, Wang et al. (2016) adopts a semi-supervised deep autoencoder model that jointly exploits the first-order and second-order proximities to preserve the network structure. Liao et al. (2018) learns node representations that keep both structural proximity and attribute proximity with a designed multi-layered perceptron framework. And in Chang et al. (2015), the researchers use a highly nonlinear multi-layered embedding function to capture the complex interactions between heterogeneous data in a network.

However, most of these network embedding methods Xie et al. (2019); Goyal et al. (2020); Huang et al. (2019) are not applicable to information cascade prediction. In our work, we employ an autoencoder-based collaborative embedding architecture to learn embeddings from nodes' cascading contexts under constraints.
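The autoencoder-style embedding idea behind these deep models can be illustrated with a toy linear sketch (not any of the cited architectures): each node's adjacency row is encoded into a low-dimensional vector and decoded back, so nodes with similar neighbourhoods (second-order proximity) obtain similar embeddings.

```python
import numpy as np

def embed_autoencoder(A, dim=2, lr=0.02, epochs=500, seed=0):
    """Toy linear autoencoder over adjacency rows, trained by plain
    gradient descent on the mean squared reconstruction error."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    W_enc = rng.normal(scale=0.1, size=(n, dim))  # encoder weights
    W_dec = rng.normal(scale=0.1, size=(dim, n))  # decoder weights
    errors = []
    for _ in range(epochs):
        H = A @ W_enc                   # node embeddings
        A_hat = H @ W_dec               # reconstructed adjacency rows
        errors.append(np.mean((A_hat - A) ** 2))
        G = 2.0 * (A_hat - A) / A.size  # gradient of the mean squared error
        W_enc -= lr * A.T @ (G @ W_dec.T)
        W_dec -= lr * H.T @ G
    return A @ W_enc, errors

# adjacency matrix of a small undirected graph
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Z, errors = embed_autoencoder(A)
```

Real methods such as the ones cited above use deep nonlinear encoders and extra proximity-preserving loss terms, but the reconstruction objective is the same in spirit.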
6.2 Information Cascade Prediction
Information cascade phenomena have been widely investigated in the contexts of epidemiology, physical science, and social science, and the development of online social networks has greatly promoted related researches Chou and Chen (2018); Li et al. (2018); Varshney et al. (2017). Most early researches Kempe and Kleinberg (2003) analyse information cascades based on fixed models, the representatives of which are the Independent Cascade (IC) model Goldenberg et al. (2001) and the Linear Threshold (LT) model Granovetter (1978). The classic IC model treats the diffusion of information as cascades, while the LT model determines the infection of a user according to a threshold on the influence pressure incoming from the neighborhood. Both can be unified into the same framework Kempe and Kleinberg (2003), and a series of extensions has been proposed Saito et al. (2008, 2010, 2009); Guille and Hacid (2012); Wang et al. (2012); Ding et al. (2019); Gursoya and Gunnec (2018). For example, Saito et al. (2010) extends the IC model into a generative model that takes time delay into consideration. However, real information diffusion processes are so complicated that we seldom know exactly the underlying mechanisms by which information diffuses. Moreover, these works are often based on the assumption that the explicit paths along which information propagates between nodes are observable, which is difficult to satisfy.
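As a minimal sketch of the IC dynamics just described (toy graph and fixed seed, all names hypothetical): every node infected in one round gets exactly one chance to infect each uninfected neighbour with probability p, and the process stops when no new node is infected.

```python
import random

def simulate_ic(graph, p, seeds, seed=0):
    """One realisation of the Independent Cascade (IC) model: each newly
    infected node has a single chance to infect each uninfected neighbour
    independently with probability p."""
    rng = random.Random(seed)
    infected = set(seeds)
    frontier = list(seeds)
    while frontier:
        newly = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in infected and rng.random() < p:
                    infected.add(v)
                    newly.append(v)
        frontier = newly          # only new infections try again next round
    return infected

# toy directed diffusion graph
graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
```

An LT simulation would differ only in the inner test: a node becomes infected once the summed influence weights of its infected neighbours exceed its threshold.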
A collection of methods has been proposed to infer the most probable links that best explain the observed diffusion cascades without knowing the explicit paths. For instance, NetInf Gomez-Rodriguez et al. (2011) and ConNIe Myers and Leskovec (2010) use greedy algorithms to find a fixed number of links between users that maximize the likelihood of a set of observed diffusions under an IC-like diffusion hypothesis. A more general framework called NetRate Rodriguez et al. (2011), which also appears as a baseline in our experiments, models the information diffusion process as discrete networks of continuous temporal processes occurring at different rates, and then infers the edges of the global diffusion network and estimates the transmission rate of each edge that best explains the observed data Rodriguez et al. (2011). Further variants of this framework have also been proposed Gomez-Rodriguez and Leskovec (2013); Wang et al. (2014). However, most of these works still rely on the assumption that information diffusion follows a parametric model.
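To make the continuous-time formulation concrete, here is a simplified sketch of a per-cascade log-likelihood under pairwise exponential transmission, in the NetRate style. It is an illustration only: the survival terms for nodes that are never infected, which the full NetRate objective includes, are omitted for brevity.

```python
import math

def cascade_loglik(times, alpha):
    """Log-likelihood of one cascade under pairwise exponential transmission:
    every node u infected before v survives without infecting v until t_v
    (survival term), and v's infection hazard at t_v is the sum of the rates
    alpha[u][v] of its potential parents. times maps node -> infection time."""
    ll = 0.0
    for v, tv in times.items():
        parents = [(u, tu) for u, tu in times.items() if tu < tv]
        if not parents:
            continue                  # the cascade source has no transmission term
        ll -= sum(alpha[u][v] * (tv - tu) for u, tu in parents)  # survival
        ll += math.log(sum(alpha[u][v] for u, _ in parents))     # hazard
    return ll
```

Maximizing such a likelihood over the rates alpha, summed across observed cascades, is what recovers the latent diffusion edges: rates driven to zero correspond to absent links.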
In recent years, a set of researches Bourigault et al. (2014); Gao et al. (2017); Bourigault et al. (2016); Wang et al. (2017); Qiu et al. (2018) that adopt network embedding techniques for information cascade prediction has been proposed. These methods usually embed nodes in a latent feature space, and the diffusion probabilities between nodes are then computed from their positions in that space. CDK, proposed in Bourigault et al. (2014), treats information diffusion as a ranking problem and maps nodes to a latent space using a heat diffusion process. However, it assumes that the infection order of a cascade is determined by the relations between the source node and the other nodes, which is not realistic. Bourigault et al. (2016) follows the mechanism of the IC model to embed the users in cascades into a latent space. Wang et al. (2017) feeds dynamic directed acyclic graphs into an LSTM-based model to generate topology-aware embeddings for nodes, which depends heavily on network structure information. In contrast, our proposed method DCE collaboratively embeds the nodes into a latent space with a deep architecture, without requiring knowledge of the underlying diffusion mechanisms or the explicit diffusion paths on the network structure.
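The pattern shared by these embedding approaches, mapping nodes to latent vectors and turning distance into an infection probability, can be sketched as below. The logistic-of-squared-distance form is just one illustrative choice; the exact parameterisation differs from method to method.

```python
import math

def diffusion_prob(z_u, z_v):
    """Illustrative infection probability from latent positions: the closer
    node v lies to an infected node u in the embedding space, the higher
    the probability that u infects v (here, a logistic function of the
    squared Euclidean distance)."""
    d2 = sum((a - b) ** 2 for a, b in zip(z_u, z_v))
    return 1.0 / (1.0 + math.exp(d2))

# a nearby node is more likely to be infected than a distant one
p_near = diffusion_prob([0.0, 0.0], [0.1, 0.0])
p_far = diffusion_prob([0.0, 0.0], [2.0, 0.0])
```

Because the probability is monotone in distance, ranking candidate nodes by their distance to the infected set also yields a predicted infection order, which is the view DCE builds on.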
7 Conclusions
In this paper, we address the problem of information cascade prediction in online social networks with network embedding techniques. We propose a novel model called Deep Collaborative Embedding (DCE), which can learn embeddings for not only infection prediction but also infection order prediction in a cascade, without requiring knowledge of the underlying diffusion mechanisms and the diffusion network. We propose an autoencoder-based collaborative embedding architecture to generate embeddings that simultaneously preserve the structural properties and the cascading characteristics of nodes. The results of extensive experiments conducted on real datasets verify the effectiveness of the proposed method.
Acknowledgment
This work is supported by the National Natural Science Foundation of China under grant 61972270, and in part by NSF under grants III-1526499, III-1763325, III-1909323, CNS-1930941, and CNS-1626432.
References
 Laplacian eigenmaps and spectral techniques for embedding and clustering. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, pp. 585–591. Cited by: §6.1.
 Learning social network embeddings for predicting information diffusion. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pp. 393–402. Cited by: §1, §5.1.2, §6.2.
 Representation learning for information diffusion through social networks: an embedded cascade model. In Proceedings of the 9th ACM International Conference on Web Search and Data Mining, pp. 573–582. Cited by: §1, §5.1.2, §6.2.
 A comprehensive survey of graph embedding: problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering 30 (9), pp. 1616–1637. Cited by: §6.1.
 Heterogeneous network embedding via deep architectures. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 119–128. Cited by: §1, §6.1.
 Can cascades be predicted?. In Proceedings of the 23rd International Conference on World Wide Web, pp. 925–936. Cited by: §1.
 Learning multiple factors-aware diffusion models in social networks. IEEE Transactions on Knowledge and Data Engineering 30 (7), pp. 1268–1281. Cited by: §6.2.
 Multidimensional scaling. Journal of the Royal Statistical Society 46 (2), pp. 1050–1057. Cited by: §6.1.
 A survey on network embedding. IEEE Transactions on Knowledge and Data Engineering 31 (5), pp. 833–852. Cited by: §6.1.
 Lowrank local tangent space embedding for subspace clustering. Information Sciences 508, pp. 1–21. Cited by: §6.1.
 Influence maximization based on the realistic independent cascade model. Knowledge-Based Systems, pp. 105265. Cited by: §6.2.
 Hybrid query expansion using lexical resources and word embeddings for sentence retrieval in question answering. Information Sciences 514, pp. 88–105. Cited by: §6.1.
 A novel embedding method for information diffusion prediction in social network big data. IEEE Transactions on Industrial Informatics 13 (4), pp. 2097–2105. Cited by: §1, §6.2.
 Talk of the network: a complex systems look at the underlying process of word-of-mouth. Marketing Letters 12 (3), pp. 211–223. Cited by: §1, §6.2.
 Inferring networks of diffusion and influence. ACM Transactions on Knowledge Discovery from Data 5 (4), pp. 1019–1028. Cited by: §1, §6.2.
 Modeling information propagation with survival theory. In Proceedings of the 30th International Conference on Machine Learning, pp. 666–674. Cited by: §6.2.
 Dyngraph2vec: capturing network dynamics using dynamic graph representation learning. Knowledge-Based Systems 187, pp. 104816. Cited by: §6.1.
 Threshold models of collective behavior. American Journal of Sociology 83 (6), pp. 1420–1443. Cited by: §6.2.
 Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864. Cited by: §6.1.
 Information diffusion in online social networks: a survey. ACM SIGMOD Record 42 (2), pp. 17–28. Cited by: §1.
 A predictive model for the temporal dynamics of information diffusion in online social networks. In Proceedings of the 21st International Conference on World Wide Web, pp. 1145–1152. Cited by: §1, §6.2.
 Influence maximization in social networks under deterministic linear threshold model. Knowledge-Based Systems 161, pp. 111–123. Cited by: §6.2.
 Social dynamics of digg. EPJ Data Science 1 (1), pp. 1–26. Cited by: §5.1.1.
 Network embedding by fusing multimodal contents and links. Knowledge-Based Systems 171, pp. 44–55. Cited by: §6.1.
 Maximizing the spread of influence through a social network. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 137–146. Cited by: §6.2.
 DeepCas: an end-to-end predictor of information cascades. In Proceedings of the 26th International Conference on World Wide Web, pp. 577–586. Cited by: §1.
 Influence maximization on social graphs: a survey. IEEE Transactions on Knowledge and Data Engineering 30 (10), pp. 1852–1871. Cited by: §1, §6.2.
 Attributed social network embedding. IEEE Transactions on Knowledge and Data Engineering 30 (12), pp. 2257–2270. Cited by: §1, §6.1.
 Heterogeneous anomaly detection in social diffusion with discriminative feature discovery. Information Sciences 439–440, pp. 1–18. Cited by: §1.
 Efficient estimation of word representations in vector space. Computer Science. Cited by: §6.1, §6.1.
 Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems, pp. 3111–3119. Cited by: §6.1, §6.1.
 On the convexity of latent social network inference. In Proceedings of the 23rd International Conference on Neural Information Processing Systems, pp. 1741–1749. Cited by: §6.2.
 DeepWalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710. Cited by: §1, §6.1.
 Multilingual POS tagging by a composite deep architecture based on character-level features and on-the-fly enriched word embeddings. Knowledge-Based Systems 164, pp. 309–323. Cited by: §6.1.
 DeepInf: social influence prediction with deep learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2110–2119. Cited by: §6.2.
 The mathematical theory of infectious diseases and its applications. Journal of the Royal Statistical Society Series C 26 (1), pp. 85–87. Cited by: §1.
 Uncovering the temporal dynamics of diffusion networks. In Proceedings of the 28th International Conference on Machine Learning, pp. 561–568. Cited by: §5.1.2, §6.2.
 Nonlinear dimensionality reduction by locally linear embedding. Science 290 (5500), pp. 2323–2326. Cited by: §6.1.
 Generative models of information diffusion with asynchronous time-delay. In Proceedings of the 2nd Asian Conference on Machine Learning, Vol. 13, pp. 193–208. Cited by: §6.2.
 Prediction of information diffusion probabilities for independent cascade model. In Proceedings of the 12th International Conference on Knowledge-Based Intelligent Information and Engineering Systems, pp. 67–75. Cited by: §1, §1, §6.2.
 Learning continuoustime information diffusion model for social behavioral data analysis. In Asian Conference on Machine Learning: Advances in Machine Learning, pp. 322–337. Cited by: §6.2.
 Information-theoretic measures of influence based on content dynamics. In Proceedings of the 6th ACM International Conference on Web Search and Data Mining, pp. 3–12. Cited by: §1.
 Collaborative inference of coexisting information diffusions. In 2017 IEEE International Conference on Data Mining (ICDM), pp. 1093–1098. Cited by: §1.
 Metapath2vec: scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 135–144. Cited by: §6.1.
 LINE: large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077. Cited by: §1, §4.2.2, §6.1.
 A global geometric framework for nonlinear dimensionality reduction. Science 290 (5500), pp. 2319–2323. Cited by: §6.1.
 Predicting information diffusion probabilities in social networks: a Bayesian networks based approach. Knowledge-Based Systems 133, pp. 66–76. Cited by: §1, §6.2.
 Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1225–1234. Cited by: §1, §1, §4.2.1, §6.1.
 Topological recurrent neural network for diffusion prediction. In 2017 IEEE International Conference on Data Mining (ICDM), pp. 475–484. Cited by: §1, §5.1.2, §6.2.
 Feature-enhanced probabilistic models for diffusion network inference. In European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD '12, pp. 499–514. Cited by: §6.2.
 MMRate: inferring multi-aspect diffusion networks with multi-pattern cascades. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1246–1255. Cited by: §6.2.
 Virality prediction and community structure in social networks. Scientific Reports 3 (8), pp. 2522. Cited by: §5.1.1.
 TPNE: topology preserving network embedding. Information Sciences 504, pp. 20–31. Cited by: §1, §6.1.
 Embedding of embedding (EOE): joint embedding for coupled heterogeneous networks. In Proceedings of the 10th ACM International Conference on Web Search and Data Mining, pp. 741–749. Cited by: §6.1.
 Containment of rumor spread in complex social networks. Information Sciences 506, pp. 113–130. Cited by: §1.
 BL-MNE: emerging heterogeneous social network embedding through broad learning with aligned autoencoder. In 2017 IEEE International Conference on Data Mining (ICDM), pp. 605–614. Cited by: §4.2.1.
 Social influence locality for modeling retweeting behaviors. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pp. 2761–2767. Cited by: §5.1.1.
 IAD: interaction-aware diffusion framework in social networks. IEEE Transactions on Knowledge and Data Engineering 31 (7), pp. 1341–1354. Cited by: §1.