1 Introduction
The recent development of online social networks enabled researchers to suggest methods to explain and predict observations of diffusion across networks. Classical cascade models, which are at the heart of the research literature on information diffusion, regard the phenomenon of diffusion as an iterative process in which information transits from users to others in the network (Saito et al., 2008; Gomez-Rodriguez et al., 2011)
, by a so-called word-of-mouth phenomenon. In this setting, diffusion modeling corresponds to learning probability distributions of content transmission. Various cascade models have been proposed in the literature, each inducing its own learning process to explain observed diffusion episodes and attempting to extract the main dynamics of the network. However, most of these models rely on a strong Markovian assumption, for which the probabilities of the next infections
(throughout this paper, we use "infection" to denote the participation of a node of the network in the diffusion) only depend on who is already infected at the current step, not on the past trajectories of the diffused content. We claim that the history of the spread contains much valuable information that should be taken into account by the models. Past trajectories of the diffusion can give insights about the different natures of the contents. Also, the content may be changed during the diffusion, with different transformations depending on which nodes retransmit the information. On the other hand, some recent approaches rely on representation learning and recurrent neural networks (RNN) to predict the future spread of a diffusion given the past. A naive possibility would be to consider diffusion episodes as sequences of infections and propose temporal point process approaches to model the dynamics. Using the Recurrent Marked Temporal Point Process model
(Du et al., 2016), the current hidden state of the RNN would embed the history of the whole diffusion sequence, which would be used to output the next infected node and its time of infection. However, since diffusion episodes are not sequences but trees, naive recurrent methods usually fail to capture the true dynamics of the networks. Embedding the whole past in the state of a given node, rather than restricting it to the node's specific ancestor branch, leads to considering many independent and noisy events for future predictions. A model that considers the true diffusion paths would be more effective, by focusing on the useful past. If the true diffusion paths were known, it would be possible to adapt works on recurrent neural models for tree structures, such as successfully proposed in (Tai et al., 2015) for NLP tasks. Unfortunately, in most applications the topology of the diffusion is unknown at learning time. In the task considered in this paper, the only observations available are the timestamps of the infected nodes. To cope with this, (Wang et al., 2017a) proposed Topo-LSTM, a Long Short-Term Memory network that uses a known graph of relationships between nodes to compute the hidden states and cells of infected nodes. The hidden state and cell of a given node at time $t$ depend on those of each of its predecessors that were infected before $t$. Since nodes may have multiple predecessors infected before time $t$, the classical LSTM cannot be applied directly. Instead, (Wang et al., 2017a) proposed a cell function that aggregates infector candidate states via mean pooling. This makes it possible to take the topology of the possible diffusion into account, but not the past trajectory of the content (it averages over all possible paths). To overcome this, (Wang et al., 2017b) proposed a cascade attention-based RNN, which defines a neural attention mechanism to assign weights to infector candidates before summing their contributions in the new hidden state at time $t$. The attention network is supposed to learn to identify from whom the next infection comes, based on past states. However, such an approach is likely to converge to attention weight vectors mostly lying in the center of the simplex, since diffusion is a stochastic process with mostly very weak influence probabilities. The deterministic inference process of the approach limits its ability to produce relevant states, mixing from multiple candidates rather than sampling the past trajectories from their posterior probabilities. Note the similar approach in
(Wang et al., 2018), which does not use an RNN but defines a composition module of the past via attention networks to embed the episode in a representation space from which the probability of the next infected node can be deduced. Beyond the limits discussed above w.r.t. deterministic mixing of diffusion trajectories, no infection delay is considered in this work, which makes it impossible to use for diffusion modeling purposes. Recently, many works in representation learning used random walks on graphs to sample trajectories from which a network embedding respecting some topological constraints can be learned. While DeepWalk (Perozzi et al., 2014) only uses structural information, the models proposed in (Nguyen et al., 2018) or (Shi et al., 2018) include temporal constraints in random walks to sample feasible trajectories w.r.t. observed diffusion episodes. The DeepCas approach from (Li et al., 2016) applies this idea to the prediction of diffusion cascades. However, such approaches require a graph of diffusion relations as input, which is not always available (and not always representative of the true diffusion channels of the considered network, as reported in (Ver Steeg & Galstyan, 2013)). In our work, we consider that no graph is available beforehand. Moreover, no inference process is introduced in DeepCas to sample trajectories from their posterior probabilities given the observed diffusion sequences. The sampling of trajectories is performed in an initialization step, before learning the parameters of the diffusion model.
In this paper, we propose the first Bayesian topological recurrent neural network for sequences with tree dependencies, which we apply to diffusion cascade modeling. Rather than building on a preliminary random walk process, the idea is to consider trajectory inference during learning, in order to converge to better representations of the infected nodes. Following the stochastic nature of diffusion, the model infers trajectory distributions from observations of infections, which are in turn used for the inference of infection probabilities in an iterative learning process. Our probabilistic model, based on the well-known continuous-time independent cascade model (CTIC) (Saito et al., 2009), is able to extract full paths of diffusion from sequential observations of infections via black-box inference, which has multiple applications in the field. Our experiments validate the potential of the approach for modeling and prediction purposes.
2 Background
2.1 Information Diffusion
Information diffusion is observed as a set of diffusion episodes $\mathcal{D}$. Classically, episodes considered in this paper only contain the first infection event of each node (the earliest time a content reached the node). Let $\mathcal{N}$ be a set of nodes, with $n_0 \in \mathcal{N}$ standing for the world node, used to model influences from external factors or spontaneous infections (as done in (Gruhl et al., 2004) for instance). A diffusion episode $D$ reports the diffusion of a given content in the network as a sequence of infected nodes $(u^D_i)_{i=1..N_D}$ and a set of infection timestamps $t^D_u$ for all infected nodes $u$. $D$ is ordered chronologically w.r.t. the infection timestamps. Thus, $u^D_i$ corresponds to the $i$-th infected node in $D$, with $N_D$ the number of infected nodes in the diffusion. Every episode in $\mathcal{D}$ starts with the world node (i.e., $u^D_1 = n_0$ for all episodes $D$). We note $t^D_u$ the infection timestamp in $D$ of any node $u \in \mathcal{N}$, with $t^D_u = \infty$ for nodes not infected in $D$. Timestamps are relative w.r.t. the first infection, arbitrarily set to $0$ in the data. Note that we also set $t^D_{n_0} = 0$ for every episode $D$. In the following, $(u^D_i, t^D_i)$ denotes the $i$-th infected node in $D$ with its associated timestamp.
Cascades are richer structures than diffusion episodes, since they describe how a given diffusion happened. The cascade structure stores the first transmission event that succeeded from any node to each infected node. Thus, a cascade corresponds to a transmission tree rooted in $n_0$ and reaching the nodes infected during the diffusion, encoded as a sequence of infector indices $x^D = (x^D_1, \dots, x^D_{N_D})$: for any $i$ and any $j < i$, $x^D_i$ equals $j$ iff $u^D_j$ infected $u^D_i$ in the diffusion (i.e., $u^D_j$ is the infector of $u^D_i$). We arbitrarily set $x^D_1 = 0$ (no infector for the world node). Note that $x^D_2$ is always equal to $1$, since there is no other candidate for being the infector of $u^D_2$ than the world node $n_0$. For convenience, we note $\tilde{u}^D_i = u^D_{x^D_i}$ the infector of $u^D_i$ for $i > 1$. Note that cascade structures respect $x^D_i < i$ for every $D$ and all $i$ (the infector of a node is necessarily a node that was infected before it). Several different cascade structures are possible for a given diffusion episode. Cascade models usually perform assumptions on these latent diffusion structures to learn the diffusion parameters.
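To make the episode and cascade notation concrete, here is a minimal sketch (field and class names are hypothetical, not from the paper) of the two structures, including the constraint that an infector must have been infected before its infectee:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Episode:
    """A diffusion episode: nodes in order of infection, with timestamps.
    Index 0 is the world node, with timestamp 0."""
    nodes: List[int]     # nodes[i] = i-th infected node (nodes[0] = world)
    times: List[float]   # times[i] = infection timestamp of nodes[i]

@dataclass
class Cascade:
    """An episode plus a tree of infector indices."""
    episode: Episode
    infectors: List[int]  # infectors[i] = j iff nodes[j] infected nodes[i]; -1 for the root

    def is_valid(self) -> bool:
        # The world node has no infector; the first real node can only have been
        # infected by the world node; every infector precedes its infectee.
        ok_root = self.infectors[0] == -1
        ok_first = len(self.infectors) < 2 or self.infectors[1] == 0
        ok_order = all(0 <= self.infectors[i] < i
                       for i in range(1, len(self.infectors)))
        return ok_root and ok_first and ok_order
```

For instance, for an episode infecting nodes 4 then 7 after the world node, the infector vector `[-1, 0, 1]` (a chain) is valid, while `[-1, 0, 2]` is not, since a node cannot be infected by a later one.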
2.2 Cascade Models
The Independent Cascade model (IC) (Goldenberg et al., 2001) considers the spread of diffusion as cascades of infections over the network. We focus in this paper on cascade models such as IC, which tend to reproduce realistic temporal diffusion dynamics on social networks (Guille et al., 2013). The classical version of IC is an iterative process in which, at each iteration, every newly infected node $u$ gets a unique chance to infect any other node $v$ of the network with some probability $k_{u,v}$. The process iterates while new infections occur. Since the expectation-maximization algorithm proposed in (Saito et al., 2008) to learn its parameters, IC has been at the heart of diffusion works. However, in real life, diffusion occurs in continuous time, not in discrete steps as assumed in IC. (Lamprier et al., 2016) proposed DAIC, a delay-agnostic version of IC, where diffusion between nodes is assumed to follow uniform delay distributions rather than occurring in successive discrete time-steps. A representation learning version of DAIC has been proposed in (Bourigault et al., 2016), where nodes are projected in a continuous space in such a way that the distance between node representations renders the probability that diffusion occurs between them. This allowed the authors to obtain a more compact and robust model than the former version of DAIC. Beyond uniform time delay distributions, two main works deal with continuous-time diffusion. NetRate (Gomez-Rodriguez et al., 2011) learns parametric time-dependent distributions to best fit the observed infection timestamps. Like NetRate, CTIC (Saito et al., 2009)
uses exponential distributions to model delays of diffusion between nodes, but rather than a single parameter for each possible relationship, delays and influence factors are considered as separate parameters, which leads to more freedom w.r.t. observed diffusion tendencies. Delay and influence parameters are learned jointly by an EM-like algorithm. Note the continuous-time cascade model extension in (Zhang et al., 2018), which embeds users in a representation space so that their relative positions both respect some structural community properties and can be used to explain infection timestamps of users. In our model, we consider that diffusion probabilities from any infected node $u$ depend on a latent state associated to $u$, which embeds the past trajectory of the diffused content. This state depends on the state of the node who first transmitted the content to $u$. Therefore, we need to rely on a continuous-time model such as CTIC (Saito et al., 2009), which serves as a basis for our work. In CTIC, two parameters are defined for each pair of nodes $(u, v)$ in the network: $k_{u,v}$, which corresponds to the probability that node $u$ succeeds in infecting $v$, and $r_{u,v}$, which corresponds to a time-delay parameter used in an exponential distribution when $u$ infects $v$. If $u$ succeeds in infecting $v$ in an episode $D$, $v$ is infected at time $t^D_u + \delta$, where $\delta \sim \mathcal{E}(r_{u,v})$. These parameters are learned by maximizing the following likelihood on a set of episodes $\mathcal{D}$:
$L(\mathcal{D}; \theta) = \prod_{D \in \mathcal{D}} \left( \prod_{i=2}^{N_D} h^D_i \right) \prod_{v \notin D} g^D_v$  (1)

where $h^D_i$ stands for the probability that $u^D_i$ is infected at $t^D_i$ by previously infected nodes in $D$, and $g^D_v$ is the probability that $v$ is not infected by any infected node in $D$.
We build on this in the following, but rather than considering a pair of parameters $k_{u,v}$ and $r_{u,v}$ for each pair of nodes (which implies $O(|\mathcal{N}|^2)$ parameters to store), we propose to consider neural functions which output the corresponding parameters according to the hidden state of the emitter $u$, depending on its ancestor branch in the cascade, and a continuous embedding of the receiver $v$.
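As a sketch of this kind of neural parameterization (function names are hypothetical; the paper's exact forms are given in Section 3), the transmission probability can be produced by a sigmoid over a dot product between the emitter state and a receiver embedding, and the delay parameter by the absolute value of a dot product between two static embeddings:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def transmission_prob(z_u, omega_v):
    # probability that the emitter (with hidden state z_u) infects node v,
    # guaranteed to lie in (0, 1) by the sigmoid
    return sigmoid(dot(z_u, omega_v))

def delay_param(omega_src_u, omega_dst_v):
    # rate of the exponential delay distribution; abs keeps it non-negative
    return abs(dot(omega_src_u, omega_dst_v))
```

Storing one embedding per node instead of one parameter per node pair reduces the parameter count from quadratic to linear in the number of nodes.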
3 Recurrent Neural Diffusion Model
This section first presents our recurrent generative model of cascades. Then, it details the proposed learning algorithm.
3.1 Recurrent Diffusion Model
As discussed above, we consider that each infected node $u$ in an episode $D$ owns a state $z^D_u \in \mathbb{R}^d$ depending on the path the content followed to reach $u$ in $D$, with $d$ the dimension of the representation space. Knowing the state $z^D_u$ of the node $u$ that first infected $v$, the state $z^D_v$ is computed as:

$z^D_v = f_\theta(z^D_u, \omega_v)$  (2)

with $f_\theta$ a function, with parameters $\theta$, that transforms the state of $u$ according to a shared representation $\omega_v$ of the node $v$. This function can either be an Elman RNN cell, a multi-layer perceptron (MLP) or a Gated Recurrent Unit (GRU). An LSTM could also be used here, but $z^D_u$ should include both the cell and the hidden state of $u$ in that case.

Given a state $z^D_u$ for an infected node $u$ in $D$, the probability that $u$ succeeds in transmitting the diffused content to $v$ is given by:

$k_{z^D_u, v} = \sigma({z^D_u}^\top \omega_v)$  (3)

where $\sigma$ stands for the sigmoid function and $\omega_v$ is an embedding of size $d$ for any node $v$. Similarly to the CTIC model, if a node $u$ succeeds in infecting another node $v$, the delay of infection follows an exponential distribution with parameter $r_{u,v}$. To simplify learning, we assume that the delay of infection does not depend on the history of diffusion; only the probability of infection does. Thus, for a given pair $(u, v)$, the delay parameter $r_{u,v}$ is the same for every episode:

$r_{u,v} = |{\omega'_u}^\top \omega''_v|$  (4)

with $|\cdot|$ the absolute value of a real scalar, and $\omega'_u$ and $\omega''_v$ two embeddings of size $d$ for any node, $\omega'$ for the source of the transition and $\omega''$ for its destination, in order to enable asymmetric behavior.
We set $\Theta = (\theta, z_{n_0}, \{\omega_v, \omega'_v, \omega''_v\}_{v \in \mathcal{N}})$ as the parameters of our model. The generative process, similar to the one of CTIC, is given in algorithm 1 in the appendix. In this process, the state $z_{n_0}$ of the initial node, the world node $n_0$, is a parameter vector to be learned. The process iterates while there remain some nodes in a set of infectious nodes (initialized with $\{n_0\}$). At each iteration, the process selects the infectious node $u$ with minimal timestamp of infection, removes it from the set of infectious nodes, records it as infected, and attempts to infect each non-infected node $v$ according to the probability $k_{z_u, v}$. If it succeeds, a time $t_u + \delta$, with $\delta$ sampled from an exponential law with parameter $r_{u,v}$, is proposed for $v$. If this new time for $v$ is lower than its current infection time (initialized with $\infty$), it is stored as $t_v$, $v$ is added to the set of infectious nodes, and its new state $z_v$ is computed according to its new infector $u$.
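The generative process described above can be sketched as follows (the actual algorithm is given in the paper's appendix; the helper interfaces `step_fn`, `k_fn` and `r_fn` are hypothetical stand-ins for the neural functions of Eqs. 2-4):

```python
import math
import random

def simulate_cascade(nodes, z_world, step_fn, k_fn, r_fn,
                     horizon=float("inf"), rng=random):
    """Simulate one cascade. step_fn(z_u, v) -> new state of v (Eq. 2),
    k_fn(z_u, v) -> transmission probability, r_fn(u, v) -> exponential
    delay rate. Returns infection times, states and infectors."""
    world = -1                                   # hypothetical id for the world node
    t = {v: math.inf for v in nodes}
    t[world] = 0.0
    state, infector = {world: z_world}, {}
    infectious, infected = {world}, set()
    while infectious:
        u = min(infectious, key=lambda v: t[v])  # infectious node with minimal time
        infectious.discard(u)
        infected.add(u)
        if t[u] > horizon:
            break
        for v in nodes:
            if v in infected:
                continue
            if rng.random() < k_fn(state[u], v):           # transmission succeeds
                delay = rng.expovariate(max(r_fn(u, v), 1e-8))
                if t[u] + delay < t[v]:                    # keep the earliest infection
                    t[v] = t[u] + delay
                    infector[v] = u
                    state[v] = step_fn(state[u], v)        # state follows the new path
                    infectious.add(v)
    return t, state, infector
```

Note that a node already in the infectious set can still see its infection time and state overwritten by an earlier competing transmission, mirroring the "first successful transmission wins" semantics of the cascade structure.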
3.2 Learning the Model
As in CTIC, we need to define the probability that node $u$ infects node $v$ at time $t$ with our model. Given a state $z_u$ for $u$ in $D$, we have:

$P^t_{z_u}(v) = k_{z_u,v} \, r_{u,v} \, e^{-r_{u,v}(t - t_u)}$  (5)

Also, the probability that $u$ does not infect $v$ before $t$, given a state $z_u$ for $u$ in $D$, is:

$\overline{P^t_{z_u}}(v) = 1 - k_{z_u,v}\left(1 - e^{-r_{u,v}(t - t_u)}\right)$  (6)

The probability density that node $v$ is infected at time $t_v$, given a set of states for all nodes infected before $t_v$, is:

$h^D_v = \sum_{u: t_u < t_v} P^{t_v}_{z_u}(v) \prod_{u' \neq u: t_{u'} < t_v} \overline{P^{t_v}_{z_{u'}}}(v)$  (7)

$\phantom{h^D_v} = \left( \prod_{u: t_u < t_v} \overline{P^{t_v}_{z_u}}(v) \right) \sum_{u: t_u < t_v} \frac{P^{t_v}_{z_u}(v)}{\overline{P^{t_v}_{z_u}}(v)}$  (8)

where $z_u$ is the state of node $u$ in $D$. Similarly, the probability that node $v$ is not infected in $D$ at the end of the observation time $T$, given a set of states for all nodes infected in $D$, is:

$g^D_v = \prod_{u \in D} \overline{P^{T}_{z_u}}(v) \approx \prod_{u \in D} \left(1 - k_{z_u,v}\right)$  (9)

where the approximation is done assuming a sufficiently long observation period.
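These quantities can be sketched numerically as follows (helper names are ours, not the paper's; the rewriting of the sum of Eq. 7 into the shared-product form of Eq. 8 avoids recomputing the survival product for each candidate infector):

```python
import math

def infect_density(k, r, t_u, t):
    # density that u infects v exactly at time t (Eq. 5)
    return k * r * math.exp(-r * (t - t_u))

def survival(k, r, t_u, t):
    # probability that u has not infected v before time t (Eq. 6)
    return 1.0 - k * (1.0 - math.exp(-r * (t - t_u)))

def h_density(cands, t):
    """Density that v is first infected at t, given candidate infectors
    cands = [(k, r, t_u), ...] all infected before t (Eqs. 7-8)."""
    prod_surv = 1.0
    for k, r, t_u in cands:
        prod_surv *= survival(k, r, t_u, t)
    return prod_surv * sum(infect_density(k, r, t_u, t) / survival(k, r, t_u, t)
                           for k, r, t_u in cands)

def g_not_infected(cands):
    # probability that v is never infected, long-horizon approximation (Eq. 9)
    prod = 1.0
    for k, _r, _t_u in cands:
        prod *= (1.0 - k)
    return prod
```

With a single candidate infector, the product and the ratio cancel and `h_density` reduces to `infect_density`, as expected.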
The learning process of our model is based on likelihood maximization, similarly to maximizing eq. 1 in the classical CTIC model. However, in our case the infection probabilities depend on hidden states associated to the infected nodes. Since observations only contain infection timestamps, this requires marginalizing over every possible sequence of ancestors $x$ for every episode $D$:

$\log P(D) = \log \sum_{x \in \mathcal{X}_D} P(D, x)$  (10)

where $\mathcal{X}_D$ is the set of all possible ancestor sequences for $D$, and $P(D, x)$ corresponds to the joint probability of the episode $D$ and an ancestor sequence $x$. Computing this sum directly would be intractable using our recurrent cascade model, since it would imply estimating the probability of any infection in $D$ according to the full ancestor sequence. Fortunately, using the Bayesian chain rule, the joint probability can be written as:

$P(D, x) = \prod_{i=2}^{N_D} P(u^D_i, t^D_i \mid D_{<i}, x_{<i}) \, P(x_i \mid D_{\leq i}, x_{<i})$  (11)

where $D_{<i}$ corresponds to the sequence of the first $i-1$ components of $D$ (the first $i-1$ infected nodes with their associated timestamps) and $x_{<i}$ stands for the vector containing the first $i-1$ components of $x$. We have for every $i$: $P(u^D_i, t^D_i \mid D_{<i}, x_{<i}) = h^D_i$, computed from a set $Z_{<i}$ containing the states of the first $i-1$ infected nodes in $D$, which can be deduced from $D_{<i}$ and $x_{<i}$ using equation 2. The probability $P(x_i = j \mid D_{\leq i}, x_{<i})$ is the conditional probability that $u^D_j$ was the node who first infected $u^D_i$, given all the previous infection events and the fact that $u^D_i$ was infected at $t^D_i$ by one of the previously infected nodes in $D$. It can be obtained, according to formula 8, via:

$P(x_i = j \mid D_{\leq i}, x_{<i}) = \dfrac{P^{t^D_i}_{z_j}(u^D_i) \,/\, \overline{P^{t^D_i}_{z_j}}(u^D_i)}{\sum_{j'=1}^{i-1} P^{t^D_i}_{z_{j'}}(u^D_i) \,/\, \overline{P^{t^D_i}_{z_{j'}}}(u^D_i)}$  (12)

with $z_j$ the state of the infector candidate $u^D_j$, as stored in $Z_{<i}$.
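Sampling the infector of a newly infected node from this conditional distribution reduces to normalizing the per-candidate density/survival ratios appearing in the sum of Eq. 8. A sketch (with hedged helper names, and candidates described by their transmission probability, delay rate and infection time):

```python
import math
import random

def infector_posterior(cands, t):
    """cands = [(k, r, t_u), ...] for nodes infected before t. Returns the
    conditional probability that each candidate was the first to infect v at t."""
    # ratio density / survival for each candidate, per the factorization of Eq. 8
    scores = [k * r * math.exp(-r * (t - t_u)) /
              (1.0 - k * (1.0 - math.exp(-r * (t - t_u))))
              for k, r, t_u in cands]
    z = sum(scores)
    return [s / z for s in scores]

def sample_infector(cands, t, rng=random):
    # draw one candidate index from the categorical posterior above
    probs = infector_posterior(cands, t)
    u, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1
```

With a single candidate the posterior is degenerate at that candidate, which matches the constraint that the second infected node can only have been infected by the world node.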
Unfortunately, the log-likelihood from formula 10 is still particularly difficult to optimize directly, since it requires considering every possible vector $x \in \mathcal{X}_D$ for each training episode at each optimization iteration. Moreover, the probability products in formula 11 would lead to zero gradients because of floating-point representation limits. Therefore, we need to define an approach where the optimization can be done via trajectory sampling. Different choices would be possible. First, MCMC approaches such as Gibbs-sampling EM could be used, but they require sampling from the posteriors of the full trajectories of the cascades, which is very unstable and complex to perform. The full computation of the posterior distributions could be avoided by using simpler proposal distributions (as done for instance via importance sampling with auxiliary variables in (Farajtabar et al., 2015) for diffusion source detection), but this would face a very high variance in our case. Another possibility is to adopt a variational approach (Kingma & Welling, 2013), where an auxiliary distribution is learned for the inference of the latent variables. As done in (Krishnan et al., 2016) for inference in sequences, a smoothing strategy could be developed by relying on a bidirectional RNN that would consider past and future infections for the inference of the ancestor $x_i$ of every infected node $u^D_i$ in an episode $D$. However, learning the parameters of such a distribution is particularly difficult (episodes of different lengths, cascades considered as sequences, etc.). Also, another possibility for smoothing would be to define an independent distribution for every episode and every infection. However, this induces a huge number of variational parameters, increasing with the size of the training set (linearly in the number of training episodes and quadratically in the size of the episodes). Thus, we propose to rather rely on the conditional distribution of ancestors given the past for sampling (i.e., $x_i \sim P(x_i \mid D_{\leq i}, x_{<i})$), which corresponds to a filtering inference process. From the Jensen inequality on concave functions, we get for a given episode $D$:
$\log P(D) = \log \sum_{x \in \mathcal{X}_D} Q(x \mid D) \dfrac{P(D, x)}{Q(x \mid D)} \;\geq\; \mathbb{E}_{x \sim Q(\cdot \mid D)}\left[\log \dfrac{P(D, x)}{Q(x \mid D)}\right]$  (13)

where $Q(x \mid D) = \prod_{i=2}^{N_D} P(x_i \mid D_{\leq i}, x_{<i})$. This leads to a lower bound of the log-likelihood, which corresponds to an expectation from which it is easy to sample: at each new infection of a node $u^D_i$ in an episode $D$, we can sample $x_i$ from a distribution depending on the past only. Maximizing this lower bound (also called the ELBO) encourages the process to choose trajectories that best explain the observed episode. To maximize it via stochastic optimization, we refer to the score function estimator (Ranganath et al., 2014), which leverages the derivative of the log function ($\nabla \log f = \nabla f / f$) to express the gradient as an expectation from which we can sample. Another possibility would have been to rely on the Gumbel-Softmax and the Concrete distribution with reparametrization, as done in (Maddison et al., 2016), but we observed considerably better results using the log-trick. The gradient of the ELBO function for all the episodes is given by:

$\nabla_\Theta \mathcal{L} = \sum_{D \in \mathcal{D}} \mathbb{E}_{x \sim Q(\cdot \mid D)}\left[ \left( \ell(D, x) - b_D \right) \nabla_\Theta \log Q(x \mid D) + \nabla_\Theta \ell(D, x) \right]$  (14)

where $\ell(D, x)$ is a shortcut for $\log \left( P(D, x) / Q(x \mid D) \right)$ and $b_D$ is a moving-average baseline of the ELBO per training episode, used to reduce the variance (taken over the 100 previous training epochs in our experiments). This stochastic gradient formulation enables obtaining unbiased steepest-ascent directions despite the need to sample the ancestor vectors for the computation of the node states (with the replacement of expectations by averages over
samples for each episode). It contains two terms: while the first one encourages high conditional probabilities for ancestors that maximize the likelihood of the full episodes, the second one leads to improving the likelihood of the observed infections regarding the past of the sampled diffusion path.

4 Experiments
4.1 Setup
We perform experiments on two artificial and three realworld datasets:

Arti1: Episodes generated on a scale-free random graph of 100 nodes. The generation process follows the CTIC model, but rather than a single transmission probability parameter per edge, we set 5 different parameters per edge, depending on the nature of the diffusion. Before each simulation, a diffusion-nature index is sampled, which determines the parameters to use. 10000 episodes for training, 5000 for validation, 5000 for testing. Mean length of the episodes = 7.55 (stdev = 5.51);

Arti2: Episodes sampled on the same graph as Arti1, also with CTIC, but where each transmission probability is a function of the transmitted content and the features of the receiver. A content vector is sampled from a Dirichlet distribution before each simulation, and the sigmoid of the dot product between this content and the edge features determines the transmission probabilities. Features of the hub nodes (nodes with a degree greater than 30) are sampled from a first Dirichlet prior (multi-content nodes), while those of the other nodes are sampled from a second, distinct Dirichlet prior (content-specific nodes). 10000 episodes for training, 5000 for validation, 5000 for testing. Mean length of the episodes = 6.89 (stdev = 7.7).

Digg: Data collected from the Digg stream API during one month. Infections are the "diggs" posted by users on links published on the media. We kept the 100 most active users from the collected data. 20000 episodes for training, 5000 for validation, 5000 for testing. Mean length of the episodes = 4.26 (stdev = 9.26).

Memetracker: The Memetracker dataset described in (Leskovec et al., 2009) contains millions of blog posts and news articles. Each website or blog stands as a user, and we use the phrase clusters extracted by Leskovec et al. (2009) to trace the flow of information. 500 nodes, 250000 episodes for training, 5000 for validation, 5000 for testing. Mean length of episodes = 8.68 (stdev = 11.45).
We compare our model recCTIC to the following temporal diffusion baselines:

CTIC: the ContinuousTime Independent Cascade model in its original version (Saito et al., 2009);

RNN: the Recurrent Marked Temporal Point Process (RMTPP) model from (Du et al., 2016), where episodes are considered as sequences that can be treated with a classical RNN outputting at each step the probability distributions of the next infected node and its timestamp;

CYAN: Similar to RMTPP but with an attention mechanism to select previous states (Wang et al., 2017b);

CYANcov: The same as CYAN but with a more sophisticated attention mechanism using a history of attention states, to give more weight to important nodes;

DAN: the attention model described in
(Wang et al., 2018). It is very similar to CYAN but uses a pooling mechanism rather than a recurrent network to consider the past in the predictions. In the version of (Wang et al., 2018), the model only predicts the next infected node at each step, not its time of infection. To enable a comparison with the other models, we extended it by adding a time prediction mechanism similar to the temporal head of CYAN. 
EmbCTIC: a version of our model where the node state is replaced in the diffusion probability computation (eq. 3) by a static embedding for the source (similarly to the formulation of the delay parameter in eq. 4). This corresponds to an embedded version of CTIC, similarly to the embedded version of DAIC from (Bourigault et al., 2016).
Note that, to adapt the RNN-based baselines for diffusion modeling and render them comparable to cascade-based ones, we add an "end node" at the end of each episode before training and testing them. This way, these models are able to model the end of the episodes by predicting this end node as the next event (with no time-delay prediction for this node, however).
Our model and the baselines were tuned by a grid-search process on a validation set for each dataset (although the best hyper-parameters obtained for Arti1 remained near-optimal for the other ones). For every model with an embedding space (i.e., all except CTIC), the dimension was selected on the validation set (larger dimensions induce a more difficult convergence of the learning process without significant gain in accuracy). The reported results for our model use a GRU module as the recurrent state transformation function $f_\theta$.
We evaluate our models on three distinct tasks:

Diffusion modeling: the performances of the methods are reported in terms of negative log-likelihood (NLL) of the test episodes. Lower values denote models that are less surprised by test episodes than others, reflecting their generalization ability. The NLL measure depends on the model, but in each case it renders the probability of an episode being observed according to the model, both on which nodes are eventually infected and at what time. For our model, which has to sample trajectories, the NLL is approximated via importance sampling, considering $P(D, x)/Q(x \mid D)$ computed on infector vectors $x$ sampled from $Q$, averaged over a fixed number of samples per episode in our experiments;

Diffusion generation: the models are compared on their ability to simulate realistic cascades. The aim is to predict the marginal probabilities of nodes to be eventually infected. The results are reported in terms of cross-entropy (CE) taken over the whole set of nodes for each episode: $CE(D) = -\frac{1}{|\mathcal{N}|} \sum_{v \in \mathcal{N}} \left[ \mathbb{1}(t^D_v < \infty) \log \hat{p}^D_v + \mathbb{1}(t^D_v = \infty) \log (1 - \hat{p}^D_v) \right]$, where $\mathbb{1}$ stands for the indicator function, returning 1 if its argument is true and 0 otherwise, and $\hat{p}^D_v$ is the marginal probability that $v$ is eventually infected. $\hat{p}^D_v$ is estimated via Monte-Carlo simulations (following the generation process of the models and counting the rate of simulations in which $v$ is infected). 1000 simulations are performed for each test episode in our experiments.

Diffusion path prediction: the models are assessed on their ability to choose the true infectors in observed diffusion episodes. This is only considered on the artificial datasets, for which we have the ground truth on who infected whom. The INF measure corresponds to the expected rate of true infectors chosen by the models, i.e., the average, over the infected nodes of each episode, of the probability the model assigns to the true infector of the $i$-th infected node. Since the RNN baseline has no selection mechanism, it is excluded from the results for this measure. For models with attention (CYAN and DAN), we consider the attention weights as selection probabilities. For cascade-based models, which explicitly model this, we directly use the corresponding probability. In our model, this corresponds to an expectation of $P(x_i \mid D_{\leq i}, x_{<i})$ over previous infectors sampled from $Q$.
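The Monte-Carlo estimation behind the CE measure can be sketched as follows (the `simulate` interface is a hypothetical stand-in for a model's generation process):

```python
import math

def marginal_infection_probs(simulate, nodes, n_sims=1000, eps=1e-6):
    """Estimate P(v eventually infected) by counting infections over n_sims
    simulated cascades; simulate() returns the set of infected nodes."""
    counts = {v: 0 for v in nodes}
    for _ in range(n_sims):
        for v in simulate():
            if v in counts:
                counts[v] += 1
    # clip away 0 and 1 so the cross-entropy below stays finite
    return {v: min(max(c / n_sims, eps), 1.0 - eps) for v, c in counts.items()}

def cross_entropy(probs, infected):
    # CE over the whole node set for one test episode
    return -sum(math.log(p) if v in infected else math.log(1.0 - p)
                for v, p in probs.items()) / len(probs)
```

Clipping the empirical rates is needed because a node never (or always) infected across the simulations would otherwise produce an infinite cross-entropy on a test episode where it is (or is not) infected.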
For each task, we report results with different amounts of initial observations from test episodes: infections that occurred before a given delay from the start of the episode are given as input to the models, from which they infer internal representations; evaluation measures are computed on the remainder of the episode. In tables 1 to 4, 0 means that nothing was initially observed: the models are not conditioned on the start of the episodes. 1 means that infections at the first timestamp are known beforehand; prediction and modeling results thus concern timestamps greater than 1 (models are conditioned on diffusion sources). 2 and 3 mean that infections that occurred before two increasing delays from the start of the episode are known and used to condition the models. Details on how to condition our model and the baselines w.r.t. the start of episodes are given in the appendix.
4.2 Results
Results on the two artificial datasets are given in table 1. While our approach shows significant improvements over other models for NLL and CE results on the Arti1 dataset (except for CE with weak conditioning), its potential is fully exhibited on the Arti2 dataset, where embedding the history for predicting the future of diffusion appears to be of great importance. Indeed, in this dataset, there exist some hub nodes through which most of the diffusion episodes transit, whatever the nature of the diffusion. In that case, the path of the diffusion contains very useful information that can be leveraged to predict infections after the hub node: the infection of the hub node is a necessary condition for the infection of its successors, but not a sufficient one, since this node is triggered in various kinds of situations. Depending on who transmitted the content to it, different successors are then infected. Markovian cascade models such as CTIC or embCTIC cannot model this, since infection probabilities only depend on disjoint past infection events, not on the paths taken by the propagated content. RNN-based models are better armed for this, but their performances are undermined by their way of aggregating past information. The attention mechanisms of CYAN and DAN attempt to overcome this, but they look quite unstable, with errors accumulating through time. Our approach appears as an effective compromise between both worlds, embedding the past as RNN approaches do while maintaining the Bayesian cascade structure of the data. Its results on the INF measure are markedly better than those of the other approaches. This is especially true on the Arti2 dataset, which highlights its very good ability to uncover the dynamics of diffusion, even for such complex problems with strong entanglement between diffusions of different natures.
The good behavior of the approach is not only observed on the artificial datasets, which were generated by the cascade model of CTIC on a graph of relationships, but also on real-world datasets. Tables 2 to 4 report results on the three real-world datasets. In these tables, we observe that RNN-based approaches have more difficulty modeling test episodes than cascade-based ones. The attention mechanisms of the CYAN and DAN approaches allow them to sometimes get closer to the cascade-based results (especially on Digg), but their performances are very variable from one dataset to another. These methods are good at the task they were initially designed for, predicting the directly next infection (as observed in our experiments), but not for modeling or long-term prediction purposes. This is a strong limitation, since the directly next infection does not help much to understand the dynamics and to predict the future of a diffusion. Our approach obtains better results than all other methods in most settings, especially for the dynamics modeling task (NLL), though infection prediction results (CE) are also usually good compared with its competitors. Interestingly, while embCTIC usually beats CTIC, recCTIC often obtains even better results. This validates that the history of nodes in the diffusion is of great importance for capturing the main dynamics of the network. Thanks to the black-box inference process and the recurrent mechanism of our proposal, the proposal distribution is encouraged to resemble the conditional probability of the full ancestor vector. Regarding the results, the inference process appears to have converged toward useful trajectories. This enables the model to adapt its distributions to the diffusion trajectory during learning. This also allows the model to simulate more consistent cascades with respect to the sources of diffusion.
5 Conclusion
We proposed a recurrent cascade-based diffusion modeling approach, which is at the crossroads of cascade-based and RNN approaches. It leverages the best of both worlds, with an ability to embed the history of diffusion for prediction while still capturing the tree dependencies underlying diffusion in networks. Results validate the approach both for modeling and prediction tasks.
In this work, we based the sampling of trajectories on a filtering approach, where only past observations are considered for the inference of the infector of a node. Ongoing work concerns the development of an inductive variational distribution that relies on whole observed episodes for inference.
References
 Bourigault et al. (2016) Simon Bourigault, Sylvain Lamprier, and Patrick Gallinari. Representation learning for information diffusion through social networks: an embedded cascade model. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, San Francisco, CA, USA, February 22-25, 2016, pp. 573–582, 2016. doi: 10.1145/2835776.2835817. URL http://doi.acm.org/10.1145/2835776.2835817.
 Du et al. (2016) Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pp. 1555–1564, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4232-2. doi: 10.1145/2939672.2939875. URL http://doi.acm.org/10.1145/2939672.2939875.
 Farajtabar et al. (2015) Mehrdad Farajtabar, Manuel Gomez-Rodriguez, Nan Du, Mohammad Zamani, Hongyuan Zha, and Le Song. Back to the past: Source identification in diffusion networks from partially observed cascades. CoRR, abs/1501.06582, 2015. URL http://arxiv.org/abs/1501.06582.
 Fu et al. (2013) K. Fu, Chung-hong Chan, and M. Chau. Assessing censorship on microblogs in China: Discriminatory keyword analysis and the real-name registration policy. Internet Computing, IEEE, 17(3):42–50, May 2013. ISSN 1089-7801. doi: 10.1109/MIC.2013.28.
 Goldenberg et al. (2001) Jacob Goldenberg, Barak Libai, and Eitan Muller. Talk of the network: A complex systems look at the underlying process of word-of-mouth. Marketing letters, 12(3):211–223, 2001.

 Gomez-Rodriguez et al. (2011) Manuel Gomez-Rodriguez, David Balduzzi, and Bernhard Schölkopf. Uncovering the temporal dynamics of diffusion networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), ICML ’11, pp. 561–568. ACM, 2011.
 Gruhl et al. (2004) Daniel Gruhl, R. Guha, David Liben-Nowell, and Andrew Tomkins. Information diffusion through blogspace. In Proceedings of the 13th International Conference on World Wide Web, WWW ’04, pp. 491–501, New York, NY, USA, 2004. ACM. ISBN 1-58113-844-X.
 Guille et al. (2013) Adrien Guille, Hakim Hacid, Cecile Favre, and Djamel A. Zighed. Information diffusion in online social networks: A survey. SIGMOD Rec., 42(2):17–28, July 2013. ISSN 0163-5808. doi: 10.1145/2503792.2503797. URL http://doi.acm.org/10.1145/2503792.2503797.
 Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma & Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Krishnan et al. (2016) R. G. Krishnan, U. Shalit, and D. Sontag. Structured Inference Networks for Nonlinear State Space Models. ArXiv e-prints, September 2016.
 Lamprier et al. (2016) Sylvain Lamprier, Simon Bourigault, and Patrick Gallinari. Influence learning for cascade diffusion models: focus on partial orders of infections. Social Netw. Analys. Mining, 6(1):93:1–93:16, 2016. doi: 10.1007/s13278-016-0406-1. URL https://doi.org/10.1007/s13278-016-0406-1.
 Leskovec et al. (2009) Jure Leskovec, Lars Backstrom, and Jon Kleinberg. Meme-tracking and the dynamics of the news cycle. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’09, pp. 497–506, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-495-9. doi: 10.1145/1557019.1557077.
 Li et al. (2016) Cheng Li, Jiaqi Ma, Xiaoxiao Guo, and Qiaozhu Mei. Deepcas: an end-to-end predictor of information cascades. CoRR, abs/1611.05373, 2016. URL http://arxiv.org/abs/1611.05373.
 Maddison et al. (2016) Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. CoRR, abs/1611.00712, 2016. URL http://arxiv.org/abs/1611.00712.
 Nguyen et al. (2018) Giang Hoang Nguyen, John Boaz Lee, Ryan A. Rossi, Nesreen K. Ahmed, Eunyee Koh, and Sungchul Kim. Continuous-time dynamic network embeddings. In Companion Proceedings of the The Web Conference 2018, WWW ’18, pp. 969–976, Republic and Canton of Geneva, Switzerland, 2018. International World Wide Web Conferences Steering Committee. ISBN 978-1-4503-5640-4. doi: 10.1145/3184558.3191526. URL https://doi.org/10.1145/3184558.3191526.
 Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. CoRR, abs/1403.6652, 2014. URL http://arxiv.org/abs/1403.6652.
 Ranganath et al. (2014) R. Ranganath, S. Gerrish, and D. M. Blei. Black Box Variational Inference. ArXiv e-prints, December 2014.
 Saito et al. (2008) Kazumi Saito, Ryohei Nakano, and Masahiro Kimura. Prediction of information diffusion probabilities for independent cascade model. In Proceedings of the 12th international conference on Knowledge-Based Intelligent Information and Engineering Systems, Part III, KES ’08, pp. 67–75. Springer-Verlag, 2008.
 Saito et al. (2009) Kazumi Saito, Masahiro Kimura, Kouzou Ohara, and Hiroshi Motoda. Learning continuous-time information diffusion model for social behavioral data analysis. In Proceedings of the 1st Asian Conference on Machine Learning: Advances in Machine Learning, ACML ’09, pp. 322–337, Berlin, Heidelberg, 2009. Springer-Verlag. ISBN 978-3-642-05223-1.
 Shi et al. (2018) Yong Shi, Minglong Lei, Peng Zhang, and Lingfeng Niu. Diffusion based network embedding. CoRR, abs/1805.03504, 2018. URL http://arxiv.org/abs/1805.03504.
 Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. CoRR, abs/1503.00075, 2015. URL http://arxiv.org/abs/1503.00075.
 Ver Steeg & Galstyan (2013) Greg Ver Steeg and Aram Galstyan. Information-theoretic measures of influence based on content dynamics. In Proceedings of the sixth ACM international conference on Web search and data mining, WSDM ’13, pp. 3–12, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-1869-3. doi: 10.1145/2433396.2433400.
 Wang et al. (2017a) Jia Wang, Vincent W. Zheng, Zemin Liu, and Kevin ChenChuan Chang. Topological recurrent neural network for diffusion prediction. CoRR, abs/1711.10162, 2017a. URL http://arxiv.org/abs/1711.10162.

 Wang et al. (2017b) Yongqing Wang, Huawei Shen, Shenghua Liu, Jinhua Gao, and Xueqi Cheng. Cascade dynamics modeling with attention-based recurrent neural network. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 2985–2991, 2017b. doi: 10.24963/ijcai.2017/416. URL https://doi.org/10.24963/ijcai.2017/416.
 Wang et al. (2018) Zhitao Wang, Chengyao Chen, and Wenjie Li. Attention network for information diffusion prediction. In Companion Proceedings of the The Web Conference 2018, WWW ’18, pp. 65–66, Republic and Canton of Geneva, Switzerland, 2018. International World Wide Web Conferences Steering Committee. ISBN 978-1-4503-5640-4. doi: 10.1145/3184558.3186931. URL https://doi.org/10.1145/3184558.3186931.
 Zhang et al. (2018) Yuan Zhang, Tianshu Lyu, and Yan Zhang. Cosine: Community-preserving social network embedding from information diffusion cascades. In AAAI, 2018.
6 Appendix
6.1 Joint Probability
In this section, we detail the derivation of the joint probability whose formulation is given in equation 11. For each node infected at a given position in the episode, we need to compute:

the probability for the node of being infected at its time of infection, given the nodes previously infected in the episode and the states associated to these nodes;

the probability of the ancestor index of this infection, given the infection itself and the previous infections associated to their states;

the probability that uninfected nodes are actually not infected by this newly infected node, given its state.
This gives:
(15)  
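In generic notation (an episode $c$ with infected nodes $u^c_1, \dots, u^c_{|c|}$ at times $t^c_1 < \dots < t^c_{|c|}$, sampled ancestor indices $z^c_i$, and recurrent states $h_i$; these symbols are illustrative assumptions, not necessarily the paper's notation), the three factors listed above combine multiplicatively over the infections of the episode:

```latex
P(c, z \mid \theta)
  \;=\; \prod_{i=2}^{|c|}
      \underbrace{P\big(t^c_i \,\big|\, z^c_i, h_{z^c_i}\big)}_{\text{(i) infection at its time}}
      \;\underbrace{P\big(z^c_i \,\big|\, u^c_i, t^c_{1:i-1}, h_{1:i-1}\big)}_{\text{(ii) ancestor index}}
  \;\times\; \prod_{i=1}^{|c|} \; \prod_{v \notin c}
      \underbrace{\Big(1 - P_{\mathrm{inf}}\big(v \,\big|\, u^c_i, h_i\big)\Big)}_{\text{(iii) non-infections}}
```

For the source of the episode, only the non-infection factors apply, since it has no ancestor nor infection-time term.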
6.2 Generation Process
The generation process of our model is given in algorithm 1. The process iterates while there remain nodes in a set of infectious nodes (initialized with ).  denotes the concatenation between two lists. At each iteration, the process selects the infectious node with the minimal infection timestamp (all timestamps but  are initialized to ), removes it from the infectious set and records its infector and infection timestamp in the cascade. Then, for each node  with a timestamp greater than that of ,  attempts to infect  according to the probability  (computed with eq 3). If it succeeds,  is inserted in the set of infectious nodes and a time is sampled for  from an exponential law with parameter . If the new time for  is lower than its stored time , this new time is stored in ,  is stored as the infector of  in the table  (used to build ) and the new state is computed according to the new infector . The generation process outputs a cascade structure (as described in the introduction of the previous section). Compared with the classical CTIC, the only changes are at lines 1, 1 and 1, respectively for the computation of ,  and .
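The control flow described above (repeated selection of the infectious node with minimal timestamp, infection attempts on later nodes, keeping the earliest successful attempt and updating the recurrent state accordingly) can be sketched as follows. The helpers `infection_prob` and `state_update` are hypothetical placeholders for the model's learned infection probability (eq. 3) and recurrent state update (eq. 2); only the loop structure follows the text.

```python
import heapq
import numpy as np

rng = np.random.default_rng(0)
n_nodes, dim = 5, 3
W = rng.normal(size=(n_nodes, n_nodes, dim))   # placeholder pairwise parameters
emb = rng.normal(size=(n_nodes, dim))          # placeholder node embeddings
rate = 1.0                                     # parameter of the exponential delay law

def infection_prob(u, v, h_u):
    # Placeholder for the model's infection probability (eq. 3).
    return 1.0 / (1.0 + np.exp(-(h_u @ W[u, v])))

def state_update(h_u, v):
    # Placeholder for the recurrent state of v given its infector's state (eq. 2).
    return np.tanh(h_u + emb[v])

def generate_cascade(source, t0=0.0):
    """Simulate one cascade from `source`, following the described process."""
    times = {v: np.inf for v in range(n_nodes)}
    infector = {source: None}
    states = {source: np.zeros(dim)}
    times[source] = t0
    infectious = [(t0, source)]        # min-heap on infection time
    cascade, done = [], set()
    while infectious:
        t_u, u = heapq.heappop(infectious)
        if u in done or t_u > times[u]:      # skip stale heap entries
            continue
        done.add(u)
        cascade.append((u, infector[u], t_u))  # record infector and timestamp
        for v in range(n_nodes):
            if times[v] <= t_u:              # only nodes with a later timestamp
                continue
            if rng.random() < infection_prob(u, v, states[u]):
                t_v = t_u + rng.exponential(1.0 / rate)
                if t_v < times[v]:           # keep the earliest successful attempt
                    times[v] = t_v
                    infector[v] = u
                    states[v] = state_update(states[u], v)
                    heapq.heappush(infectious, (t_v, v))
    return cascade
```

A lazy-deletion heap replaces the explicit "remove from the infectious set" of the algorithm: stale entries are simply skipped when popped.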
6.3 Learning Process
The learning process of our model is depicted in algorithm 2. In this algorithm, the function  first creates mini-batches by ordering  in decreasing length and cutting this ordered list into bins of  episodes each. Each bin contains 3 matrices with  rows (except the last bin, which contains matrices for the remaining episodes):

: a matrix where the cell () contains the  infected node in the th episode of the bin, or  if the corresponding episode contains fewer than  infected nodes. The width of the matrix equals the number of infected nodes in the longest episode of the bin (the episode in the first row of the matrix);

: a matrix where the cell () contains the infection timestamp of the  infected node in the th episode of the bin, or  if the corresponding episode contains fewer than  infected nodes. The width of the matrix equals the number of infected nodes in the longest episode of the bin (the episode in the first row of the matrix);

: a binary matrix with  columns where the cell () equals  if the node is infected in the th episode of the bin,  otherwise;
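A minimal sketch of this binning step is given below. The function name `make_bins` and the padding sentinels `PAD_NODE` and `PAD_TIME` are illustrative assumptions (the paper's actual null values are not specified in the text); the sorting, cutting, and three per-bin matrices follow the description above.

```python
import numpy as np

PAD_NODE = -1       # hypothetical sentinel for "episode already ended"
PAD_TIME = np.inf   # hypothetical sentinel timestamp

def make_bins(episodes, n_nodes, B):
    """Sort episodes by decreasing length, cut into bins of B episodes,
    and build the (nodes, timestamps, infected-mask) matrices per bin."""
    episodes = sorted(episodes, key=len, reverse=True)
    bins = []
    for start in range(0, len(episodes), B):
        chunk = episodes[start:start + B]       # last bin may be smaller
        width = len(chunk[0])                   # longest episode sits first
        U = np.full((len(chunk), width), PAD_NODE, dtype=int)
        T = np.full((len(chunk), width), PAD_TIME)
        I = np.zeros((len(chunk), n_nodes), dtype=int)
        for j, ep in enumerate(chunk):          # ep = [(node, time), ...]
            for i, (u, t) in enumerate(ep):
                U[j, i] = u                     # i-th infected node of episode j
                T[j, i] = t                     # its infection timestamp
                I[j, u] = 1                     # binary infection indicator
        bins.append((U, T, I))
    return bins
```

Sorting by decreasing length keeps padding minimal inside each bin, since each bin's width is set by its own longest episode.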
At each epoch, the algorithm iterates over every bin. For each bin, it first initializes the states of the infected nodes using a function  which produces a tensor of  matrices, each row of which is filled with  (with  and  respectively the number of rows and columns in matrix ). For every step of infection  in the bin, the process first determines in  the rows of the matrices which correspond to episodes that have not ended ( refers to the  column of ). Then, if the step is not the initial step , it uses the functions  and  with the nodes previously infected for each episode, associated to their corresponding states . While the function  returns a matrix where the cell  contains the log-probability for the th node in the th episode to infect  at its infection timestamp (using a matrix version of equation 5), the function  returns a matrix of the same shape where the cell  contains the log-probability that the th node in the th episode does not infect  before its infection timestamp (using a matrix version of equation 6). Then, ancestors at step  are sampled from categorical distributions parameterized by  (deduced from the logits ). From them, we compute the log-probability for each node infected at step  to be actually infected at its timestamp of infection by its sampled infector (line 2, where  is a function which returns the vector of the sums of each row of ). This quantity is added to the accumulator . Line 2 then computes the states for the nodes infected at step  according to the states of the sampled ancestors (via the function , which is a matrix version of equation 2).
At the end of each iteration , the log-likelihood that uninfected nodes in  are actually not infected by the nodes infected at step  is computed via , which is a matrix version of equation 9. This quantity is added to the accumulator .
At the end of the bin (when ), a control variate baseline is computed by maintaining a list of the quantity vectors considered in . The baseline considered in the stochastic gradient for any episode is thus equal to the average of this quantity for that specific episode, taken over the  previous epochs.
Finally, the gradients are computed and the ADAM optimizer is used to update the parameters of the model. Note that this algorithm does not use the gradient update given in eq. 14. It is based on  and  for every  (rather than on the simplification given in eq. 13). This is equivalent but far more efficient since, in both cases,  needs to be estimated for sampling, and  is much easier to compute than  (the former involves a simple product while the latter involves a sum of products).