A Variational Topological Neural Model for Cascade-based Diffusion in Networks

Many works have been proposed in the literature to capture the dynamics of diffusion in networks. While some of them define graphical Markovian models to extract temporal relationships between node infections in networks, others consider diffusion episodes as sequences of infections via recurrent neural models. In this paper we propose a model at the crossroads of these two extremes, which embeds the history of diffusion in infected nodes as hidden continuous states. Depending on the trajectory followed by the content before reaching a given node, the distribution of influence probabilities may vary. However, content trajectories are usually hidden in the data, which induces challenging learning problems. We propose a topological recurrent neural model which exhibits good experimental performance for diffusion modelling and prediction.



1 Introduction

The recent development of online social networks has enabled researchers to propose methods to explain and predict observations of diffusion across networks. Classical cascade models, which are at the heart of the research literature on information diffusion, regard the phenomenon of diffusion as an iterative process in which information transits from users to others in the network (Saito et al., 2008; Gomez-Rodriguez et al., 2011), by a so-called word-of-mouth phenomenon. In this setting, diffusion modeling corresponds to learning probability distributions of content transmission. Various cascade models have been proposed in the literature, each inducing its own learning process to explain some observed diffusion episodes and attempting to extract the main dynamics of the network. However, most of these models rely on a strong Markovian assumption for which the probabilities of next infections (throughout this paper, infection denotes the participation of a node of the network in the diffusion) only depend on who is already infected at the current step, not on the past trajectories of the diffused content. We claim that the history of spread contains much valuable information that should be taken into account by the models. Past trajectories of the diffusion can give insights about the different natures of the contents. Also, the content may be changed during the diffusion, with different transformations depending on which nodes re-transmit the information.

On the other hand, some recent approaches rely on representation learning and recurrent neural networks (RNN) to predict the future spread of diffusion given the past. A naive possibility would be to consider diffusion episodes as sequences of infections and propose temporal point process approaches to model the dynamics. Using the Recurrent Marked Temporal Point Process model (Du et al., 2016), the current hidden state of the RNN would embed the history of the whole diffusion sequence, which would be used to output the next infected node and its time of infection. However, since diffusion episodes are not sequences but trees, naive recurrent methods usually fail to capture the true dynamics of the networks. Embedding the whole past in the state of a given node, rather than restricting it to its specific ancestor branch, leads to considering many independent and noisy events for future predictions. A model that would consider the true diffusion paths would be more effective, by focusing on the useful past. If the true diffusion paths were known, it would be possible to adapt works on recurrent neural models for tree structures, as successfully proposed in (Tai et al., 2015) for NLP tasks. Unfortunately, in most applications the topology of diffusion is unknown while learning. In the task considered in this paper, the only observations available are the timestamps of the infected nodes.

To cope with this, (Wang et al., 2017a) proposed Topo-LSTM, a Long Short-Term Memory network that considers a known graph of relationships between nodes to compute hidden states and cells of infected nodes. The hidden state and cell of a given node at time $t$ depend on those from each of its predecessors that have been infected before $t$. Since nodes may have multiple predecessors that are infected at time $t$, the classical LSTM cannot be applied directly. Instead, (Wang et al., 2017a) proposed a cell function that aggregates infector candidate states via mean pooling. This takes the topology of the possible diffusion into account, but not the past trajectory of the content (it averages all possible paths). To overcome this, (Wang et al., 2017b) proposed a cascade attention-based RNN, which defines a neural attention mechanism to assign weights to infector candidates before summing their contributions in the new hidden state at $t$. The attention network is supposed to learn to identify from whom the next infection comes based on past states. However, such an approach is likely to converge to attention weight vectors that mostly lie in the center of the simplex, since diffusion is a stochastic process with mostly very weak influence probabilities. The deterministic inference process of the approach limits its ability to produce relevant states, since it mixes multiple candidates rather than sampling the past trajectories from their posterior probabilities. Note the similar approach in (Wang et al., 2018), which does not use an RNN but defines a composition module of the past via attention networks to embed the episode in a representation space from which the probability of the next infected node can be deduced. Beyond the limits discussed above w.r.t. deterministic mixing of diffusion trajectories, no delay of infection is considered in this work, which makes it impossible to use for diffusion modeling purposes.

Recently, many works in representation learning used random walks on graphs to sample trajectories that can be used to learn a network embedding which respects some topological constraints. While DeepWalk (Perozzi et al., 2014) only uses structural information, models proposed in (Nguyen et al., 2018) or (Shi et al., 2018) include temporal constraints in random walks to sample feasible trajectories w.r.t. observed diffusion episodes. The approach DeepCas from (Li et al., 2016) applies this idea for the prediction of diffusion cascades. However, such approaches require a graph of diffusion relations as input, which is not always available (and not always representative of the true diffusion channels of the considered network as reported in (Ver Steeg & Galstyan, 2013)). In our work, we consider that no graph is available beforehand. Moreover, no inference process is introduced in DeepCas to sample trajectories from their posterior probabilities given the observed diffusion sequences. The sampling of trajectories is performed in an initialization step, before learning the parameters of the diffusion model.

In this paper, we propose the first Bayesian topological recurrent neural network for sequences with tree dependencies, which we apply to the modelling of diffusion cascades. Rather than building on a preliminary random walk process, the idea is to consider trajectory inference during learning, in order to converge to better representations of the infected nodes. Following the stochastic nature of diffusion, the model infers trajectory distributions from observations of infections, which are in turn used for the inference of infection probabilities in an iterative learning process. Our probabilistic model, based on the well-known continuous-time independent cascade model (CTIC) (Saito et al., 2009), is able to extract full paths of diffusion from sequential observations of infections via black-box inference, which has multiple applications in the field. Our experiments validate the potential of the approach for modeling and prediction purposes.

The remainder of the paper is structured as follows. Section 2 presents some background and notations of the approach. Section 3 presents the proposed model. Section 4 reports experimental results of the approach compared to various baselines.

2 Background

2.1 Information Diffusion

Information diffusion is observed as a set of diffusion episodes . Classically, episodes considered in this paper only contain the first infection event of each node (the earliest time a content reached the node). Let be a set of nodes, standing for the world node, used to model influences from external factors or spontaneous infections (as done in (Gruhl et al., 2004) for instance). A diffusion episode reports the diffusion of a given content in the network as a sequence of infected nodes and a set of infection time-stamps for all . is ordered chronologically w.r.t. the infection time-stamps . Thus, corresponds to the -th infected node in for all , with the number of infected nodes in the diffusion. Every episode in starts with the world node (i.e., for all episodes ). We note the infection time-stamp in for any node in , for nodes not infected in . Time-stamps are relative to , arbitrarily set to in the data. Note that we also set for every episode . In the following, denotes the infected node in with its associated time-stamp.

Cascades are richer structures than diffusion episodes, since they describe how a given diffusion happened. The cascade structure stores the first transmission event that succeeded from any node to each infected node . Thus, a cascade corresponds to a transmission tree rooted in and reaching nodes in during the diffusion, according to a sequence of infector indices in : for any and any , equals iff was infected by in the diffusion (i.e., is the infector of ). We arbitrarily set (no infector for the world node). Note that is always equal to , since there is no other candidate for being the infector of than the world . For convenience, we note the infector of for . Note that the cascade structures respect that for every and all (the infector of a node is necessarily a node that was infected before ). Several different cascade structures are possible for a given diffusion episode. Cascade models usually make assumptions on these latent diffusion structures to learn the diffusion parameters.

2.2 Cascade Models

The Independent Cascade model (IC) (Goldenberg et al., 2001) considers the spread of diffusion as cascades of infections over the network. We focus in this paper on cascade models such as IC, which tend to reproduce realistic temporal diffusion dynamics on social networks (Guille et al., 2013). The classical version of IC is an iterative process in which, at each iteration, every newly infected node gets a unique chance to infect any other node of the network with some probability. The process iterates while new infections occur. Since the expectation-maximization algorithm was proposed in (Saito et al., 2008) to learn its parameters, IC has been at the heart of diffusion works.

However, in real life, diffusion occurs in continuous time, not in discrete steps as assumed in IC. (Lamprier et al., 2016) proposed DAIC, a delay-agnostic version of IC, where diffusion between nodes is assumed to follow uniform delay distributions rather than occurring in successive discrete time-steps. A representation learning version of DAIC has been proposed in (Bourigault et al., 2016), where nodes are projected in a continuous space in such a way that the distance between node representations reflects the probability that diffusion occurs between them. This allowed the authors to obtain a more compact and robust model than the former version of DAIC. Beyond uniform time delay distributions, two main works deal with continuous-time diffusion. NetRate (Gomez-Rodriguez et al., 2011) learns parametric time-dependent distributions to best fit observed infection time-stamps. Like NetRate, CTIC (Saito et al., 2009) uses exponential distributions to model delays of diffusion between nodes, but rather than a single parameter for each possible relationship, delays and influence factors are considered as separate parameters, which leads to more freedom w.r.t. observed diffusion tendencies. Delay and influence parameters are learned jointly by an EM-like algorithm. Note the continuous-time cascade model extension in (Zhang et al., 2018), which embeds users in a representation space so that their relative positions both respect some structural community properties and can be used to explain infection time-stamps of users.

In our model we consider that diffusion probabilities from any infected node depend on a latent state associated to , which embeds the past trajectory of the diffused content. This state depends on the state of the node who first transmitted the content to . Therefore, we need to rely on a continuous-time model such as CTIC (Saito et al., 2009), which serves as a basis for our work. In CTIC, two parameters are defined for each pair of nodes in the network: , which corresponds to the probability that node succeeds in infecting , and , which corresponds to a time-delay parameter used in an exponential distribution when infects . If succeeds in infecting in an episode , is infected at time , where . These parameters are learned via maximizing the following likelihood on a set of episodes :


where stands for the probability that is infected at by previous infected nodes in and is the probability that is not infected by any infected node in .

We build on this in the following, but rather than considering a pair of parameters and for each pair of nodes (which implies parameters to store), we propose to consider neural functions which output the corresponding parameters according to the hidden state of the emitter , depending on its ancestor branch in the cascade, and a continuous embedding of the receiver .
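As a rough illustration of the plain CTIC likelihood discussed above (before any recurrent state is introduced), the following minimal sketch evaluates the log-likelihood of a single episode. It assumes the standard CTIC form: a transmission density `k * r * exp(-r * delta)` and a matching survival term, with the "not infected" factor approximated for a long observation window. The function name `ctic_episode_loglik` and the array names `k`, `r` and `times` are ours, chosen for illustration only.

```python
import numpy as np

def ctic_episode_loglik(times, k, r):
    """Log-likelihood of one episode under plain CTIC (illustrative sketch).

    times[v]: infection time of node v (np.inf if never infected).
    Node 0 plays the role of the world node, infected at time 0;
    other infections are assumed to happen at strictly positive times.
    k[u, v]: transmission probability, r[u, v]: exponential delay rate.
    """
    n = len(times)
    infected = [v for v in range(n) if np.isfinite(times[v])]
    loglik = 0.0
    for v in range(1, n):
        if np.isfinite(times[v]):
            # Candidate infectors: nodes infected strictly before v.
            cands = [u for u in infected if times[u] < times[v]]
            delta = np.array([times[v] - times[u] for u in cands])
            kk, rr = k[cands, v], r[cands, v]
            a = kk * rr * np.exp(-rr * delta)        # u transmits to v exactly at t_v
            b = 1.0 - kk + kk * np.exp(-rr * delta)  # u has not transmitted to v before t_v
            # Density of "v infected at t_v": sum_u a_u * prod_{x != u} b_x
            #                               = prod_x b_x * sum_u (a_u / b_u)
            loglik += np.log(b).sum() + np.log((a / b).sum())
        else:
            # Long-observation approximation: v escaped every infected node.
            loglik += np.log(1.0 - k[infected, v]).sum()
    return loglik
```

With a single world node and one target, the value reduces to `log(k * r * exp(-r * t))` when the target is infected at time `t`, and to `log(1 - k)` when it is not, which is a quick sanity check of the two branches.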

3 Recurrent Neural Diffusion Model

This section first presents our recurrent generative model of cascades. Then, it details the proposed learning algorithm.

3.1 Recurrent Diffusion Model

As discussed above, we consider that each infected node in an episode owns a state depending on the path the content followed to reach in , with the dimension of the representation space. Knowing the state of the node that first infected , the state is computed as:


with a function, with parameters , that transforms the state of according to a shared representation of the node . This function can either be an Elman RNN cell, a multi-layer perceptron (MLP) or a Gated Recurrent Unit (GRU). An LSTM could also be used here, but should include both the cell and the state of in that case.

Given a state for an infected node in , the probability that succeeds in transmitting the diffused content to is given by:



stands for the sigmoid function and an embedding of size for any node .

Similarly to the CTIC model, if a node succeeds in infecting another node , the delay of infection depends on an exponential distribution with parameter . To simplify learning, we assume that the delay of infection does not depend on the history of diffusion, only the probability of infection does. Thus, for a given pair , the delay parameter is the same for every episode :


with the absolute value of a real scalar and and correspond to two embeddings of size for any node , for the source of the transition and for its destination in order to enable asymmetric behavior.

We set as the parameters of our model. The generative process, similar to the one of CTIC, is given in Algorithm 1 in the appendix. In this process, the state of the initial node, the world node , is a parameter vector to be learned. The process iterates while there remain nodes in a set of infectious nodes (initialized with ). At each iteration, the process selects the infectious node with minimal time-stamp of infection, removes it from the set of infectious nodes, records it as infected and attempts to infect each non-infected node according to the probability . If it succeeds, a time is sampled for from an exponential distribution with parameter . If the new time for is lower than its current time (initialized with ), this new time is stored in , is added to the set of infectious nodes and its new state is computed according to its new infector .
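The generative process described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: the state update is a simple `tanh` cell standing in for the recurrent function (the paper reports using a GRU), the transmission probability is taken as the sigmoid of a dot product between the emitter state and a receiver embedding, and the delay rate as the absolute value of a dot product between source and destination embeddings. All names (`simulate_cascade`, `W`, `node_emb`, `recv_emb`, `src_emb`, `dst_emb`) are hypothetical.

```python
import heapq
import numpy as np

def simulate_cascade(z0, W, node_emb, recv_emb, src_emb, dst_emb, rng):
    """Sample one cascade; node 0 is the world node with learned state z0."""
    n = node_emb.shape[0]
    times = np.full(n, np.inf)      # tentative infection times
    infector = np.full(n, -1)       # index of the current infector of each node
    states = {0: z0}                # hidden state of each (tentatively) infected node
    done = np.zeros(n, dtype=bool)  # nodes already recorded as infected
    times[0] = 0.0
    heap = [(0.0, 0)]               # infectious nodes, ordered by time-stamp
    order = []
    while heap:
        t_u, u = heapq.heappop(heap)
        if done[u] or t_u > times[u]:
            continue                # stale entry: u was re-scheduled earlier
        done[u] = True
        order.append(u)
        for v in range(1, n):
            if done[v]:
                continue
            # Assumed form: probability depends on the emitter's state.
            k_uv = 1.0 / (1.0 + np.exp(-(states[u] @ recv_emb[v])))
            if rng.random() >= k_uv:
                continue
            # Assumed form: history-independent delay rate (epsilon avoids a zero rate).
            r_uv = abs(src_emb[u] @ dst_emb[v]) + 1e-8
            t_v = t_u + rng.exponential(1.0 / r_uv)
            if t_v < times[v]:
                times[v], infector[v] = t_v, u
                # Stand-in for the recurrent cell: new state from infector state and node.
                states[v] = np.tanh(W @ np.concatenate([states[u], node_emb[v]]))
                heapq.heappush(heap, (t_v, v))
    return order, times, infector
```

A priority queue with lazy deletion is one convenient way to "select the infectious node with minimal time-stamp"; a node whose time is lowered simply gets a second heap entry, and the outdated one is skipped when popped.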

3.2 Learning the Model

As in CTIC, we need to define the probability that the node infects the node at time with our model. Given a state for in , we have:


Also, the probability that does not infect before given a state for in is:


The probability density that node is infected at time given a set of states for all nodes infected before is:


where is the state of node in . Similarly, the probability density that node is not infected in at the end of observation time given a set of states for all nodes infected in is:


where the approximation is done assuming a sufficiently long observation period.

The learning process of our model is based on likelihood maximization, similarly to maximizing eq. 1 in the classical CTIC model. However, in our case the infection probabilities depend on hidden states associated with the infected nodes. Since observations only contain infection time-stamps, this requires marginalizing over every possible sequence of ancestors for every :


where is the set of all possible ancestor sequences for . corresponds to the joint probability of the episode and an ancestor sequence . Taking would lead to an intractable computation of using our recurrent cascade model, since it would require estimating the probability of any infection in according to the full ancestor sequence. Fortunately, using the Bayesian chain rule, the joint probability can be written as:


where corresponds to the sequence of the first components of (the first components in with their associated time-stamps) and stands for the vector containing the first components of . We have for every : , where is a set containing the states of the first infected nodes in , which can be deduced from and using equation 2. We also have: . The probability is the conditional probability that was the node that first infected , given all the previous infection events and the fact that was infected at by one of the previously infected nodes in . It can be obtained, according to formula 8, via:


with the infector of stored in .
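The conditional infector probability described above has a simple closed form in CTIC-like models: each candidate is weighted by its transmission density divided by its survival term, and the weights are normalized. The sketch below assumes that form; `infector_posterior` and its argument names are ours, and `k` is meant to hold the state-dependent transmission probabilities of the candidates toward the newly infected node.

```python
import numpy as np

def infector_posterior(t_v, cand_times, k, r):
    """P(candidate j infected v | v was infected at t_v by one of the candidates).

    cand_times[j]: infection time of candidate j (all earlier than t_v).
    k[j], r[j]: transmission probability and delay rate from candidate j to v.
    """
    delta = t_v - np.asarray(cand_times, dtype=float)
    a = k * r * np.exp(-r * delta)        # j transmits exactly at t_v
    b = 1.0 - k + k * np.exp(-r * delta)  # j has not transmitted before t_v
    # Joint of "j is the infector" and "v infected at t_v" is a_j * prod_{x != j} b_x,
    # so after normalization only the ratios a_j / b_j matter.
    w = a / b
    return w / w.sum()
```

An ancestor can then be drawn with, for example, `rng.choice(len(p), p=p)` where `p` is the returned vector and `rng` a NumPy `Generator`; repeating this at each infection yields one sampled ancestor vector for the episode.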

Unfortunately, the log-likelihood from formula 10 is still particularly difficult to optimize directly, since it requires considering every possible vector for each training episode at each optimization iteration. Moreover, the probability products in formula 11 would lead to zero gradients because of the limits of floating-point representation. Therefore, we need to define an approach where the optimization can be done via trajectory sampling. Different choices would be possible. First, MCMC approaches such as Gibbs Sampling EM could be used, but they require sampling from the posteriors of the full trajectories of the cascades, which is very unstable and complex to perform. The full computation of the posterior distributions could be avoided by using simpler proposal distributions (as done for instance via importance sampling with auxiliary variables in (Farajtabar et al., 2015) for diffusion source detection), but this would face a very high variance in our case. Another possibility is to adopt a variational approach (Kingma & Welling, 2013), where an auxiliary distribution is learned for the inference of the latent variables. As done in (Krishnan et al., 2016) for the inference in sequences, a smoothing strategy could be developed by relying on a bi-directional RNN that would consider past and future infections for the inference of the ancestors of nodes via for every infected node in an episode . However, learning the parameters of such a distribution is particularly difficult (episodes of different lengths, cascades considered as sequences, etc.). Another possibility for smoothing would be to define an independent distribution for every episode and every infection . However, this induces a huge number of variational parameters, increasing with the size of the training set (linearly in the number of training episodes and quadratically in the size of the episodes). Thus, we propose to rather rely on the conditional distribution of ancestors given the past for sampling (i.e., ), which corresponds to a filtering inference process.

From the Jensen inequality on concave functions, we get for a given episode :


where . This leads to a lower bound of the log-likelihood, which corresponds to an expectation from which it is easy to sample: at each new infection of a node in an episode , we can sample from a distribution depending on the past only. Maximizing this lower bound (also called the ELBO) encourages the process to choose trajectories that best explain the observed episode. To maximize it via stochastic optimization, we rely on the score function estimator (Ranganath et al., 2014), which leverages the derivative of the log-function (the identity $\nabla \log f = \nabla f / f$) to express the gradient as an expectation from which we can sample. Another possibility would have been to rely on the Gumbel-Softmax and the Concrete distribution with reparametrization, as done in (Maddison et al., 2016), but we observed much better results using the log-trick. The gradient of the ELBO function for all the episodes is given by:


where is a shortcut for and is a moving-average baseline of the ELBO per training episode, used to reduce the variance (taken over the 100 previous training epochs in our experiments). This stochastic gradient formulation yields unbiased steepest ascent directions despite the need to sample the ancestor vectors for the computation of the node states (with the replacement of expectations by averages over samples for each episode). It contains two terms: while the first one encourages high conditional probabilities for ancestors that maximize the likelihood of the full episodes, the second one improves the likelihood of the observed infections given the past of the sampled diffusion path.
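For readers who want the bound and its gradient in one place, the argument above can be summarized as follows. The notation is assumed for this summary rather than taken from the original formulas: $D_i$ is the $i$-th infection event of episode $D$, $I_i$ its infector index, $q(I \mid D) = \prod_i p(I_i \mid D_{\le i}, I_{<i})$ the filtering distribution used for sampling, $\theta$ the model parameters and $b$ the baseline. Factors accounting for nodes that are never infected are left implicit in the likelihood terms.

```latex
\begin{align*}
\log p(D)
  &= \log \sum_{I} q(I \mid D)\, \frac{p(D, I)}{q(I \mid D)}
   \;\ge\; \mathbb{E}_{I \sim q(\cdot \mid D)}
      \big[ \log p(D, I) - \log q(I \mid D) \big] \\
  &= \mathbb{E}_{I \sim q(\cdot \mid D)}
      \Big[ \underbrace{\textstyle\sum_i \log p(D_i \mid D_{<i}, I_{<i})}_{R(D, I)} \Big],
  \qquad \text{since } p(D, I) = \prod_i p(D_i \mid D_{<i}, I_{<i})\, p(I_i \mid D_{\le i}, I_{<i}), \\[4pt]
\nabla_\theta\, \mathbb{E}_{q}\big[ R(D, I) \big]
  &= \mathbb{E}_{I \sim q(\cdot \mid D)}
      \Big[ \big( R(D, I) - b \big)\, \nabla_\theta \log q(I \mid D)
            \;+\; \nabla_\theta R(D, I) \Big].
\end{align*}
```

The baseline $b$ leaves the expectation unchanged because $\mathbb{E}_q[\nabla_\theta \log q(I \mid D)] = 0$; it only affects the variance of the Monte Carlo estimate.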

The optimization is done using the ADAM optimizer (Kingma & Ba, 2014) over mini-batches of episodes ordered by length to avoid padding ( and in our experiments). Our full efficient algorithm is given in the appendix.
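To make the two-term stochastic gradient concrete, here is a toy sketch of a score function estimator with a baseline. It is not the paper's algorithm: a single categorical variable with distribution `softmax(theta)` stands in for the sampled ancestor vector, and a hypothetical reward `-(phi - centers[i])**2` stands in for the episode log-likelihood under the sampled path. All names are ours.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def elbo_gradient_estimate(theta, phi, centers, n_samples, rng, baseline=0.0):
    """Monte Carlo gradient of E_{i ~ softmax(theta)}[R(i, phi)] (toy stand-in).

    Returns the estimated gradients w.r.t. theta and phi, and the mean reward,
    which can feed a moving-average baseline for the next iteration.
    """
    p = softmax(theta)
    idx = rng.choice(len(p), size=n_samples, p=p)
    reward = -(phi - centers[idx]) ** 2
    # grad_theta log q(i) for a softmax: one-hot(i) - p, one row per sample.
    score = np.eye(len(p))[idx] - p
    # First term: (reward - baseline) * score, pushes mass toward good samples.
    g_theta = ((reward - baseline)[:, None] * score).mean(axis=0)
    # Second term: direct gradient of the reward along the sampled values.
    g_phi = (-2.0 * (phi - centers[idx])).mean()
    return g_theta, g_phi, reward.mean()
```

In a training loop, the returned mean reward could be appended to a fixed-length window (the text above mentions the 100 previous epochs) whose average is passed as `baseline` at the next step; any constant baseline keeps the estimate unbiased.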

4 Experiments

4.1 Setup

We perform experiments on two artificial and three real-world datasets:

  • Arti1: Episodes generated on a scale-free random graph of 100 nodes. The generation process follows the CTIC model. But rather than only one transmission probability parameter per edge, we set 5 different depending on the diffusion nature. Before each simulation a number is sampled, which determines the parameters to use. 10000 episodes for training, 5000 for validation, 5000 for testing. Mean length of the episodes=7.55 (stdev=5.51);

  • Arti2: Episodes sampled on the same graph as Arti1, also with CTIC but where each is a function of the transmitted content and the features of the receiver . A content is sampled from a Dirichlet with parameter before each simulation and the sigmoid of the dot product between this content and the edge features determines the transmission probabilities. Features of the hub nodes (nodes with a degree greater than 30) are sampled from a Dirichlet with (multi-content nodes), while those of other nodes are sampled from a Dirichlet with (content-specific nodes). 10000 episodes for training, 5000 for validation, 5000 for testing. Mean length of the episodes=6.89 (stdev=7.7).

  • Digg: Data collected from the Digg stream API during one month. Infections are the “diggs” posted by users on links published on the media. We kept the 100 most active users from the collected data. 20000 episodes for training, 5000 for validation, 5000 for testing. Mean length of the episodes=4.26 (stdev=9.26).

  • Weibo: Retweet cascades extracted from the Weibo microblogging website using the procedure described in (Leskovec et al., 2009). The dataset was collected by (Fu et al., 2013). 4000 nodes, 45000 episodes for training, 5000 for validation, 5000 for testing. Mean length of episodes=4.58 (stdev=2.15).

  • Memetracker: The memetracker dataset described in (Leskovec et al., 2009) contains millions of blog posts and news articles. Each website or blog stands as a user, and we use the phrase clusters extracted by Leskovec et al. (2009) to trace the flow of information. 500 nodes, 250000 for training, 5000 for validation, 5000 for testing. Mean length of episodes=8.68 (stdev=11.45).

We compare our model recCTIC to the following temporal diffusion baselines:

  • CTIC: the Continuous-Time Independent Cascade model in its original version (Saito et al., 2009);

  • RNN: the Recurrent Marked Temporal Point Process model (RMTPP) from (Du et al., 2016), where episodes are considered as sequences that can be treated with a classical RNN outputting at each step the probability distributions of the next infected node and its time-stamp;

  • CYAN: Similar to RMTPP but with an attention mechanism to select previous states (Wang et al., 2017b);

  • CYAN-cov: The same as CYAN but with a more sophisticated attention mechanism using a history of attention states, to give more weight to important nodes;

  • DAN: the attention model described in (Wang et al., 2018). It is very similar to CYAN but uses a pooling mechanism rather than a recurrent network to consider the past in the predictions. In the version of (Wang et al., 2018), the model only predicts the next infected node at each step, not its time of infection. To enable a comparison with the other models, we extended it by adding a time prediction mechanism similar to the temporal head of CYAN.

  • EmbCTIC: a version of our model where the node state is replaced in the diffusion probability computation (eq. 3) by a static embedding for the source (similarly to the formulation of the delay parameter in eq. 4). This corresponds to an embedded version of CTIC, similarly to the embedded version of DAIC from (Bourigault et al., 2016).

Note that to adapt baselines based on RNN for diffusion modeling and make them comparable to cascade-based ones, we add an “end node” at the end of each episode before training and testing them. In this way, these models are able to model the end of the episodes by predicting this end node as the next event (with no time-delay prediction for this node, however).

Our model and the baselines were tuned by a grid search process on a validation set for each dataset (although the best hyper-parameters obtained for Arti1 remained near optimal for the other ones). For every model with an embedding space (i.e., all except CTIC), we set its dimension to (larger dimensions induce a more difficult convergence of the learning without significant gain in accuracy). The reported results for our model use a GRU module as the recurrent state transformation function .

We evaluate our models on three distinct tasks:

  • Diffusion modelling: the performances of the methods are reported in terms of negative log-likelihood of the test episodes (i.e., ). Lower values denote models that are less surprised by test episodes than others, reflecting their generalization ability. The NLL measure depends on the model, but for each it reflects the probability of an episode to be observed according to the model, both on which nodes are eventually infected and at what time. For our model, which has to sample trajectories, the NLL is approximated via importance sampling by considering computed on infector vectors sampled from . We used in our experiments;

  • Diffusion generation: the models are compared on their ability to simulate realistic cascades. The aim is to predict the marginal probabilities of nodes to be eventually infected. The results are reported in terms of Cross-Entropy (CE) taken over the whole set of nodes for each episode: , where stands for the indicator function returning 1 if its argument is true, 0 otherwise. is estimated via Monte-Carlo simulations (following the generation process of the models and counting the rate of simulations in which is infected). 1000 simulations are performed for each test episode in our experiments.

  • Diffusion path prediction: the models are assessed on their ability to choose the true infectors in observed diffusion episodes. This is only considered on the artificial datasets, for which we have the ground truth on who infected whom. The INF measure corresponds to the expectation of the rate of true infectors chosen by the models: , with the true infector of the -th infected node in the episode . For RNN, there is no selection mechanism, so it is excluded from the results for this measure. For models with attention (CYAN and DAN), we consider the attention weights as selection probabilities. For cascade-based models which explicitly model this, we directly use the corresponding probability . In our model, this corresponds to an expectation over previous infectors in the cascade (i.e., ), with the -th sampled vector from .
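As an illustration of the diffusion generation measure, the sketch below estimates infection marginals from simulated cascades and scores them against an observed episode with a binary cross-entropy. The averaging over nodes and the clipping constant `eps` are our assumptions (the exact normalization of the CE measure is not recoverable from the text above), and the function name is hypothetical.

```python
import numpy as np

def infection_cross_entropy(simulated_sets, true_infected, n_nodes, eps=1e-6):
    """Cross-entropy between Monte Carlo infection marginals and one episode.

    simulated_sets: iterable of sets of node indices, one per simulated cascade.
    true_infected: set of node indices infected in the observed episode.
    """
    simulated_sets = list(simulated_sets)
    counts = np.zeros(n_nodes)
    for s in simulated_sets:
        counts[np.array(sorted(s), dtype=int)] += 1.0
    # Rate of simulations in which each node ends up infected,
    # clipped (our choice) so that the logarithms stay finite.
    p = np.clip(counts / len(simulated_sets), eps, 1.0 - eps)
    y = np.zeros(n_nodes)
    y[np.array(sorted(true_infected), dtype=int)] = 1.0
    # Averaged over nodes here; a sum over nodes is an equally plausible convention.
    return float(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)).mean())
```

The simulated sets would typically come from repeated runs of a model's generative process, conditioned on the initially observed infections when the evaluation setting provides some.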

For each task, we report results with different amounts of initial observations from test episodes: infections that occurred before a given delay from the start of the episode are given as input to the models, from which they infer internal representations, and evaluation measures are computed on the remainder of the episode. In tables 1 to 4, 0 means that nothing was initially observed: the models are not conditioned on the start of the episodes. 1 means that infections at the first time-stamp are known beforehand; prediction and modeling results thus concern time-stamps greater than 1 (models are conditioned on diffusion sources). 2 and 3 mean that infections that occurred respectively before a delay of and a delay of from the start of the episode are known and used to condition the models. Details on how we condition our model and the baselines w.r.t. starts of episodes are given in the appendix.

4.2 Results

Arti1

NLL            0       1       2       3
rnn          25.99   23.9    16.03   11.35
cyan         27.85   25.82   17.71   12.67
cyan-cov     26.68   24.58   16.64   11.84
dan          24.77   22.69   14.96   10.5
ctic         21.00   18.56   10.81    7.48
embCTIC      20.96   18.53   10.78    7.49
recCTIC      19.62   14.42    9.87    6.87

CE             0       1       2       3
rnn           0.47    0.28    0.18    0.14
cyan          0.40    0.28    0.21    0.16
cyan-cov      0.50    0.27    0.18    0.14
dan           0.40    0.93    0.69    0.44
ctic          0.31    0.36    0.65    0.45
embCTIC       0.47    0.35    0.25    0.18
recCTIC       0.31    0.23    0.12    0.08

INF            0       1       2       3
cyan          0.38    0.26    0.29    0.32
cyan-cov      0.39    0.27    0.29    0.32
dan           0.42    0.26    0.22    0.21
ctic          0.64    0.54    0.55    0.58
embCTIC       0.62    0.51    0.49    0.51
recCTIC       0.65    0.56    0.58    0.62

Arti2

NLL            0       1       2       3
rnn          19.72   14.55   11.78    7.65
cyan         18.78   13.62   11.04    7.09
cyan-cov     19.55   14.35   11.56    7.41
dan          18.71   13.49   10.86    6.84
ctic         20.08   14.26   10.18    5.99
embCTIC      19.90   14.13   10.11    5.97
recCTIC      17.39   11.59    8.31    4.97

CE             0       1       2       3
rnn          79.0    21.0    14.2     9.3
cyan         86.9    19.6    14.4     9.4
cyan-cov     91.7    27.8    19.7    12.4
dan          97.9    98.6    75.7    43.8
ctic         63.1    23.2    24.7    21.8
embCTIC      65.8    22.6    17.2    12.1
recCTIC      68.7    15.9    11.2     8.3

INF            0       1       2       3
cyan          0.30    0.15    0.11    0.11
cyan-cov      0.29    0.14    0.10    0.10
dan           0.42    0.31    0.23    0.20
ctic          0.73    0.67    0.61    0.58
embCTIC       0.73    0.67    0.61    0.58
recCTIC       0.90    0.88    0.86    0.84
Table 1: Results on the artificial datasets.

Results on the two artificial datasets are given in table 1. While our approach shows significant improvements over other models for NLL and CE results on the Arti1 dataset (except for CE with weak conditioning), its potential is fully exhibited on the Arti2 dataset, where embedding the history to predict the future of the diffusion appears to be of great importance. Indeed, in this dataset, there exist some hub nodes through which most diffusion episodes transit, whatever the nature of the diffusion. In that case, the path of the diffusion contains very useful information that can be leveraged to predict infections after the hub node: the infection of the hub node is a necessary condition for the infection of its successors, but not a sufficient one, since this node is triggered in various kinds of situations. Depending on who transmitted the content to it, different successors are then infected. Markovian cascade models such as CTIC or embCTIC cannot model this, since infection probabilities only depend on disjoint past events of infection, not on the paths taken by the propagated content. RNN-based models are better armed for this, but their performance is undermined by the way they aggregate past information. The attention mechanisms of CYAN and DAN attempt to overcome this, but they appear quite unstable, with errors accumulating through time. Our approach appears as an effective compromise between both worlds: it embeds the past as RNN approaches do, while maintaining the Bayesian cascade structure of the data. Its results on the INF measure are the best among the compared approaches, by a narrow margin on Arti1 (0.65 against 0.64 for CTIC without conditioning) and by a wide one on Arti2 (0.90 against 0.73). The Arti2 results highlight its very good ability to uncover the dynamics of diffusion, even for such complex problems with strong entanglement between diffusions of different natures.
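The hub effect can be illustrated with a toy, count-based example. The episodes below are invented for illustration (they are not the Arti2 data): content coming from node "a" crosses hub "h" and reaches "c", while content coming from "b" crosses the same hub and reaches "d". Conditioning only on the infection of the hub gives an uninformative estimate, whereas conditioning on the hub's infector recovers the successor exactly.

```python
from collections import Counter, defaultdict

# Hypothetical episodes written as infection paths (infector, hub, successor).
paths = [("a", "h", "c")] * 50 + [("b", "h", "d")] * 50

given_hub = Counter()                 # successor counts given the hub only
given_infector = defaultdict(Counter)  # successor counts given the hub's infector
for infector, hub, successor in paths:
    given_hub[successor] += 1
    given_infector[infector][successor] += 1

# Markovian view: only the fact that the hub is infected is known.
p_c_given_hub = given_hub["c"] / sum(given_hub.values())
# Path-aware view: the node that infected the hub is also known.
p_c_given_a = given_infector["a"]["c"] / sum(given_infector["a"].values())
p_c_given_b = given_infector["b"]["c"] / sum(given_infector["b"].values())

print(p_c_given_hub, p_c_given_a, p_c_given_b)
```

In this toy setting the path-aware estimates are 1 and 0, while the estimate based on the hub alone stays at one half.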

The good behavior of the approach is observed not only on the artificial datasets, which have been generated by the cascade model of CTIC on a graph of relationships, but also on real-world datasets. Tables 2 to 4 report results on the three real-world datasets. In these tables, we observe that RNN-based approaches have more difficulty modeling test episodes than cascade-based ones. The attention mechanisms of the CYAN and DAN approaches sometimes allow them to get closer to the cascade-based results (especially on Digg), but their performance varies widely from one dataset to another. These methods are good at the task they were initially designed for, predicting the directly next infection (this was observed in our experiments), but not at modeling or long-term prediction. This is a strong limitation, since the directly next infection does not help much to understand the dynamics and to predict the future of a diffusion. Our approach obtains better results than all other methods in most settings, especially for the dynamics modeling task (NLL), though infection prediction results (CE) are also usually good compared with its competitors. Interestingly, while embCTIC usually beats CTIC, recCTIC often obtains even better results. This confirms that the history of nodes in the diffusion is of great importance for capturing the main dynamics of the network. Thanks to the black-box inference process and the recurrent mechanism of our proposal, the propositional distribution is encouraged to resemble the conditional probability of the full ancestors vector. Judging from the results, the inference process appears to have converged toward useful trajectories. This enables the model to adapt its distributions to the diffusion trajectory during learning. It also allows the model to simulate cascades that are more consistent with the sources of diffusion.

NLL         0       1       2       3
rnn         31.03   26.18   24.15   23.02
cyan        16.82   11.56   9.32    8.03
cyan-cov    21.36   16.83   14.50   13.31
dan         18.47   13.52   11.43   10.29
ctic        27.70   22.07   19.20   17.74
embCTIC     15.98   10.31   7.92    6.75
recCTIC     15.67   10.30   7.86    6.74

CE          0       1       2       3
rnn         0.46    0.28    0.24    0.22
cyan        0.43    0.25    0.21    0.19
cyan-cov    0.43    0.23    0.20    0.17
dan         0.44    0.25    0.22    0.19
ctic        0.45    0.31    0.20    0.16
embCTIC     0.43    0.31    0.21    0.17
recCTIC     0.43    0.27    0.17    0.14
Table 2: Negative Log-Likelihood (NLL) and Cross Entropy of Infections (CE) on Digg.

NLL         0       1       2       3
rnn         27.58   28.98   18.62   17.15
cyan        29.59   30.04   30.04   18.79
cyan-cov    27.50   29.12   29.12   18.55
dan         32.35   25.02   21.97   20.39
ctic        23.92   17.88   13.31   12.28
embCTIC     24.71   18.02   13.58   12.39
recCTIC     21.72   14.08   11.29   10.34

CE          0       1       2       3
rnn         0.59    0.37    0.30    0.28
cyan        0.59    0.37    0.31    0.29
cyan-cov    0.59    0.36    0.30    0.28
dan         0.58    0.26    0.26    0.24
ctic        0.58    0.25    0.31    0.28
embCTIC     0.59    0.26    0.30    0.28
recCTIC     0.59    0.24    0.21    0.20
Table 3: Negative Log-Likelihood (NLL) and Cross Entropy of Infections (CE) on Weibo.

NLL         0       1       2       3
rnn         112.3   118.2   110.4   103.6
cyan        115.2   113.1   109.2   102.1
cyan-cov    95.58   95.05   93.64   90.20
dan         91.70   89.97   86.19   78.91
ctic        52.70   55.54   48.48   44.33
embCTIC     54.18   52.29   49.68   45.15
recCTIC     50.11   49.34   48.35   42.20

CE          0       1       2       3
rnn         1.68    1.66    1.59    1.51
cyan        1.66    1.64    1.59    1.49
cyan-cov    1.61    1.59    1.52    1.39
dan         1.59    1.58    1.58    1.44
ctic        1.33    1.68    1.60    1.46
embCTIC     1.59    1.66    1.57    1.39
recCTIC     1.51    1.60    1.49    1.36
Table 4: Negative Log-Likelihood (NLL) and Cross Entropy of Infections (CE) on Memetracker.

5 Conclusion

We proposed a recurrent cascade-based diffusion modeling approach, which is at the crossroads of cascade-based and RNN approaches. It leverages the best of both worlds, with an ability to embed the history of diffusion for prediction while still capturing the tree dependencies underlying the diffusion in networks. Results validate the approach for both modeling and prediction tasks.

In this work, we based the sampling of trajectories on a filtering approach, where only past observations are considered for the inference of the infector of a node. Ongoing work concerns the development of an inductive variational distribution that relies on whole observed episodes for inference.


6 Appendix

6.1 Joint Probability

In this section, we detail the derivation of the joint probability whose formulation is given in equation 11. For each infected node, taken at its position in the episode, we need to compute:

  • the probability that this node is infected at its infection time, given the nodes previously infected in the episode and the states associated with these nodes;

  • the probability of its ancestor index, given this infection and the previous infections together with their states;

  • the probability that the nodes that are not infected in the episode are indeed not infected by this node, given its state.

This gives:
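For intuition, these three factors can be evaluated numerically. The sketch below does so for a plain CTIC with fixed transmission probabilities `k` and delay rates `r`, ignoring node states, which is a simplifying assumption: in the model of this paper these quantities depend on the state of the infector. Under this assumption, the product of the first two factors for a node reduces to the density that its ancestor infects it exactly at its infection time, times the probabilities that the other previously infected nodes have not infected it before that time. The function names, the dictionary-based parameters and the assumption that every previously infected node is a candidate infector are hypothetical choices made for the example.

```python
import math


def log_infect_at(k, r, dt):
    """Log-density that an infector transmits after exactly a delay dt (standard CTIC)."""
    return math.log(k) + math.log(r) - r * dt


def log_no_infect_before(k, r, dt):
    """Log-probability that an infector has not transmitted within a delay dt."""
    return math.log(1.0 - k + k * math.exp(-r * dt))


def log_never_infect(k):
    """Log-probability that an infector never transmits to a given node."""
    return math.log(1.0 - k)


def log_joint(episode, ancestors, not_infected, k, r):
    """Log joint probability of an episode and its ancestor vector.

    episode: list of (node, time) sorted by time.
    ancestors[i]: position in the episode of the infector of the i-th node,
                  or None for a source.
    not_infected: nodes never infected in the episode.
    k, r: dicts mapping (u, v) pairs to a probability and a delay rate.
    """
    total = 0.0
    for i, (v, t_v) in enumerate(episode):
        if ancestors[i] is not None:
            for j in range(i):
                u, t_u = episode[j]
                if j == ancestors[i]:
                    total += log_infect_at(k[u, v], r[u, v], t_v - t_u)
                else:
                    total += log_no_infect_before(k[u, v], r[u, v], t_v - t_u)
        for w in not_infected:
            total += log_never_infect(k[v, w])
    return total


nodes = ["a", "b", "c"]
k = {(u, v): 0.5 for u in nodes for v in nodes if u != v}
r = {(u, v): 1.0 for u in nodes for v in nodes if u != v}
episode = [("a", 0.0), ("b", 1.0)]
value = log_joint(episode, ancestors=[None, 0], not_infected=["c"], k=k, r=r)
print(value)
```

In this small example, "b" is infected by "a" after a delay of 1, and "c" is infected by neither of them.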


6.2 Generation Process

The generation process of our model is given in algorithm 1. The process iterates while some nodes remain in a set of infectious nodes (initialized with ). denotes the concatenation of two lists. At each iteration, the process selects the infectious node with the minimal infection time-stamp (all time-stamps but are initialized to ), removes it from the infectious set and records its infector and infection time-stamp in the cascade. Then, for each node with a time-stamp greater than the one of , attempts to infect according to the probability (computed with eq. 3). If it succeeds, is inserted in the set of infectious nodes and a time is sampled for from an exponential law with parameter . If the new time for is lower than its stored time , this new time is stored in , is stored as the infector of in the table (used to build ) and the new state is computed according to its new infector . The generation process outputs a cascade structure (as described in the introduction of the previous section). Compared with the classical CTIC, the only changes are at three lines of the algorithm, respectively for the computation of , and .

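initialize the time-stamps of all nodes, the set of infectious nodes, the infector table and the cascade;
while the set of infectious nodes is not empty do
      select the infectious node with the minimal infection time-stamp;
      remove it from the set of infectious nodes;
      record its infector and its infection time-stamp in the cascade;
      for each node whose time-stamp is greater than the one of the selected node do
            compute the probability that the selected node infects it (eq. 3);
            if the infection attempt succeeds then
                  insert the node in the set of infectious nodes;
                  sample an infection time for it from an exponential law;
                  if the new time is lower than its stored time then
                        store the new time;
                        store the selected node as its infector;
                        compute its new state according to its new infector;
                  end if
            end if
      end for
end while
Output: the cascade structure;
Algorithm 1 Cascade Generation Process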
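The generation process described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the functions `infect_prob`, `delay_rate`, `next_state` and `init_state` are hypothetical placeholders for the model equations, sources are given time-stamp 0, every node is treated as a possible target, and the sampled delay is added to the infection time of the infector as in the classical CTIC.

```python
import math
import random


def generate_cascade(sources, nodes, infect_prob, delay_rate,
                     next_state, init_state, rng):
    """Return a cascade as a list of (node, infector, time) in infection order."""
    times = {v: math.inf for v in nodes}
    infector = {v: None for v in nodes}
    state = {}
    for s in sources:
        times[s] = 0.0            # assumption: sources start at time 0
        state[s] = init_state(s)
    infectious = set(sources)
    cascade = []
    while infectious:
        # Select the infectious node with the minimal infection time-stamp.
        u = min(infectious, key=lambda n: (times[n], n))
        infectious.remove(u)
        cascade.append((u, infector[u], times[u]))
        for v in nodes:
            if times[v] <= times[u]:
                continue          # v was infected before u (or is u itself)
            if rng.random() < infect_prob(state[u], v):
                infectious.add(v)
                t_new = times[u] + rng.expovariate(delay_rate(state[u], v))
                if t_new < times[v]:
                    times[v] = t_new
                    infector[v] = u
                    # The state of v depends on the state of its infector.
                    state[v] = next_state(state[u], v)
    return cascade


cascade = generate_cascade(
    sources=[0],
    nodes=list(range(6)),
    infect_prob=lambda z, v: 0.5,          # toy constant probability
    delay_rate=lambda z, v: 1.0,           # toy constant rate
    next_state=lambda z, v: 0.5 * z + v,   # toy recurrent update
    init_state=lambda s: 0.0,
    rng=random.Random(0),
)
for node, parent, t in cascade:
    print(node, parent, round(t, 3))
```

Because a node's stored time and infector can be overwritten until the node is itself selected, each node ends up attached to its earliest successful infector, which yields a tree-structured cascade.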

6.3 Learning Process

The learning process of our model is depicted in algorithm 2. In this algorithm, the function first creates minibatches by ordering in decreasing length and cutting this ordered list into bins of episodes each. Each bin contains 3 matrices with rows (except for the last bin, which contains matrices for the remaining episodes):

  • : a matrix where the cell () contains the -th infected node in the -th episode of the bin, or if the corresponding episode contains fewer than infected nodes. The width of the matrix is equal to the number of infected nodes in the longest episode in the bin (the episode in the first row of the matrix);

  • : a matrix where the cell () contains the infection time-stamp of the -th infected node in the -th episode of the bin, or if the corresponding episode contains fewer than infected nodes. The width of the matrix is equal to the number of infected nodes in the longest episode in the bin (the episode in the first row of the matrix);

  • : a binary matrix with columns where the cell () equals if the node is infected in the -th episode of the bin, otherwise.
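The construction of the bins can be sketched as follows. The function name `make_bins`, the padding values (-1 for nodes and -1.0 for times) and the convention that the binary matrix holds 0 for infected nodes and 1 otherwise are assumptions made for the example; plain nested lists stand in for matrices.

```python
def make_bins(episodes, n_nodes, bin_size, pad_node=-1, pad_time=-1.0):
    """Group episodes into bins of padded matrices (Inf, Times, NotInf).

    episodes: list of episodes, each a list of (node, time) sorted by time,
              with nodes identified by integers in [0, n_nodes).
    """
    ordered = sorted(episodes, key=len, reverse=True)
    bins = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        width = len(chunk[0])  # the longest episode is in the first row
        inf = [[ep[i][0] if i < len(ep) else pad_node for i in range(width)]
               for ep in chunk]
        times = [[ep[i][1] if i < len(ep) else pad_time for i in range(width)]
                 for ep in chunk]
        not_inf = []
        for ep in chunk:
            infected = {u for u, _ in ep}
            # Assumed convention: 1 marks a node that is not infected.
            not_inf.append([0 if v in infected else 1 for v in range(n_nodes)])
        bins.append((inf, times, not_inf))
    return bins


episodes = [
    [(0, 0.0), (2, 1.0)],
    [(1, 0.0)],
    [(3, 0.0), (0, 0.5), (1, 2.0)],
]
bins = make_bins(episodes, n_nodes=4, bin_size=2)
for inf, times, not_inf in bins:
    print(inf, times, not_inf)
```

Ordering episodes by decreasing length before cutting keeps episodes of similar length together, which limits the amount of padding in each bin.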

At each epoch, the algorithm iterates on every bin. For each bin, it first initializes the states of the infected nodes using a function

which produces a tensor

of matrices, each row of which is filled by (with and respectively the number of rows and columns in matrix ). For every step of infection in the bin, the process first determines in the rows of the matrices which correspond to episodes that have not ended ( refers to the column of ). Then, if the step is not the initial step , it uses functions and with nodes previously infected for each episode associated to their corresponding states . While the function returns a matrix where the cell contains the log-probability for the -th node in the -th episode to infect at its infection time-stamp (using a matrix version of equation 5), the function returns a matrix of the same shape where the cell contains the log-probability that the -th node in the -th episode does not infect before its infection time-stamp (using a matrix version of equation 6).

Then, ancestors at step are sampled from categorical distributions parameterized by

(deduced from logits

). From them, we compute the log-probability for each node infected at step to be actually infected at its infection time-stamp by its corresponding sampled infector (line 2, where is a function which returns the vector of the sums of each row from ). This quantity is added to the accumulator .
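For a single episode, this sampling step can be sketched as follows. The sketch assumes that the logit of a candidate infector is its log-probability of infecting at the observed time minus its log-probability of not infecting before that time; this follows from the fact that choosing a candidate means it infects while all the others do not, the sum of all non-infection terms being common to every candidate. Whether this matches the exact logits used in the paper is an assumption, and the names `sample_ancestor`, `log_a` and `log_b` are hypothetical.

```python
import math
import random


def sample_ancestor(log_a, log_b, rng):
    """Sample the infector of a node among previously infected candidates.

    log_a[j]: log-probability that candidate j infects at the node's time-stamp.
    log_b[j]: log-probability that candidate j does not infect before it.
    Returns (index, log-probability of infection by that candidate, probs).
    """
    total_b = sum(log_b)
    logits = [a - b for a, b in zip(log_a, log_b)]
    # Numerically stable softmax over the logits.
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    probs = [w / z for w in weights]
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # The sampled candidate infects, all the other candidates do not.
    log_p = log_a[idx] + total_b - log_b[idx]
    return idx, log_p, probs


log_a = [math.log(0.2), math.log(0.1)]
log_b = [math.log(0.5), math.log(0.5)]
idx, log_p, probs = sample_ancestor(log_a, log_b, random.Random(0))
print(idx, log_p, probs)
```

In the batched setting, the same computation would be applied row by row to the matrices of log-probabilities, restricted to the episodes that have not ended.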

Line 2 then computes the states for the nodes infected at step according to the states of the sampled ancestors in (via the function which is a matrix version of equation 2).

At the end of each iteration , the log-likelihood that the non-infected nodes in are indeed not infected by the nodes infected at step is computed via , which is a matrix version of equation 9. This quantity is added to the accumulator .

At the end of the bin (when ), a control variate baseline is computed by maintaining a list of the quantity vectors considered in . The baseline considered in the stochastic gradient for any episode is thus equal to the average of for this specific episode taken over the previous epochs.
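The per-episode baseline can be sketched as follows. The class name `EpisodeBaseline`, the use of 0 as the baseline for an episode that has not been seen yet and the scalar form of the tracked quantity are assumptions made for the example.

```python
from collections import defaultdict


class EpisodeBaseline:
    """Keeps, for each episode, the values observed at previous epochs."""

    def __init__(self):
        self.history = defaultdict(list)

    def baseline(self, episode_id):
        past = self.history[episode_id]
        # Assumption: no baseline correction the first time an episode is seen.
        return sum(past) / len(past) if past else 0.0

    def update(self, episode_id, value):
        self.history[episode_id].append(value)


def centered_signal(tracker, episode_id, value):
    """Subtract the average over previous epochs, then record the new value."""
    b = tracker.baseline(episode_id)
    tracker.update(episode_id, value)
    return value - b


tracker = EpisodeBaseline()
signals = [centered_signal(tracker, 7, v) for v in (-10.0, -8.0, -6.0)]
print(signals)
```

In a score-function gradient estimator, the centered value would multiply the gradient of the log-probability of the sampled ancestors, which reduces the variance of the estimate without biasing it as long as the baseline does not depend on the current sample.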

Finally, the gradients are computed and the ADAM optimizer is used to update the parameters of the model. Note that this algorithm does not use the gradient update given in eq. 14. It is based on and for every (rather than on the simplification given in eq. 13). This is equivalent but much more efficient, since in both cases needs to be estimated for sampling and is much easier to compute than ( involves a simple product while involves a sum of products).

1 Input: , , , , ,
2 ;
3 for  do
4       ;
5       for (Inf,Times,NotInf) in bins do
6             ; ;
7             ;
8             for  do
9                   ;
10                   if  then
11                         ;
12                         ;
14                         # Sample from
15                         ;
16                         ;
18                         # Compute
19                         ;
20                         ;
22                         ;
24                   end if
25                  ;
27             end for