A hierarchical model of non-homogeneous Poisson processes for Twitter retweets

02/06/2018 ∙ by Clement Lee, et al. ∙ 0

We present a hierarchical model of non-homogeneous Poisson processes (NHPP) for information diffusion on online social media, in particular Twitter retweets. The retweets of each original tweet are modelled by a NHPP, for which the intensity function is a product of time-decaying components and another component that depends on the follower count of the original tweet author. The latter allows us to explain or predict the ultimate retweet count by a network centrality-related covariate. The inference algorithm enables the Bayes factor to be computed, in order to facilitate model selection. Finally, the model is applied to the retweet data sets of two hashtags.



There are no comments yet.


This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Statistical modelling of online social media such as Twitter has become increasingly popular, because of the richness and availability of the data in temporal and topological aspects. As an introduction to the model for Twitter retweets proposed in this article, we will give a brief review for each of the two aspects.

1.1 Temporal dynamics

A common approach to modelling temporal dynamics is the use of one-dimensional non-homogeneous Poisson Processes (NHPP), which are one prominent kind of stochastic process. Specifically, if a sequence of events is assumed to arise from a NHPP with intensity function

, the random variable of the number of events within the interval

will follow a Poisson distribution with mean

, and is independent of the random number of events in any other disjoint interval. The special case where is constant over time is called the homogeneous Poisson process (HPP).

Examples of using the HPP for Twitter data include Sakaki et al. (2010), Perera et al. (2010), Kumar et al. (2014, 2015), and Mahmud et al. (2013), but it usually does not describe data realistically because it assumes the interarrival times of events are independent and identically distributed (iid) exponential random variables. Sanli and Lambiotte (2015) observe bursty dynamics and temporal fluctuations in tweets with hashtag #ledebat, over the two-week period leading to the 2012 French presidential election, and show the departure of the data from one simulated from a HPP without fitting a stochastic process. It is therefore natural that the more general NHPP, or extensions thereof, areis more often used in the literature. For example, in Smid et al. (2011), a NHPP is being used for semi-supervised detection of an anomaly in pollution-related tweets.

Quite often the intensity function is specifically designed or chosen to capture temporal patterns observed in the data. For example, Mathiesen et al. (2013) and Mollgaard and Mathiesen (2015) both fit a NHPP to the occurrences of international brand names on Twitter, which exhibit strongly correlated userused behaviour and bursty collective dynamics over time. The former incorporate long range temporal correlations in , resulting in interarrival times that are marginally distributed according to the power law, while the latter consider as a product of stochastic global user interest and approximately deterministic user activity over time. Such a way of splitting into two components is also seen in Shen et al. (2014), and Mathews et al. (2017). The former fit a NHPP to the popularity dynamics of Twitter hashtags and Physical Review papers, where is a product of a decreasing function of time and a term increasing linearly with the number of events, intended to capture the effect of how the attractiveness of an individual item ages, and the effect of preferential attachment, respectively. The latter fit a NHPP to the retweets of popular Twitter users with proportional to the product of and , and attempt to explain the two components by a decision-based queueing process, rather than preferential attachment, and loss of interest over time, respectively. The NHPP with this specific deterministic form of is termed the hybrid process, which will be the cornerstone of our proposed model.

While relaxing the stationarity assumption of the HPP leads to the NHPP, the former, as one of the simplest point processes, has many other properties (and therefore hashave alternative characterisations), such as the interarrival times being independent and identically distributed (i.i.d.) exponential random ariables, the independence of the random variables of the numbers of points in disjoint intervals, and that points are i.i.d. over a bounded region of time or space conditional on the total number of points in that region. This allows defining broader classes of point processes by generalising the HPP in various ways. For example, relaxing the exponential distribution assumption for the interarrival times gives rise to the renewal processes. Further allowing that the interarrival times form a Markov chain leads to the Wold process. Alternative joint distributions being specified for the points in a bounded domain given the total number of points leads to finite point processes. All these processes can be found in, for example,

Daley and Vere-Jones (2003).

Apart from generalising the HPP in various ways, extensions can be made for point processes in general. For example, if there are measurements (marks) associated with the locations of the points, the observed data can be modelled by a marked point process, in which the locations are modelled by a point process and the marks are further modelled by a distribution, possibly conditional on the locations. Also, while the (first-order) intensity function characterises the NHPP, higher-order intensity functions, which are defined in a similar way, can be used to complement the characterisation of other processes, such as the Markov point process, also known as the Gibbs process. For further references of these processes, please see, for example, Daley and Vere-Jones (2003) and Diggle (2013).

Another common approach to extending a NHPP is the incorporation of stochasticity in . This means that events are assumed to arise from a NHPP conditional on , which in turns arises from a separate stochastic process. One prominent example is the log-Gaussian Cox process (Møller et al., 1998). Regarding the models for Twitter data relevant to our research, one example is the aforementioned model by Mollgaard and Mathiesen (2015). Pozdnoukhov and Kaiser (2011) use a Markov-modulated Poisson process, in which varies according to a Markov process, in their application of identification and spatio-temporal analysis of topics on Twitter. Bao et al. (2015) propose a self-excited Hawkes process (SEHP), in which jumps simultaneously when an event occurs and decays before the next event occurs, and argue that such their model outperforms the one by Shen et al. (2014), in terms of prediction accuracy, when applied to the same set of data.

The SEHP is being used widely in different fields to capture triggering and clustering behaviour (Reinhart, 2018). One similar context to modelling Twitter data is earthquake modelling in seismology. Ogata (1988) developed the temporal epidemic-type aftershock sequence (ETAS) model, in which mainshocks arise from a NHPP called the background process, and aftershocks arise from a different NHPP triggered by the occurrence of a mainshock or an aftershock. While the Twitter original tweets and retweets can be seen as analogous to the mainshocks and aftershocks in the ETAS model, respectively, subtle differences to our proposed model exist and will be explained in Section 3. The spatio-temporal version of the ETAS model is reviewed in Chiodi and Adelfio (2017), with a focus on inference approaches.

One final aspect of extending a temporal NHPP or general point process is its spatial counterpart or spatio-temporal generalisation. The spatial component exists in some of the references above, such as Reinhart (2018) for the SEHP, but is not reviewed extensively here due to the temporal nature of our research. For further references, please see Cressie (1993), Illian et al. (2008), Daley and Vere-Jones (2008), Gelfand et al. (2010), Cressie and Wikle (2011), Diggle (2013) and Baddeley et al. (2015).

1.2 Topological Aspects

Twitter data usually comes with information such as number of followers or even who follows whom, that is, the directed edges in the user network, therefore enabling modelling of its social network, static or dynamic. For example, Bhamidi et al. (2015) collect tweets of specific topics associated with competing hashtags, and observe the departure of the degree distribution of the retweet network from one predicted by the classical preferential attachment model (Barabási and Albert, 1999)

. They propose a variant called the Superstar model, in which a vertex enters the network by connecting to either the lone superstar, with the same probability across all vertices, or the rest of the network otherwise, according to original preferential attachment rule.

Whenever the data permits, it is natural to extend a model to account for the temporal dynamics and the network structure simultaneously, see, for example, Li et al. (2014). Xie et al. (2011) and Wu et al. (2011) build a framework for data on Sina weibo, a Chinese counterpart of Twitter, that divides users into communities and models information generation, receiving, and processing and diffusion by the power law, a HPP, and a multiplicative model of individual reading habits and relation strength, respectively. It is also possible to model network structure and information diffusion simultaneously without using time as a dimension. Both Li et al. (2012) and Nishi et al. (2016) use a Galton-Watson branching process model, for data of video contents shared on online social networks, and reply trees in Twitter, respectively.

Usually and implicitly assumed in the models aforementioned is that the network, if concerned, remains unchanged throughout the observation period, which may be unrealistic for Twitter data given the ease of following other users. Therefore efforts have been made to model the dynamics of information diffusion and network evolution simultaneously, as it is natural to conjecture that they co-evolve over time. Antoniades and Dovrolis (2015) propose a tweet-retweet-follow model, which is characterised by events of a follower of a retweeter becoming also a follower of the original tweet author, conditional on the original tweet being created and retweeted. Farajtabar et al. (2015) consider the follower and the retweet adjacency matrices, and model the co-evolution through a system of dynamic equations of these two matrices. Srijith et al. (2017) assume no network evolution but incorporate the influence between users according to the network in a multivariate Hawkes process model, in which each user-topic pair has its own intensity function. Lim et al. (2016) introduce a Twitter-Network topic model, which comprises a hierarchical Poisson-Dirichlet process model for the text and hashtags, and a Gaussian process based random function model for the followers network, and is applied to a data set of tweets with certain keywords.

Instead of modelling temporal and network dynamics merely according to some stochastic processes, one can look into how network summary measures, such as number of followers, and other variables affect either or both of them, thus identifying useful covariates for predictions for retweet behaviour and network influence. Sutton et al. (2014) employ a negative binomial regression model for the retweet count, to investigate how message content/style and public attention to tweets relate to the retweet activity in a disaster. Zhu et al. (2011)

apply a logistic regression model to the data of whether a tweet is being retweeted from the point of view of a

follower. Hong et al. (2013) include a regression part in their co-factorization machines, which are for discovering topics users are interested in. They suggest that both network measures and content are important in determining retweets. However, the relationships among the covariates are not reported in all three analyses, thus presenting the risk of potential collinearity and overfitting.

Commonly observed and modelled in the aforementioned literature is the power law phenomenon. Examples in temporal aspects include interarrival times (Götz et al., 2009, Xie et al., 2011, Wu et al., 2011), and tweet or retweet rate (Mathiesen et al., 2013, Mathews et al., 2017), while examples in topological aspects include network degree (Li et al., 2014) and size and depth of reply trees (Nishi et al., 2016). Regarding network influence, the power law has been observed in retweet count (Hong et al., 2013), count of in-links for blogs (Götz et al., 2009), view count of videos (Miotto et al., 2017), and citation count (Shen et al., 2014). Interestingly, there have been no studies on the relationships among these variables following the power law.

1.3 Proposed model

The research reported in this article stemmed from investigating all tweets (both original and retweets) with two specific hashtags. Compared to the analysis by Shen et al. (2014), we dig one level deeper as we model the retweets by a collection of NHPPs conditional on the existence of the original tweets, simply called the originals hereafter. The specific NHPP used in the modelling is based on the hybrid process as in Mathews et al. (2017), but without resorting to discretising the data when it comes to inference. While it is straightforward to fit a hybrid process to the originals, as we will illustrate in Section 2, the novelty of this article is the hierarchical modelling of retweets. Specifically, all retweets of each original are modelled by a NHPP, with a latent component in

that depends on the follower count (of the author of the original) and determines the ultimate retweet count. All the retweet processes are in turn enveloped in one single hierarchical model so that information can be pooled to estimate the parameters.

There are a few merits of including follower count, which is essentially the in-degree of a user, and retweet count in the way described above, both of which are observed to follow the power law empirically. First, it presents a network centrality-related covariate as the potential driving force of retweet behaviour or network influence, while simultaneously modelling the temporal dynamics. Second, such a way of incorporating network summary measures enables us to capture any effect attributed to the power law phenomenon, while avoiding the overhead of an explicit network structure, which is usually computationally expensive to construct. Furthermore, directly using the follower count, which can vary over the observation period even for the same user, already partially accounts for the effect of network evolution over time. Finally, this model is generative in the sense that, conditional on the follower count of the authors of the originals (which can be easily generated by the power law), we can simulate a realistic process of processes, each of which corresponds to how retweets of a particular original grow over time.

The rest of the article is divided as follows. Introduced and explored in Section 2 are the two data sets, the subsets of which are fitted by the hybrid process and its special case. The hierarchical model for retweets is introduced in Section 3, with its likelihood derived. The inference algorithm and the model diagnostics procedure are outlined in Sections 4 and 5, respectively. The model is applied to the two previously introduced data sets in Section 6, as well as a simulated data set in Section 7 to show that the model is realistic and that the inference algorithm performs well. Section 8 concludes the article.

2 Data and exploratory analysis

A set of tweets (both originals and retweets) with the hashtag #thehandmaidstale was collected on 2017-06-14 for 21 hours after one episode of the relevant TV series was broadcasted. There are in total 2043 originals, 265 of which have been retweeted at least once during the observation period. The times of the originals are plotted on the left of Figure 1 usingthrough the means of a histogram, with bins of 10 minutes. For these 265 originals there are 971 retweets, while the most retweeted original, called the top original hereafter, has been retweeted 204 times. For each original, the cumulative retweet count is plotted over time on the left of Figure 2, meaning that there are 265 trajectories of different colours in total.

To examine potentially different tweeting behaviour of different hashtags, a set of tweets with the hashtag #gots7 was collected on 2017-07-16 for around 4.4 hours before the 7 season premiere of the TV series Game of Thrones was broadcasted. The numbers of originals, originals retweeted, total retweets and retweets of the top original are provided in Table 1 alongside the counterparts for #thehandmaidstale data reported above. The histogram of originals is plotted on the right of Figure 1, while the cumulative retweet counts over time for each original are plotted on the right of Figure 2.

#thehandmaidstale #gots7
Originals 2043 25420
Originals retweeted 265 3145
Total retweets 971 29751
Retweets of top original 204 3204
Table 1: Summaries of originals and retweets for the two data sets.
Figure 1: Histogram of originals over time for the #thehandmaidstale (left) and #gots7 (right) data.
Figure 2: Cumulative retweet counts over time for #thehandmaidstale (left) and #gots7 (right) data. Each trajectory is of a different colour and represents the growth of retweets of one individual original.

2.1 Hybrid process and the Duane plot

Another way of visualising the temporality of the data is through the Duane plot (Duane, 1964). In order to do so we have to first introduce the hybrid process and the power law process. Consider a NHPP with intensity function


where and , which is called the hybrid process hereafter. It is equivalent to the “power law with exponential cutoff” function by Mathews et al. (2017), but is different from the doubly stochastic processes introduced in Section 1 as is deterministic. The cumulative intensity is given by

where is the lower incomplete Gamma function such that is equal to the Gamma function . Now assume a sequence of events is generated from the hybrid process in the time interval , in which the -th event occurs at time , so that It is straightforward to write down the likelihood function:


The derivation of the likelihood for general temporal point processes, which also applies to (2), can be found in, for example, Daley and Vere-Jones (2003), Proposition 7.2.III. When , the hybrid process becomes the power law process (Bar-Lev et al., 1992). Each of the interarrival times follow a truncated Weibull distribution (Yakovlev et al., 2005). The power law process is different from a renewal process with power law distributed interarrival times, such as the event-modulated Poisson process termed by Masuda and Rocha (2017).

Going back to the aforementioned sequence of events, if we want to check if it is generated by the power law process, we can fit the hybrid process and then formally test whether is 0 using somewhatever estimation approach. However, there is also a diagnostic plot for checking whether the power law process is appropriate with no model fitting required. Observe that the expected number of events at , denoted by , should be close to , where is the number of events in the interval . Under the power law process, the former is given by , and so, if it is equal to , we have


This means that plotting , which is termed mean time between failures (MTBF) in reliability theory, against on the log-log scale should give approximately a straight line with slope and intercept . This is called the Duane plot (Duane, 1964), which serves as a useful tool for diagnosing if the power law process describes the data well, and is similar to judging whether the Weibull distribution is useful according to the survival log-log plot in survival analysis.

Figure 3: Duane plots of originals (left), where time is relative to the start of observation period, and retweets of the top original (right), where time is relative to when the original is tweeted, for #thehandmaidstale data. Overlaid is the theoretical line (dashed) according to (4) with .

The Duane plot for the 2043 originals of #thehandmaidstale data is shown on the left of Figure 3, where linearity is only observed at certain intervals. We also formally fit the power law process to the data, by maximising the (log-)likelihood in the second line of (3) with respect to and simultaneously, yielding , where denotes the maximum likelihood estimate (MLE) for any parameter , and a maximised log-likelihood of -9422.9. Fitting the hybrid process instead gives and a maximised log-likelihood of -9336.4. While the difference in the maximised log-likelihood between the power law process and the hybrid process suggests that the former is inadequate, the latter does not necessarily describe the data well enough.

On the right of Figure 3 is the Duane plot for the retweets of the top original (with 204 retweets) of #thehandmaidstale data, which shows linearity over the whole observation period apart from a few small troughs and a seemingly increasing positive slope, suggesting that the power law process may be sufficient compared to the more general hybrid process. This is confirmed by fitting the latter to the data, which gives , and no reduction in the maximised log-likelihood compared to the respective power law process fit. The dashed line overlaying the Duane plot represents (4) with , and its proximity provides further support to the adequacy of the power law process.

Figure 4: Duane plots of originals (left) and retweets of the top original (right) for the #gots7 data, overlaid by the theoretical line (dashed) according to (4).

For the 25420 originals of #gots7 data, fitting the power law process gives , while the hybrid process does not improve the fit with the same point estimates and . For the 3204 retweets of the top original, fitting the power law process gives maximised log-likelihood -5545.8 and , which are used to obtain the theoretical line overlaid in the Duane plot in Figure 4. The slight concavity of the Duane plot in the overlapping interval indicates possible inadequacy of the power law process and potential improvement by the hybrid process, which is supported by fitting the latter to obtain maximised log-likelihood -5401.1 and .

For each of the two hashtags considered, whether the power law process is adequate for the retweets of the top original should not be assumed to automatically apply to the retweets of every other original. For exploratory purposes, we overlay the Duane plots of retweets of the top 13 originals in Figure 5, all with over 300 retweets. That the slope is more similar across different Duane plots than the position is suggests that respective fits by the power law process (or the hybrid process) will give more similar estimates of than of . In terms of actual modelling, we will generalise the hybrid process in the hierarchical model for the retweets introduced in the next section, and formally test whether , which will be universal to all originals, is equal to 0 in Section 4 through model selection.

Figure 5: Duane plots of retweets of the top 13 originals for the #gots7 data. The thicker the line is, the more retweets the original has.

The temporality of the originals and retweets aside, we are also interested in explaining the retweet count, which is the outcome of the process generating retweets, by the follower count, which is observed once the original is tweeted. We assume that the -th original is tweeted at time , when the author of which has followers. At time , the end of the observation period, there are retweets observed for this original. Next, we define and to be the “log” retweet count and “mean-centred” follower count, respectively. While will be used as the covariate in the hierarchical modelling, the loose definition of the “log” retweet count is to ensure is finite, which will only be used for exploratory purposes in this section. The scatterplots of against for the retweets of the aforementioned data sets are shown in Figure 6

, indicating that linear regression may be appropriate on these scales. The full model in Section

3 adheres to such relationships as it essentially models the retweet count according to a Poisson regression, in which the mean is proportional to the exponential of a linear predictor of the covariate .

Figure 6: Log retweet count against mean-centred follower count for #thehandmaidstale data (left) and #gots7 data (right).

3 Hierarchical model and likelihood

For convenience, the terminology and notation in Section 2 are retained, but no model is assumed for how the originals are generated as it is not the concern of this section. Instead, modelled are the retweets of the -th original () observed in . The times of these retweets, denoted by such that , are assumed to arise from a generalised hybrid process with intensity


where , , , , and is the indicator function of event . The cumulative intensity at time follows directly:


Several modifications from (1) can be observed in (5). First, the shift from to in is due to the process of retweets taking place relative to the time original is tweeted. Second, while and are universal to the process of each original, is dependent on and assumed to take the form


Third, there is no need for an intercept term in as it is embedded in , and replaces in (1) as the scale component for the retweet count of the -th original. Finally, the inclusion of , along with other generalisations, is for comparison with the ETAS model, proposed by Ogata (1988) and mentioned in Section 1, later in this section.

Due to the nature of the NHPP, the retweet count for the -th original at time , follows the Poisson distribution with mean (terms constant to ). Effectively and implicitly, a Poisson regression model is assumed for by , even though the former is not directly modelled. The use of the NHPP enables the retweet times to be modelled while simultaneously explaining the retweet count by the “mean-centred” follower count.

If we sum all intensities of the retweet process of individual originals, we obtain the overall intensity of a point process of all retweets:


This seems similar to the conditional intensity of the temporal ETAS model


where is the history of all events up to time . The parameterisation is slightly different from thatas usually seen in the literature (Ogata, 1988, Chiodi and Adelfio, 2017, Reinhart, 2018), for the sake of easier alignment. The apparent differences include the presence of in (10), which however is usually given in modelling earthquake data, the inclusion of an extra parameter and a random effects term , which will be justified in our application in Section 6, and the additional exponential decay term over time suggested by Mathews et al. (2017). However, more important are the major differences in the underlying model structure. First, the ETAS model jointly models both the background events (originals in our case) and the triggered events (retweets) according to one point process, mainly because the nature of the events is not known prior to modelling, while in the proposed model concerned are the retweets given the originals. Such difference can be seen in (9) that is absent, which under the ETAS model means the originals are modelled by a HPP. Second, while the times in (9) refer to the background events (originals) only, the times in (10) refer to both type of events, hence the self-exciting nature of the process. This means under the ETAS model retweets can arise from other retweets, and different retweets essentially belong to a different and unobserved layer. On the other hand, under the proposed model, there is no self-excitation and only two layers of events exist, namely the originals from the background process, and the retweets that arise as offspring of the originals. Finally, under the ETAS model the layers, as well as the association between events of different layers, are unobserved and assumed by the model structure. Under the proposed model the relationships between the two completely known layers of events are observed and modelled accordingly.

Overall, our model deviates from the temporal ETAS model by removing the self-exciting nature of retweets. While it is possible to incorporate such structure, more information is required, such as the follower counts of the retweeters, in order to compute a less tractable likelihood. See, for example, Reinhart (2018) for the complexity of the required computations. Rather, we utilise the most valuable information, which is the correspondance between the originals and the retweets, to compute a completely tractable likelihood, which will be shown towards the end of this section, under a model that elegantly encompasses a collection of NHPPs of retweets.

It is useful to define a few vectors for the derivations in the rest of this section. We write


The two vectors and correspond to the generalised power law and hybrid processes, respectively. When there is no confusion, we simply write to represent this vector of scalar parameters. While the four vectors , , and are all of length , is a vector of latent variables whereas the others are given as data/covariates. Finally, is the vector of retweet times of the -th original, while is the collection of retweet times of all originals. The length of , or equivalently the total number of retweets, is denoted by .

Before writing out the likelihood, we introduce a parameter , which can take value 0 or 1, to represent model choice. When , the hierarchical model of the generalised power law process is the true model with parameter vector , in which is set to 0 and removed. When , the hierarchical model of the generalised hybrid process with is the true model with parameter vector . By treating the nested models as two competing models, the problem of testing whether becomes a problem of model selection, which can be achieved by utilising the output of the inference algorithm outlined in Section 4. Our algorithm requires the likelihood for each model as a function of :


The derivations of (11) and (12) are detailed in Appendix A. Note that, even though is seen in neither (11) nor (12) because of independence between and the data conditional on , it is included in for notational convenience.

4 Inference and the Bayes factor

The presence of the latent variables and the problem of model selection between and

prompt us to consider Bayesian inference for the proposed hierarchical model. We first assign the following independent and vaguely informative priors:



is the variance of a random variable

, and is the mean of a random variable . As the parameter space is not the same for and , we denote as the subset of not in , which means and , the null set. Assuming conditional independence of and given , the joint posterior of , , and is


where can be simplified to and is given by (8), while the last component is the prior model probability. The “true” prior of under is given by

where the components are given by (13). For , the pseudoprior vanishes as , while for , the pseudoprior can be written as equivalently. We proceed to draw samples of using Markov chain Monte Carlo (MCMC), in which model selection is facilitated by the modified version (Dellaportas et al., 2002) of Gibbs variable selection (Carlin and Chib, 1995). The MCMC algorithm is outlined as follows:

  1. The current values in the chain are , , and .

  2. Draw from its conditional posterior, with density proportional to
    , by a fairly standard component-wise Metropolis-within-Gibbs (MWG) algorithm, the details of which are given in Appendix B. Denote the value by .

  3. Draw from its conditional posterior, with density proportional to
    , by the same MWG alogithm in Appendix B. Denote the value by .

  4. If , draw from its pseudoprior . Denote the value by , and write . If , write , that is, the proposed value of with that of dropped, so that still holds.

  5. Draw from , its conditional posterior distribution. Essentially, set to 0 and 1 with probabilities and , respectively, where, using (14),

  1. [resume]

  2. Denote the drawn value in step 5 by . The current values are now , , and .

The pseudoprior is chosen to be close to the marginal posterior of under the competing model, denoted by , for the sake of optimisation (Dellaportas et al., 2002, Carlin and Chib, 1995). It can be informed by a pilot run of the MWG algorithm in Appendix B for model individually. As the priors for the overlapping parameters are the same for both models, only the pseudoprior and the prior are involved in step 5.

The draws of in the above algorithm where marginally represent an approximate sample from , that is, its posterior distribution under that model 0 is true; likewise for . What is more important, however, is the empirical proportion of , denoted by

, as it approximates the posterior probability that model

is true. Finally, the Bayes factor is the ratio of the posterior odds to the prior odds:


An alternative to Gibbs variable selection (GVS) for model selection is reversible jump Markov chain Monte Carlo (RJMCMC) (Green, 1995), which should theoretically give the same posterior probabilities for the model choice. It will be used to verify with the results of GVS in the application, and the details of its algorithm are given in Appendix C.

5 Model diagnostics

In this section, we augment the function arguments of the cumulative intensity in (6) by writing , where , , and are included whenever necessary. Under the proposed model, the random variable of the retweet count at time is Poisson distributed with mean (and variance) which, according to (6), is


The second equality can be seen by substituting (7) into (6). As is N distributed apriori,

is log-normally distributed with mean

and variance . This enables us to obtain the expectation and variance of the expected retweet count in (16) with respect to :

These two quantities can be used to obtain the discrepancy, which is a goodness-of-fit measure advocated by Gelman, Meng and Stern (1996):


This discrepancy is directly computable using the samples of and from the MCMC outlined in Section 4. At each iteration, using the observed retweet counts as and the current values of , the actual discrepancy, denoted by , can be obtained. On the other hand, using the current values of and , we can simulate the retweet count, for the -th original, from the Poisson distribution with mean . Plugging the whole set of simulated retweet counts as and the current values of into (17) yields the simulated discrepancy, denoted by . Finally, comparing the two sets of discrepancies will help us determine if there are any inadequecies of the model fit. This can be achieved by plotting against

and computing the associated posterior predictive

-value, which is the empirical proportion of . Both of these will be presented in Section 6. For details of diagnostics for Bayesian hierarchical models in general, please see, for example, Gelman et al. (1996) and Section 2.2.2 of Cressie and Wikle (2011).

6 Application

Figure 7: Traceplots (left) and posterior densities (right) of the parameters of models 0 (salmon, solid) and 1 (turquoise, dashed on the right) fitted to #thehandmaidstale data.

Both the model-specific algorithms in Appendix B and the model selection algorithm in 4 are applied to the two data sets with different hashtags. For the #thehandmaidstale data, each of the three algorithms is applied to the times of creation of the 2043 originals, 265 of which have been retweeted at least once, and of their associated retweets, to obtain a single chain of 20000 iterations, upon thinning of 2000, after discarding the first 1000000 as burn-in. The individual model fits are reported in the form of traceplots and posterior densities of the parameters in Figure 7. While the inclusion of

in model 1 makes a substantial difference in terms of the posterior densities of the other parameters, what is more important is how the evidence of each model weighs against each other. In the model selection algorithm, the prior probabilities

and are chosen artifically to be and , respectively. They are chosen this way not to represent our prior belief in the models, but, to ensure sufficient mixing between the two states of in the chain. Model 0 is selected for 7584 times out of 20000 iterations, meaning that the Bayes factor in (15) is estimated to be . The RJMCMC algorithm gives a similar estimate of . So, for the #thehandmaidstale data set, which consists of tweets for over 21 hours, the generalised power law process hierarchical model is more appropriate. Such findings are different from the exponential cutoff phenomenon shown by Mathews et al. (2017) for tweets collected over a similar duration of 24 hours.

Figure 8: Traceplots (left) and posterior densities (right) of the parameters of models 0 (salmon, solid) and 1 (turquoise, dashed on the right) fitted to #gots7 data.

The #gots7 data set consists of 25420 originals, 3145 of which have been retweeted once, and 29751 retweets. To exploit parallelisation due tofacilitate the much greater computational burden, for each of the two model-specific algorithms and the model selection algorithm, 40 chains are obtained, each of which contains 500 iterations, upon thinning of 1000, after discarding the first 500000 as burn-in. The traceplots and the posterior densities for the parameters are plotted in Figure 8. The proximity of the posterior densities for all parameters other than is similar to that for the #thehandmaidstale data, suggesting that the inclusion of does not improve fit much. This is supported by the model selection results via GVS. With the prior probabilities and chosen to be and , respectively, model 0 is selected for 7344 times out of 20000 iterations in total, meaning that the Bayes factor is estimated to be . The RJMCMC algorithm gives a similar estimate of , meaning that model 0 is highly favoured. That the genearlised power law process hierarchical model is more appropriate for the #gots7 data set, which consists of tweets for about 4.4 hours, is also consistent with the findings by Mathews et al. (2017) for tweets collected over a similar duration of 3 hours.

Figure 9: Predicted against actual retweet count for #thehandmaidstale (left) and #gots7 data (right) with 95% prediction intervals (dotted, red) for the model selected by GVS. For multiple originals with the same actual retweet count, the widest prediction interval is shown. The blue dashed line is the line .
Figure 10: Simulated against actual discrepancies for #thehandmaidstale (left) and #gots7 (right) data. The blue dashed line is the line .

As outlined in Section 5, at each iteration of the MCMC, a value can be drawn from the Poisson distribution with mean as the simulated (or predicted) retweet count of the -th original . Across all iterations, the simulated values can be seen as a sample from the posterior predictive distribution of the random variable of the retweet count. As a preliminary way of examining the goodness-of-fit of our model, the posterior predictive mean of the retweet count, under the selected model 0 for both data sets, is plotted against the actual retweet count in Figure 9 with 95% predictive intervals. That some points seemingly form a vertical line is because there are multiple originals with the same actual retweet count. While small counts are slightly under-predicted, our model is doing a very good job for moderate to large counts for both data sets. For either data set, the alternative model 1 gives very close predictions of the retweet count, which are therefore not plotted, to those predicted by the selected model 0.

Utilising the simulated retweet counts, summaries of which are presented in Figure 9, the formal diagnostic procedures outlined in Section 5 are carried out for both data sets. The simulated discrepancies are plotted against the actual discrepancies in Figure 10. The hovering of the points around the line shows no apparent lack-of-fit issues. This is supported by calculating the posterior predictive -values, which are 0.424 and 0.434, for #thehandmaidstale and #gots7 data, respectively.

7 Simulation and backfitting

For each of the two data sets, we simulate the retweet times from the NHPP with intensity (5), based on the original times and the “mean-centred” follower counts . The posterior means of the parameters in the respective selected models (model 0 for both data sets) are used as their true values in the simulation. The cumulative retweet counts over time for the simulated data are plotted in Figure 11. Comparing with their counterparts in Figure 2, the simulated data show a high degree of similarity in the overall pattern of retweet growth. Furthermore, the frequencies of retweet counts are equally highly similar between the simulated data and the actual data, as shown in Figure 12. Both visualisations indicate that our proposed model is very capable of generating realistic retweet times that look like the observed data.

Figure 11: Cumulative retweet counts over time for simulated data corresponding to #thehandmaidstale (left) and #gots7 (right) data. The posterior means of the parameters under model 0 are used as the true values in the simulation. Each trajectory is of a different colour and represents the growth of (simulated) retweets of one individual original.
Figure 12: Frequencies of retweet counts for actual (black, circles) and simulated data (red, triangles) for #thehandmaidstale (left) and #gots7 (right) data.
Figure 13: Posterior density of the parameters (except ) of models 0 (salmon, solid) and 1 (turqoise, dashed) fitted to the simulated data shown on the left of Figure 11. The vertical lines represent the respective true values of the parameters.

Not only is the model able to produce realistic realisations, but its inference procedure is also working properly and can recover the true parameter values. Specifically, we applied the MWG algorithm for the individual models (Appendix B) to the simulated data shown on the left of Figure 11. For each parameter (other than

), its true value lies within the 95% credible interval of the corresponding marginal posterior distribution, as shown in Figure

13. Furthermore, the subsequent model selection overwhelming chooses model 0, which is the true model, over model 1. The Bayes factors according to RJMCMC and GVS are and , respectively.

8 Discussion

In this article we have proposed a Bayesian hierarchical model of generalised hybrid processes, which is shown to model well the retweets of the originals of a specific hashtag. The application to both #thehahndmaidstale and #gots7 data suggests that it is sufficient to fit a special case of the proposed model, which is the hierarchical model of generalised power law processes. Also, incorporating an original-specific scale allows us to explain retweet count by the follower count, which seems a natural candidate for driving retweet behaviour.

Whether the squared term should be included in can be examined by carrying out GVS or RJMCMCGibbs variable selection (GVS) for , in the same way it is for in our inference and application. In fact, as GVS was originally developed for determining whether covariates in a linear regression model should be included or not, we could include terms of higher powers in the linear predictor, and perform GVS for the associated parameters (and and ) simultaneously. However, as the overall fit is adequate while is excluded from the support of the marginal posterior of for both data sets, and as our focus is on the overall structure of the hierarchical model instead of the particular form of the linear predictor, we confine the GVS to only in this article.

While there are temporal fluctuations and bursty dynamics shown by, for example, Sanli and Lambiotte (2015), Mathiesen et al. (2013) Mollgaard and Mathiesen (2015), for their respective Twitter data sets, our results should not been seen as contradictory, for two reasons. First, their data were collected over several weeks, during which injections of interest due to external events were possible, while the data collection period was much shorter in both of our data sets, in which the generation of tweets and retweets was predominantly due to the broadcast of a TV series episode, the time of which was pre-specified. Second, our hierarchical model concerns the behaviour of information diffusion, in the form of retweets, which may be different from that of information generation, in the form of originals. Had the data been collected over a much longer time span, fitting the generalised hybrid process to the originals, as we did in Section 2, might not be appropriate anymore. Therefore, even though the generalised hybrid process may not be applicable to all kinds of online social media data of all timescales, it should still be useful to data regarding information diffusion within a time span of a single event with no apparent changepoints.

The parameters and are assumed to be common to all generalised hybrid processes of the originals. We argue that, for and , the between-original variation is already captured by the covariate together with , while for there is simply no extra information to assume otherwise. It might however be useful to incorporate a hierarchical structure to and simultaneously, to allow variation in the rate of decay of information diffusion between originals of the same hashtag. The parsimony of the proposed model also makes it easy to simulate retweets given originals and follower counts. Given the parameters, we can directly simulate , and subsequently the retweet times using (8), (7) and (5), respectively. If appropriate, we can go one step further by first simulating the originals from a more general point process, followed by the simulation of the retweets.

Another direction for further study is the generation of follower count. While both the follower count and the retweet count in our data sets empirically follow the power law, it is not used to characterise either of them. Instead of treating the former as a given covariate, it is possible to first draw the follower count from the power law, then draw retweet count given the follower count, via the implicit Poisson regression model.


Data supporting this publication is openly available under an ’Open Data Commons Open Database License’. Additional metadata are available at: http://dx.doi.org/10.17634/154300-57. Please contact Newcastle Research Data Service at rdm@ncl.ac.uk for access instructions. This research was funded by the Engineering and Physical Sciences Research Council (EPSRC) grant DERC: Digital Economy Research Centre (EP/M023001/1).


Appendix A Likelihood derivation

This appendix derives the likelihood of the hierarchical model for retweets in Section 3. We first consider the contribution of the -th original regardless of the model choice, or the value of equivalently. The contribution to the likelihood is

The general expression above is essentially the same as (2). The complete likelihood is


When , that is, and is removed from the parameter vector, substituting (5) and (6) into (18) yields

which is identical to (11) as . When , that is, , (18) becomes