Fast Estimation of Causal Interactions using Wold Processes

07/12/2018 ∙ by Flavio Figueiredo, et al. ∙ Universidade Federal de Minas Gerais

We here focus on the task of learning Granger causality matrices for multivariate point processes. To accomplish this task, our work is the first to explore the use of Wold processes. By doing so, we are able to develop asymptotically fast MCMC learning algorithms. With N being the total number of events and K the number of processes, our learning algorithm has a O(N(log N + log K)) cost per iteration. This is much faster than the O(N^3 K^2) or O(K^3) costs of the state of the art. Our approach, called Granger-Busca, is validated on nine datasets from the Snap repository. This is an advance over most prior efforts, which focus mostly on subsets of the Memetracker dataset. Regarding accuracy, Granger-Busca is three times more accurate (in Precision@10) than the state of the art on the commonly explored subsets of the Memetracker data. Due to Granger-Busca's much lower training complexity, our approach is the only one able to train models on the larger, full datasets.




1 Introduction

In order to understand complex systems we need to know how their components (or entities) interact with each other. Networks (or graphs) offer a common language to model such systems, where their entities are represented by nodes and their interactions by edges Barabasi2016 . The networked point process is a stochastic model for these systems, when data take the form of a time series of random events observed in each node. That is, in each node of a network we have a temporal point process, which is a random process whose realizations consist of the times of isolated events. Networked point processes are probabilistic models designed to analyze the influence that events occurring in a node may have on the events occurring on other nodes of the network.

Recently, several fields have used networked point processes to understand complex systems such as spiking biological neurons Monti2014 , social networks Blundell2012 ; Silva2009 , geo-sensor networks Hallac2017 , financial agents in markets Namaki2011 , television records Xu2016 and patient visits Choi2015 . One of the main objectives in these analyses is to uncover the causal relationships among the entities of the system, or the interaction structure among the nodes, also called the latent network structure. Typically, this is represented by a graph where edges connect nodes that affect each other and edge weights represent the intensity of this pairwise interaction. To the best of our knowledge, all methods that tackle this problem are based on the Hawkes point process Hawkes1971A ; Hawkes1971B with a Granger causality framework Granger1969 imposed to retrieve the causal graph from data Achab2017 ; Xu2016 ; Zhou2013 ; Linderman2015 ; Linderman2014 . A point process b is said to Granger cause another point process a when the past information of b can provide statistically significant information about the future occurrences of a. We can thus encode causal relationships as a matrix Dhamala2008 ; Didelez2008 . In a multivariate point process, this notion of causality can be reduced to measuring whether the conditional intensity function of a is affected by the previous timestamps of b Xu2016 .

In contrast with the popular choice of using Hawkes processes to model interacting processes, Wold processes Wold1948 have been neglected as a possible model. Wold processes are a class of point processes defined in terms of the joint distribution of inter-event times. Let δ_i be the waiting time for the i-th event since the occurrence of the (i−1)-th event. The main characteristic of Wold processes is that the collection of inter-event times {δ_i} follows a Markov chain of finite memory. That is, different from Hawkes processes, whose intensity function depends on the whole history of previous events, the probability distribution of the i-th inter-event time depends only on the previous inter-event time δ_{i−1}. Given that Wold processes are able to mimic the dynamics of complex systems VazdeMelo2013 ; VazdeMelo2015 , this Markovian property can potentially boost the performance of learning algorithms, as in our setting. Wold processes were proposed over sixty years ago; however, they remain scarcely explored in machine learning, which is unfortunate. As we demonstrate in this paper, Wold processes can be both fast and accurate for some learning tasks.

We here present the first approach to tackle the discovery of latent network structures in point process data using Wold processes. Similar to recent efforts Achab2017 ; Xu2016 , our goal in this work is to extract Granger causality Granger1969 from multivariate point process data only. The main reason to consider the Wold process as an alternative to the Hawkes process is its Markovian structure. By adding Dirichlet priors over the mutual influences among the processes, we exploit the Markov property to develop learning algorithms that are asymptotically fast. We call our approach Granger-Busca. With N being the number of observations in all processes and K the number of processes, the state of the art approaches learn at a cost of O(N^3 K^2) per iteration Xu2016 (with additional multiplicative factors defined by hyper-parameters) or O(K^3) Achab2017 . Granger-Busca, in contrast, learns at a cost of O(N(log N + log K)).

Equally important, our results show significant improvements over the state of the art methods. For instance, when we measure the Precision@10 score between our Granger causal estimates and the ground-truth number of mentions of the commonly explored Memetracker data Leskovec2009 , our results are at least three times more accurate than the most recent and most accurate method Achab2017 .

In summary, our main contributions are: (1) We present the first approach to extract Granger causality matrices via Wold processes; (2) We develop asymptotically fast algorithms to learn such matrices; (3) We show how Granger-Busca is much faster and more accurate than state of the art methods, opening up the potential of Wold processes for the machine learning community.

2 Background and Related Work

A temporal point process is a probability model for a collection of times indexing the occurrences of random isolated events. Our context considers K point processes observed simultaneously, where each event is associated with a single point process P_a, a = 1, ..., K. Let P be the union of all timestamps from all point processes, with N = |P| being the total number of events. We denote by N_a(t) the random cumulative count of the number of events up to (and including) time t in process a. The collection {N_a(t) : t ≥ 0} is an equivalent representation of the point process P_a.

The conditional intensity rate function is the fundamental tool for modeling and making inferences on point processes. Let H_a(t) be the random history of process a up to, but not including, time t, called the filtration. Similarly, H(t) is defined as the filtration considering the collection of all point processes. In the limit, as h ↓ 0, the conditional intensity rate function is given by λ_a(t | H(t)) = lim_{h↓0} E[N_a(t+h) − N_a(t) | H(t)] / h. The interpretation of this function is that, for a small time interval h, the value of λ_a(t | H(t)) h is approximately equal to the expected number of events in (t, t+h]. It can also be interpreted as the probability that process a has at least one event in that interval. The notation emphasizes that the conditional intensity at time t depends on the random events that occurred previously.

The commonly explored Hawkes process is defined by its set of conditional intensity functions:

λ_a(t | H(t)) = μ_a + Σ_b Σ_{t_i ∈ P_b : t_i < t} φ_{ba}(t − t_i)     (1)

We can consider processes as being enumerated from 1 to K. μ_a captures the exogenous (Poissonian) background arrival rate of process a. φ_{ba} captures the influence of the point process b on a. φ_{ba} has to be integrable, non-negative, and with φ_{ba}(t) = 0 when t < 0. Usual forms of φ_{ba} are the exponential and power-law functions Bacry2015 . With λ_a(t | H(t)) from Eq (1), there is evidence that b Granger causes a when there exists a time t where φ_{ba}(t) > 0 Xu2016 .
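As an illustration of Eq (1), the sketch below evaluates each process's intensity under the common exponential kernel φ_ba(d) = α_ba · ω · e^{−ωd}. The function and parameter names are ours (not from the paper), and the nested loop is the naive evaluation, not an optimized one.

```python
import math

def hawkes_intensity(t, events, mu, alpha, omega):
    """Evaluate the K-process Hawkes intensity at time t (Eq 1) with the
    exponential kernel phi_ba(d) = alpha[b][a] * omega * exp(-omega * d).

    events[b] is the sorted timestamp list of process b; mu[a] is the
    exogenous background rate of process a."""
    K = len(events)
    lam = list(mu)  # start from the background rates
    for b in range(K):
        for t_i in events[b]:
            if t_i >= t:
                break  # only strictly previous events contribute
            d = t - t_i
            for a in range(K):
                lam[a] += alpha[b][a] * omega * math.exp(-omega * d)
    return lam
```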

In contrast to the Hawkes process, a Wold process Daley2003 ; Wold1948 is defined in terms of a Markovian transition on the inter-event times (increments). Let {δ_i} be the stochastic process of time increments associated with point process P. The Markovian transition between increments is established by the transition probability density p(δ_i | δ_{i−1}), which measures the likelihood of δ_i given the value of the previous increment δ_{i−1} Wold1948 ; Daley2003 .

It is important to point out that, for most forms of p(δ_i | δ_{i−1}), the model is analytically intractable Guttorp2012 . However, when the transition has an exponential form, the model has several interesting properties Cox1955 ; Daley1892 . In this particular form, the next increment is exponentially distributed with rate λ_i = α + β δ_{i−1}, where δ_{i−1} is the previous increment. For a particular case of this rate, the work of Cox Cox1955 and Daley Daley1892 ; Daley2003 derives the stationary distribution of the increments in closed form.
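A minimal simulation sketch, assuming the exponential Wold rate λ_i = α + β · δ_{i−1}; seeding the chain's first increment from the base rate alone is our own choice, made only so the example is self-contained.

```python
import random

def simulate_exp_wold(alpha, beta, n, seed=0):
    """Draw n inter-event times of the exponential Wold process: each
    increment is exponential with rate alpha + beta * (previous increment)."""
    rng = random.Random(seed)
    deltas = [rng.expovariate(alpha)]  # assumption: first draw uses the base rate
    while len(deltas) < n:
        deltas.append(rng.expovariate(alpha + beta * deltas[-1]))
    return deltas
```

Note how a large previous increment raises the rate and thus tends to shorten the next waiting time, producing the burst/idleness cycles discussed later.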

Granger-Busca is motivated by recent efforts Alves2016 ; VazdeMelo2015 ; VazdeMelo2013 that employ variations of the exponential Wold process (defined above). Instead of defining the Wold process in terms of its exponential interval rate, such efforts define the process in terms of the conditional mean E[δ_i | δ_{i−1}] of an exponentially distributed random variable. This class of point processes is called self-feeding processes (SFP). For a particular parameter choice involving the Euler constant e, VazdeMelo2015 ; VazdeMelo2013 showed that the stationary behavior can be very well approximated by the more tractable Log-Logistic distribution. This form of specifying a Wold process is interesting because it is able to capture both exponential and power-law behavior, disparate behaviors simultaneously observed in real data Alves2016 ; VazdeMelo2015 ; VazdeMelo2013 . Realizations of this process tend to generate bursts of intense activity followed by long periods of silence.

Busca Alves2016 is another point process model based on Wold processes and it is Granger-Busca's starting point. Starting from an SFP model, to accommodate the less frequent long waiting times observed in some real datasets, the process adds a background Poissonian rate μ. The conditional intensity can thus be derived as:

λ(t | H(t)) = μ + α / (β + δ(t))     (2)

where δ(t) = t_{ℓ(t)} − t_{ℓ(t)−1}, with ℓ(t) being the index associated with the closest event previous to t. δ(t) is thus the previously observed increment.
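A sketch of evaluating the Busca intensity at an arbitrary time t, assuming the form λ(t) = μ + α/(β + δ(t)); the binary search over the sorted event list mirrors the O(log N) lookup exploited later by the learning algorithm.

```python
import bisect

def busca_intensity(t, events, mu, alpha, beta):
    """Busca conditional intensity lambda(t) = mu + alpha / (beta + delta(t)),
    where delta(t) is the increment between the two events preceding t.
    events must be sorted; the lookup is O(log N) via binary search."""
    i = bisect.bisect_left(events, t)  # events[i - 1] is the closest previous event
    if i < 2:
        return mu  # not enough history to form an increment
    delta_t = events[i - 1] - events[i - 2]
    return mu + alpha / (beta + delta_t)
```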

Over the years, several Hawkes methods have been created for different applications with varying asymptotic costs Bao2017 ; Blundell2012 ; Chen2017 ; Chevallier2015 ; Li2017 ; Rizoiu2017 . We now discuss the methods most related to Granger-Busca. Achab et al. Achab2017 present an O(K^3) approach. Xu et al. Xu2016 learn kernels at a cost of O(M N^3 K^2) per iteration, where M represents the number of basis functions used to approximate the kernel non-parametrically. Similarly, Yang et al. Yang2017 discuss a non-parametric Hawkes model that does not explore the infinite memory. The authors show that after a sufficient number of events, the kernel is adequately estimated. However, the proposed method's cost still grows with both M, again a pre-defined number of functions used to approximate the kernel, and the maximum number of previous events to be considered [1]. A similar proposal is presented by Etesami et al. Etesami2016 . In contrast, Granger-Busca has a computational complexity of O(N(log N + log K)) per iteration. We achieve this low cost by employing Bayesian inference where, at each step, the algorithm learns, for each event in P, which other process, if any, caused that event. Past efforts have employed similar sampling approaches on Hawkes based models Du2015 ; Linderman2014 ; Linderman2015 , which again do not scale.

[1] The authors discuss a lower cost for a parallel algorithm with fixed hyper-parameters. While this is correct asymptotically, we compare, for all methods, the non-parallel cost with hyper-parameters. In practice, hyper-parameters have a multiplicative effect on learning time. Moreover, for every parallel method (Granger-Busca included), such a reduction is only possible via unbounded parallelization (one process per CPU), being unfeasible for large data.

It is important to point out that Isham Isham1977 was one of the first and few to discuss multivariate Wold processes. However, the author was mostly focused on analytical properties of the Multivariate Wold Process on a very specific setting (an infinite process defined on the unit circle).

3 Model

(a) Granger-Busca’s clustering behavior
(b) Granger-Busca’s usage of increments
(c) Rate and Number of Events
Figure 1: Granger-Busca at work. Plot (a) shows the events of two processes, a (triangles) and b (circles). The arrows show the excitement component of the model. Plot (b) illustrates how δ(t) and Δ_{ba}(t) are calculated. Plot (c) shows the cumulative random processes N_a(t) and N_b(t) in the top, while the bottom plot shows the random conditional intensity functions λ_a(t | H(t)) and λ_b(t | H(t)).

We define Granger-Busca using Figure 1, which exemplifies the behavior of the model with two processes. In (a), we show the events of each process as a horizontal line of symbols: triangles in the upper row for process a, and circles in the bottom row for process b. The unfilled symbols represent events that are caused in an exogenous way, not triggered by any other event. We shall detail how to label events in Section 4. The filled symbols are events that appear as a causal influence from some other previous event. We say that these events have been excited by the previous occurrence of other events. A directed edge connects the origin event to the resulting event. Edges crossing the two parallel lines of events represent the cross-excitement component. In this case, one event in a given process stimulates the occurrence of events in another process. Another situation is due to self-excitement, when events in a given process stimulate further events in the same process. These are represented by the horizontal arrows in Figure 1(a). Figure 1(c) shows the behavior of the random N_a(t) and the random conditional intensity function λ_a(t | H(t)). Notice that λ_a(t | H(t)) behaves like a step function. That is, the rate of arrivals is fixed until the next arrival. The figure also shows the burst-idleness cycle observed with Granger-Busca, as well as the cross-excitation.

To formalize Granger-Busca, let us first redefine δ(t) as a function of t. Recall that ℓ(t) is the index of the previously observed event in P that is closest to t, and that δ(t) is the last observed increment. The function notation will simplify our extension to a multivariate Wold process. See Figure 1(b) for an illustration. Let us now define ℓ_b(t) as the index of the event in P_b that is closest to, and before, t. With such indices, we can denote by Δ_{ba}(t) the time gap between the most recent event of process a and the closest preceding event of process b. When the expected values of Δ_{ba}(t) are small, events in P_b usually precede events in P_a. In this sense, one intuition on how Granger-Busca works is that larger observed values of Δ_{ba}(t) lead to weaker evidence for the influence of process b on process a.

The above behavior motivates Granger-Busca’s multivariate conditional intensity function:

λ_a(t | H(t)) = μ_a + Σ_b α_{ba} / (β_b + Δ_{ba}(t))     (3)

The random intensity function is the sum of two components. The first one is μ_a and it represents the exogenous events in process a that are generated at a Poissonian constant rate by unit of time. The other component is a sum over all processes, including process a itself. The terms in this sum represent the increment on the baseline rate contributed by other previous events from a itself (self-excitement) or from other processes (cross-excitement). Based on a very sparse representation, the entire history is concentrated only on the most recent time gaps between events. Hence, the influence of process b on process a is represented by the ratio α_{ba} / (β_b + Δ_{ba}(t)). The numerator is a scale factor measuring the amount of cross-excitement: when α_{ba} is equal to zero, there is no influence from b on a. The denominator models how this cross-excitement takes place. At time t, we add the time gap Δ_{ba}(t) to the flat value β_b. A large gap makes the contribution of the ratio small relative to the baseline rate μ_a. Otherwise, a small gap raises this contribution up to its maximum possible contribution rate of α_{ba} / β_b.
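Eq (3) can be sketched directly. Here the time gaps Δ_ba(t) are assumed to be precomputed (e.g., via binary searches over the event lists), and G[b][a] stands for α_ba; both names are our own.

```python
def gb_intensity(a, gaps, mu, G, beta):
    """Granger-Busca intensity of process a (Eq 3):
    lambda_a = mu[a] + sum_b G[b][a] / (beta[b] + gaps[b]),
    where gaps[b] is the observed time gap Delta_ba(t)."""
    return mu[a] + sum(G[b][a] / (beta[b] + gaps[b]) for b in range(len(mu)))
```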

Model parameters and definitions: Θ = {μ, β, G} contains the full set of model parameters. G = [α_{ba}] is the Granger matrix of the model. We require that α_{ba} ≥ 0 and that Σ_a α_{ba} = 1 for every b. Hence, the value in cell (b, a) is proportional to the fraction of events from process b that triggered events on a. By definition, G is row stochastic Horn1990 . This property leads to several interesting characteristics of the model, which we further develop throughout the rest of this section. The vector β captures the overall influence strength for each process b. That is, when a process influences another, it does so by exponentially distributed inter-event times whose mean involves β_b. As we now show, to guarantee stationary conditions, it is necessary that β_b ≥ 1. Finally, we consider that each event has a latent label indicating either that it is exogenous or which process caused it (edges in Figure 1(a)). Estimation of Θ would be trivial if these labels were known. Thus, our learning algorithm (see Section 4) focuses on estimating such labels from data.

As pointed out, Granger-Busca’s stationarity (some authors also call this property stability Daley2003 ; Linderman2014 ) depends on β. Let B be the diagonal matrix where each diagonal cell is equal to 1/β_b. Moreover, let Φ = B G. Each value α_{ba}/β_b in Φ is the maximum cross-intensity for a pair of processes. If the influencing processes b are represented in the rows and the influenced processes a in the columns, the expected number of events triggered at an infinitesimal region is bounded by the row sums of this matrix. This value is simply Granger-Busca’s cross-feeding intensity without the exogenous factor. Now, let ||Φ||_∞ be the induced ∞-norm (maximum row sum).

Definition 1 (Stationarity): Notice that ||Φ||_∞ ≤ 1 Horn1990 . The proof for this property is straightforward and has interesting implications. As G is row-stochastic, we have ||G||_∞ = 1. Also, ||B||_∞ ≤ 1, since B is diagonal with entries 1/β_b and β_b ≥ 1. The matrix B will either scale down this norm or keep it unchanged. The spectral radius thus satisfies ρ(Φ) ≤ ||Φ||_∞ ≤ 1.

Due to the above definitions, the model is stationary/stable as it will never generate infinite offspring. In fact, the expected number of offspring at any given time is bounded by the row sums of Φ. Next, we explore the definition of Granger causality for point processes Didelez2008 ; Xu2016 .
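The sufficient condition can be checked numerically. This sketch assumes our reading of Definition 1, with Φ = BG and B = diag(1/β_b):

```python
def is_stable(G, beta):
    """Check that the maximum row sum (infinity norm) of Phi = diag(1/beta) @ G
    does not exceed 1; this holds whenever G is row stochastic and beta_b >= 1."""
    norm = max(sum(x / b for x in row) for row, b in zip(G, beta))
    return norm <= 1.0
```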

Definition 2 (Granger Causality): A process a is said to be independent of any other process when λ_a(t | H(t)) = λ_a(t | H_a(t)) for any t. In contrast, b is defined to Granger cause a when λ_a(t | H(t)) ≠ λ_a(t | H_a(t)), where H_a(t) ⊂ H(t). As a consequence, given two processes a and b whose intensity functions follow Eq (3), Granger causality arises when α_{ba} > 0 [2].

[2] When interpreting results from Granger-Busca, we relax the above condition of strict independence to α_{ba} ≈ 0. In such cases, we can state that we have no statistical evidence for Granger causality.

4 Learning Granger-Busca

Recall from Figure 1 that Granger-Busca exhibits a cluster-like behavior. That is, exogenous observations arrive at a fixed rate, with each observation being able to trigger new observations, leading to a burst/idleness cycle. Based on this remark, we developed our Markov Chain Monte Carlo (MCMC) sampling algorithm to learn Granger-Busca from data. Our algorithm works by sampling, for each observation t_i, the hidden process (or label) that likely caused this observation. In other words, we sample a latent variable z_i, which takes the value b when process b influences t_i. When the stamp is exogenous, we set this value to a reserved constant. Such a number merely represents a label (exogenous) and does not affect our sampling.

To simplify our learning strategy, we decided to fix β_b = 1. We notice that such a value proved to be sufficient for Granger-Busca to outperform state of the art baselines (see Section 5). Later in this section, we shall discuss how our learning algorithm is easily adaptable to general forms of the intensity function. α_{ba} is estimated based on the number of events from b that caused events on a, whereas μ_a is estimated as the maximum likelihood rate for a Poisson process based on the exogenous events of a. We can thus learn Granger-Busca with an Expectation Maximization approach. Hidden labels are estimated in the Expectation step. The matrix G can also be readily updated in such a step. With the labels fixed, μ is estimated in the Maximization step. Next, we discuss our fitting strategy.

Initially, we exploit the Markovian nature of the process to evaluate Δ_{ba}(t) at an O(log N) cost. Given some label assignment for the events, we obtain Δ_{ba}(t) with a binary search over P_a and P_b. We explain now how we update the labels. Given any event at time t in process a, the event is either exogenous or induced by some other previous event on P_a or from some other process P_b. By the superposition theorem Daley2003 ; Linderman2014 ; Linderman2015 ; Du2015 , we obtain the conditional probability that an individual event at t is exogenous or was caused by process b (where b can be equal to a for SFP-like behavior) as:


P[b ⇝ t | H(t)] = (α_{ba} / (β_b + Δ_{ba}(t))) / λ_a(t | H(t));   P[exogenous | H(t)] = μ_a / λ_a(t | H(t))     (4)

where the operator ⇝ indicates causality. Notice that b ⇝ t_i is equivalent to z_i = b. Eq (4) is carried out conditionally on the history H(t).
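Eq (4) amounts to each component's share of the total intensity. A sketch, assuming as before that the gaps Δ_ba(t) are precomputed and that G[b][a] stands for α_ba:

```python
def label_probabilities(a, gaps, mu, G, beta):
    """Superposition probabilities (Eq 4) that an event of process a is
    exogenous, or was caused by each process b. Returns (p_exo, p_by_process)."""
    contribs = [G[b][a] / (beta[b] + gaps[b]) for b in range(len(mu))]
    total = mu[a] + sum(contribs)  # total intensity lambda_a(t)
    return mu[a] / total, [c / total for c in contribs]
```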

We can substantially accelerate our evaluations by using a modified binary Fenwick Tree Fenwick1994 (the F+Tree Yu2015 ) data structure if we break the label assignment into two steps. First, we decide if the event is exogenous. Given that it is not, we select the inducing process based on the conditional probability:

P[b ⇝ t_i | not exogenous, H(t_i)] ∝ α_{ba} / (β_b + Δ_{ba}(t_i))     (5)

The evaluation of the probability that an event is not exogenous has an O(1) cost because we estimate it using t_e ≜ the last event before t_i that arrived from an exogenous factor [3]. This probability is the complement of the probability that zero Poissonian events happened between t_e and t_i. As G is row stochastic, we first add a Dirichlet prior over each row of this matrix. We finally sample the hidden labels as follows:

[3] The operator ≜ means equality by definition.

For each process b:

  1. Sample row b of the Granger matrix G from its Dirichlet posterior.

For each process a:

  1. For each observation t_i in P_a:

    1. Sample the label z_i:

      1. When t_i is deemed exogenous, assign the exogenous label.

      2. Otherwise, sample the inducing process b according to Eq (5).

Model parameters are estimated through an MCMC sampler. Starting at an arbitrary random state (i.e., label assignment), let n_{ba} be the number of times process b influenced process a. Similarly, n_b captures the number of times b influenced any process, including itself. The conditional probability of the hidden labels for every observation factors over events; labels are thus sampled based on the collapsed Gibbs sampler Steyvers2007 . Thus, we re-write Eq (5) below. In this next equation, we point out the Proposal and Target Distributions used in the Metropolis based sampler (discussed next).

P[z_i = b | z^{−i}, P] ∝ α̂_{ba} / (β_b + Δ_{ba}(t_i))     (6)

where α̂_{ba} is the current estimate of α_{ba} based on the Dirichlet-smoothed counts, with n^{−i}_{ba} being the count for the pair (b, a) excluding the current assignment for t_i and n^{−i}_b being similarly defined. Thus, the full algorithm follows an EM approach. After labels are assigned in the Expectation step, we can compute μ_a for every process by simply estimating the maximum likelihood Poissonian rate. Given that it takes O(log N) time to compute Δ_{ba}(t_i) and O(K) time to compute Eq (5), the total sampling complexity per learning iteration for the Gibbs sampler will be of O(N(log N + K)).
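The per-event Gibbs weight can be sketched with count matrices. The exact Dirichlet smoothing below (prior added per cell, K times the prior in the denominator) is our assumption about the garbled equation, not a quote from the paper:

```python
def gibbs_weights(a, gaps, n_ba, n_b, beta, prior):
    """Unnormalized weight of assigning an event of process a to each candidate
    process b: the smoothed estimate of alpha_ba times the Wold factor."""
    K = len(n_b)
    return [(n_ba[b][a] + prior) / (n_b[b] + K * prior) / (beta[b] + gaps[b])
            for b in range(K)]
```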

Speeding Up with a Metropolis Based Sampler: The O(K) factor in the traditional Gibbs sampler may be reduced by exploiting specific data structures, such as the AliasTable Yuan2015 ; Li2014 or the F+Tree Yu2015 . In order to speed up Granger-Busca, we employ the latter (F+Tree). The trade-offs between the two are discussed in Yu2015 . Our choice is motivated by the fact that the F+Tree does not make use of stale samples. Thus, we can perform multinomial sampling at an O(log K) cost. Given that the AliasTable cannot be updated quickly, it is usually only suitable at later iterations Yuan2015 ; Li2014 .

We exploit the F+Tree by changing our sampler to a Metropolis-Hastings (MH) approach. Using the common notation for MH, let q be the proposal probability density function as detailed in Eq (6). Here, b' is a candidate process which may replace the current assignment b. When the target distribution is also Eq (6), i.e., π = q, we can sample the assignment using the acceptance probability min{1, (π(b') q(b)) / (π(b) q(b'))}. In other words, at each iteration we either keep the previous b or replace it with b' according to the acceptance probability. As discussed, with the F+Tree, we can sample from q in O(log K) time. We can also update the tree with the new probabilities after each step at the same cost. Given that the F+Tree has an O(K) cost to build, we build the tree once per process. Finally, as required for the Metropolis-Hastings algorithm, it is trivial to see that the proposal distribution is proportional to the target distribution Hastings1970 .
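The role the F+Tree plays can be illustrated with a plain Fenwick (binary indexed) tree, a simplification of the actual F+Tree of Yu2015: both point updates and cumulative-sum draws cost O(log K).

```python
class FenwickSampler:
    """O(log K) multinomial sampling via a Fenwick tree; an illustrative
    stand-in for the F+Tree discussed above, not the paper's implementation."""

    def __init__(self, weights):
        self.n = len(weights)
        self.weights = [0.0] * self.n
        self.tree = [0.0] * (self.n + 1)
        for i, w in enumerate(weights):
            self.set(i, w)

    def set(self, i, w):
        # point update of weight i: O(log K)
        delta = w - self.weights[i]
        self.weights[i] = w
        j = i + 1
        while j <= self.n:
            self.tree[j] += delta
            j += j & (-j)

    def total(self):
        # sum of all weights: O(log K)
        s, j = 0.0, self.n
        while j > 0:
            s += self.tree[j]
            j -= j & (-j)
        return s

    def sample(self, u):
        # smallest index whose cumulative weight exceeds u, for 0 <= u < total()
        pos, bit = 0, 1
        while bit * 2 <= self.n:
            bit *= 2
        while bit > 0:
            nxt = pos + bit
            if nxt <= self.n and self.tree[nxt] <= u:
                u -= self.tree[nxt]
                pos = nxt
            bit //= 2
        return pos
```

Drawing a label is then `fs.sample(rng.random() * fs.total())`, and adjusting a weight after an accepted move is a single `set` call.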

Parallel Sampler: With the F+Tree, candidates are sampled at an O(log K) cost per event. Moreover, finding the previous stamp costs O(log N). By adding these two costs, the algorithmic complexity of Granger-Busca per iteration is O(N(log N + log K)). Finally, notice that the sampler is easy to parallelize. That is, by iterating over events on a per-process basis, we parallelize the algorithm by scheduling different processes to different CPU cores. Overall, only one vector of variables is shared across processes (the counts n_b in Eq (6)). In our implementation, each core reads this vector for each process before an iteration. After the iteration, the value is then updated globally. Our sampler falls in the case of being Asynchronous with Shared Memory as discussed in Terenin2017 .

Learning different Kernels: Consider the equivalent rewrite of Eq (3) as λ_a(t | H(t)) = μ_a + Σ_b α_{ba} κ_b(t), where κ_b(t) = 1 / (β_b + Δ_{ba}(t)) for Granger-Busca in particular. With this new form, the model acts as a mixture of intensities κ_b(t). Each pairwise intensity is weighted by the causal parameters α_{ba}. Now, notice that using this form our EM algorithm is easily extensible to different functions κ_b(t). The E-step is able to estimate the causal graph (following Eq (6)). The M-step provides maximum likelihood estimates for the specific parameters appearing in κ_b(t). In fact, even unsupervised forms may be learned. As we discuss in the next section, we keep the aforementioned intensity given that it is simpler and outperforms baselines on our datasets.

5 Results and Experiments

We now present our main results. Granger-Busca is compared with three recently proposed baseline methods: ADM4 Zhou2013 , Hawkes-Granger Xu2016 and Hawkes-Cumulants Achab2017 . The code for each method is publicly available via the tick library. Experiments were performed on a dedicated Azure Standard DS15 v2 instance with 20 Intel Xeon CPU E5-2673 v3 cores and 140GB of memory. We point out that comparisons are performed with Hawkes methods given that Hawkes is the most prominent framework; there is no prior Wold method for our task. We compare with approaches that are parametric Zhou2013 and non-parametric Xu2016 ; Achab2017 , and that explore finite Achab2017 and infinite Zhou2013 ; Xu2016 memory.

Hyper-parameters: ADM4 enforces an exponential kernel with a fixed rate. We employ a Tree-structured Parzen Estimator Bergstra2011 to learn such a rate, optimizing for the best model in terms of log-likelihood. For Hawkes-Granger, we fit the model with the number of basis functions suggested in Achab2017 . Finally, Hawkes-Cumulants Achab2017 also has a single hyper-parameter, called the half-width, which was also estimated using Bergstra2011 . Training is performed until convergence or until 300 iterations are reached. Our MCMC sampler also executes for 300 iterations.

Datasets: We evaluate Granger-Busca and the aforementioned three baselines on 9 different datasets, all of which were gathered from the Snap Network Repository. Out of the nine datasets, we pay particular attention to the Memetracker data, which is the only one used to evaluate all past efforts. The Memetracker dataset consists of web-domains linking to one another. We pre-process the Memetracker dataset using the code made available by Achab2017 . The other eight datasets consist of source nodes (e.g., students or blogs) sending events to destination nodes (e.g., messages or citations). Each dataset can be represented as (source, destination, timestamp) triples. The ground-truth network is defined as the graph whose vertices are both the sources and the destinations, and whose edges represent the relationship between two entities. The weighted adjacency matrix of this graph is our ground-truth: each of its cells counts the events observed between a pair of nodes. To compose each process from these datasets, each destination node represents a process. In compliance with our notation, we call such processes P_a. Notice that we do not consider source nodes as processes. Thus, the models will aim to extract causal graphs based on the likelihood that the reception of messages at a destination will trigger incoming messages elsewhere. If this is the case, we have evidence that the destination is also a source node. Finally, we pre-process the data to only consider destinations that have also sent messages.
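The ground-truth construction described above can be sketched as counting events per (source, destination) pair; the exact triple layout is assumed from the text:

```python
from collections import Counter

def ground_truth(triples):
    """Weighted ground-truth edges: entry (u, v) counts the events
    sent from source u to destination v."""
    return Counter((src, dst) for src, dst, _t in triples)
```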

# Proc (K) # Obs. (N) N (Top-100) Span %NZ P@5 P@10 P@20 TT(s)
bitcoinalpha Kumar2016 3,257 23,399 2,279 5Y 0.2% 0.26 0.14 0.07 3
bitcoinotc Kumar2016 4,791 33,766 2,328 5Y 0.1% 0.25 0.14 0.07 7
college-msg Panzarasa2009 1,313 58,486 10,869 193D 1.1% 0.36 0.30 0.19 1
email Leskovec2007 ; Yin2017 803 327,677 92,924 803D 3.74% 0.23 0.28 0.32 4
sx-askubuntu Paranjape2017 88,549 879,121 58,142 7Y 0.006% 0.25 0.13 0.06 2774
sx-mathoverflow Paranjape2017 16,936 488,984 59,602 7Y 0.07% 0.28 0.16 0.09 98
sx-superuser Paranjape2017 114,623 1,360,974 64,866 7Y 0.006% 0.26 0.14 0.07 4614
wikitalk Leskovec2010b ; Paranjape2017 251,154 7,833,140 211,344 6Y 0.003% 0.25 0.14 0.07 27540
memetracker-100 Leskovec2009 100 24,665,418 - 9M 9.85% 0.30 0.29 0.22 114
memetracker-500 Leskovec2009 500 39,318,989 - 9M 4.44% 0.30 0.30 0.23 274
Table 1: Datasets used for Experiments and Precision Scores for Full Datasets. Due to their sizes, only Granger-Busca is able to execute in all datasets. To allow comparisons, we execute baselines methods with only the Top-100 destination nodes. Other results are presented in Table 2 and Figure 2.

Metrics: We evaluate Granger-Busca and its competitors using the Precision@n, Kendall Correlation and Relative Error scores. Each score is measured per node (or row of the Granger matrix), and is summarized for the entire dataset as the average over every node. Both the Kendall Correlation and the Relative Error were proposed as evaluation metrics for networked point processes in Achab2017 . Precision@n captures the average fraction of the top n edges of the estimated matrix, ordered by weight, which are also present in the ground truth. As we shall discuss, there are several limitations with the Kendall and Relative Error scores due to graph sparsity. We argue that Precision@n measured at different cut-offs (n) is a more reliable evaluation metric for large and sparse graphs, such as the ones we explore here.
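Per row, Precision@n as described can be computed as follows (breaking ties in the estimated weights by index is our implementation choice):

```python
def precision_at_n(estimated_row, truth_row, n):
    """Fraction of the n heaviest estimated edges that exist (non-zero)
    in the ground-truth row."""
    top = sorted(range(len(estimated_row)),
                 key=lambda j: -estimated_row[j])[:n]
    return sum(1 for j in top if truth_row[j] > 0) / n
```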

Results: Table 1 describes the datasets used in this work presenting their number of nodes and of observations. Most baselines do not execute on large datasets in under 24h of training time (TT), in contrast with Granger-Busca. Given the asymptotic complexity, we estimate that some models may take months to execute. Hence, to allow comparisons with Granger-Busca, we considered subsamples containing only the events involving the Top-100 destinations. We pay particular attention to Top-100 and 500 nodes for Memetracker, given that it was explored in prior efforts Zhou2013 ; Xu2016 ; Achab2017 .

Table 1 also presents the Precision@n scores for Granger-Busca. Recall that ours is the only approach able to execute on the full sets of data. Firstly, notice that the Kendall and Relative Error scores are absent from Table 1. Given that the datasets are sparse – as shown by the fraction of non-zeros in the ground truth, or %NZ, in Table 1 – the Kendall Correlations and Relative Errors are unreliable metrics for large networks. It is well known that complex networks such as ours have long-tailed distributions for the edge-weights Barabasi2016 , leading to the high sparsity levels (%NZ) seen in Table 1. With most cells being zeros, Kendall Scores also tend to zero, as most sources connect to few destinations. Similarly, the relative errors will likely increase. In order to avoid divisions by zero, previous efforts impose a constant penalization, of one, when zero edges exist between two nodes. Similar to the Kendall Correlation, this penalization will also dominate the score due to the sparsity.

           Precision@5   Precision@10  Precision@20  Kendall      Rel. Error   TT(s)
           HC    GB      HC    GB      HC    GB      HC    GB     HC    GB     HC    GB
top-100    0.06  0.30    0.09  0.29    0.01  0.22    0.05  0.26   1.0   0.44   87    114
top-500    0.01  0.30    0.01  0.30    0.02  0.23    0.08  0.20   1.8   0.06   715   274
Table 2: Comparing Granger-Busca (GB) with Hawkes-Cumulants (HC) on Memetracker.

Because of the above limitations of prior efforts and metrics, we are unable to present a fair comparison with competitors on the full datasets. To achieve this goal, in Table 2 we present the overall scores for Granger-Busca and the Hawkes-Cumulants (HC) Achab2017 approach, focusing only on the Memetracker data. In this setting, Hawkes-Cumulants has already been shown to be more accurate and faster than ADM4 Zhou2013 and Granger-Hawkes (GH) Xu2016 . Observe that Granger-Busca is more accurate than the competing method in every score. Indeed, Precision@n scores are at least three times greater, depending on the cut-off (Precision@5, 10 and 20). Kendall Scores also show a substantial gain (250%), with Granger-Busca achieving 0.20 and HC achieving 0.08 correlation on average. Relative errors for Granger-Busca are also 56% lower than those of HC (0.44 versus 1.0). Finally, notice how Granger-Busca is slightly slower than HC when fewer nodes are present (100), but significantly surpasses HC in speed as the number of nodes increases. This comes from the cubic cost incurred by HC.

Figure 2: Precision Scores for the Top-100 datasets.

To present an overall view of Granger-Busca against all three competing methods (ADM4, HC and GH), in Figure 2 we show Precision@5, 10 and 20 scores for each approach on every Top-100 dataset. A total of 26 (out of 27 possible) points are plotted in the figure. One single setting, HC with Memetracker, is absent, since the model was unable to finish training in the allotted time. The x-axis of the figure presents the Precision@n score for the baseline. Similarly, the y-axis shows the Precision@n score for Granger-Busca. We also show three regions where either Granger-Busca or the competitors perform worse than a Null model, created by performing uniformly random rankings. Notice that for Precision@5 and Precision@10, Granger-Busca outperforms most baselines on a large fraction of the datasets. In fact, for Precision@5, there is only one setting where ADM4 outperforms Granger-Busca. As the precision cut-off increases, so does the efficacy of the Null model (i.e., it is easier for a random ranking to recover top edges). For Precision@20, there do exist some cases where Granger-Busca is outperformed by baseline methods. However, the majority of these cases lie in the region where both models are below the efficacy of a Null model.

Why does the model work? Recall that a Wold process is an adequate model when there is a strong dependency between consecutive inter-event times. To explain our results, we measured the correlation between consecutive inter-event times for pairs of interacting processes from the ground-truth data. Out of nine datasets, the worst case had a median Pearson correlation per pair equal to 0.3, a moderate value. However, in the remaining eight datasets this median was above 0.70, indicating the adequacy of a Wold model. This high linear dependency implies that the expected next inter-event time is, approximately, a linear function of the previous one, justifying our choice of intensity (see Section 4 for a discussion on how to learn other forms).
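The lag-1 dependency check described above can be sketched as follows (a hypothetical helper; the reported correlations were computed per pair of interacting processes):

```python
import numpy as np

def lag1_corr(timestamps):
    """Pearson correlation between consecutive inter-event times
    delta_i and delta_{i+1} of a single point process."""
    deltas = np.diff(np.sort(np.asarray(timestamps, dtype=float)))
    if len(deltas) < 3:
        return float('nan')                  # not enough inter-events
    return float(np.corrcoef(deltas[:-1], deltas[1:])[0, 1])
```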

It is also important to understand the limitations of non-parametric methods such as HC. Recall that HC relies on the statistical moments (e.g., mean and variance) of the inter-event times Achab2017 . Given that web datasets exhibit long tails (with theoretical distributions exhibiting high, or even infinite, variance), such moments will not accurately capture the underlying behavior of the data (see Section 2).

6 Conclusions and Future Work

In this work, we present the first method to extract Granger causality matrices via Wold processes. Though proposed over sixty years ago, this framework of point processes remains largely unexplored by the machine learning community. Our approach, called Granger-Busca, outperforms recent baseline methods Achab2017 ; Xu2016 ; Zhou2013 both in training time and in overall accuracy. Granger-Busca works particularly well when extracting the top connections of a node (Precision@5, 10, 20).

Given the efficacy of Granger-Busca, our hope is that current results may open up a new class of models, Wold processes, to be explored by the machine learning community. Finally, Granger-Busca can be used to explore real world behavior in complex large scale datasets.


We thank Fabricio Murai and the anonymous reviewers for providing comments. We also thank Gabriel Coutinho for discussions on the mathematical properties of Granger-Busca, as well as Alexandre Souza for providing pointers to prior studies. This work has been partially supported by the project ATMOSPHERE, funded by the Brazilian Ministry of Science, Technology and Innovation (Project 51119 - MCTI/RNP 4th Coordinated Call) and by the European Commission under the Cooperation Programme, Horizon 2020 grant agreement no 777154. Funding was also provided by the authors' individual grants from CNPq, CAPES and Fapemig. Computational resources were provided by the Microsoft Azure for Data Science Research Award (CRM:0740801).


  • [1] M. Achab, E. Bacry, S. Gaïffas, I. Mastromatteo, and J.-F. Muzy. Uncovering causality from multivariate hawkes integrated cumulants. In ICML, 2017.
  • [2] R. Alves, R. Assunção, and P. O. S. Vaz de Melo. Burstiness scale: A parsimonious model for characterizing random series of events. In KDD, 2016.
  • [3] R. H. Arpaci-Dusseau and A. C. Arpaci-Dusseau. Operating Systems: Three Easy Pieces. Arpaci-Dusseau Books, 0.91 edition, 2015.
  • [4] E. Bacry, I. Mastromatteo, and J.-F. Muzy. Hawkes processes in finance. Market Microstructure and Liquidity, 1(01), 2015.
  • [5] Y. Bao, Z. Kuang, P. Peissig, D. Page, and R. Willett. Hawkes process modeling of adverse drug reactions with longitudinal observational data. In ML for HC, 2017.
  • [6] A.-L. Barabási and M. Pósfai. Network science. Cambridge University Press, Cambridge, 2016.
  • [7] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In NIPS, 2011.
  • [8] C. Blundell, J. Beck, and K. A. Heller. Modeling reciprocating relationships with hawkes processes. In NIPS, 2012.
  • [9] S. Chen, A. Shojaie, E. Shea-Brown, and D. Witten. The multivariate hawkes process in high dimensions: Beyond mutual excitation. arXiv preprint arXiv:1707.04928, 2017.
  • [10] J. Chevallier, M. J. Cáceres, M. Doumic, and P. Reynaud-Bouret. Microscopic approach of a time elapsed neural model. Mathematical Models and Methods in Applied Sciences, 25(14), 2015.
  • [11] E. Choi, N. Du, R. Chen, L. Song, and J. Sun. Constructing disease network and temporal progression model via context-sensitive hawkes process. In ICDM, 2015.
  • [12] D. R. Cox. Some statistical methods connected with series of events. Journal of the Royal Statistical Society. Series B (Methodological), 1955.
  • [13] D. Daley. Stationary point processes by markov-dependent intervals and infinite intensity. Journal of Applied Probability, 19(A), 1982.
  • [14] D. J. Daley and D. Vere-Jones. An introduction to the theory of point processes, vol. 1. Springer, New York, 2003.
  • [15] M. Dhamala, G. Rangarajan, and M. Ding. Estimating granger causality from fourier and wavelet transforms of time series data. Physical review letters, 100(1), 2008.
  • [16] V. Didelez. Graphical models for marked point processes based on local independence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1), 2008.
  • [17] N. Du, M. Farajtabar, A. Ahmed, A. J. Smola, and L. Song. Dirichlet-hawkes processes with applications to clustering continuous-time document streams. In KDD, 2015.
  • [18] J. Etesami, N. Kiyavash, K. Zhang, and K. Singhal. Learning network of multivariate hawkes processes: A time series approach. In UAI, 2016.
  • [19] P. M. Fenwick. A new data structure for cumulative frequency tables. Software: Practice and Experience, 24(3), 1994.
  • [20] C. W. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, 1969.
  • [21] P. Guttorp and T. L. Thorarinsdottir. What happened to discrete chaos, the quenouille process, and the sharp markov property? some history of stochastic point processes. International Statistical Review, 80(2), 2012.
  • [22] D. Hallac, Y. Park, S. Boyd, and J. Leskovec. Network inference via the time-varying graphical lasso. In KDD, 2017.
  • [23] W. K. Hastings. Monte carlo sampling methods using markov chains and their applications. Biometrika, 57(1):97–109, 1970.
  • [24] A. G. Hawkes. Point spectra of some mutually exciting point processes. Journal of the Royal Statistical Society. Series B (Methodological), 1971.
  • [25] A. G. Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1), 1971.
  • [26] R. A. Horn and C. R. Johnson. Matrix analysis. Cambridge university press, 1990.
  • [27] V. Isham. A markov construction for a multidimensional point process. Journal of Applied Probability, 14(3), 1977.
  • [28] S. Kumar, F. Spezzano, V. Subrahmanian, and C. Faloutsos. Edge weight prediction in weighted signed networks. In ICDM, 2016.
  • [29] J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the dynamics of the news cycle. In KDD, 2009.
  • [30] J. Leskovec, D. P. Huttenlocher, and J. M. Kleinberg. Governance in social media: A case study of the wikipedia promotion process. In ICWSM, 2010.
  • [31] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1), 2007.
  • [32] A. Q. Li, A. Ahmed, S. Ravi, and A. J. Smola. Reducing the sampling complexity of topic models. In KDD, 2014.
  • [33] S. Li, X. Gao, W. Bao, and G. Chen. Fm-hawkes: A hawkes process based approach for modeling online activity correlations. In CIKM, 2017.
  • [34] S. Linderman and R. Adams. Discovering latent network structure in point process data. In ICML, 2014.
  • [35] S. Linderman and R. Adams. Scalable bayesian inference for excitatory point process networks. arXiv preprint arXiv:1507.03228, 2015.
  • [36] R. P. Monti, P. Hellyer, D. Sharp, R. Leech, C. Anagnostopoulos, and G. Montana. Estimating time-varying brain connectivity networks from functional mri time series. NeuroImage, 103, 2014.
  • [37] A. Namaki, A. Shirazi, R. Raei, and G. Jafari. Network analysis of a financial market based on genuine correlation and threshold method. Physica A: Statistical Mechanics and its Applications, 390(21), 2011.
  • [38] Y. Ogata. On lewis’ simulation method for point processes. IEEE Transactions on Information Theory, 27(1):23–31, 1981.
  • [39] P. Panzarasa, T. Opsahl, and K. M. Carley. Patterns and dynamics of users’ behavior and interaction: Network analysis of an online community. Journal of the Association for Information Science and Technology, 60(5), 2009.
  • [40] A. Paranjape, A. R. Benson, and J. Leskovec. Motifs in temporal networks. In WSDM, 2017.
  • [41] M.-A. Rizoiu, Y. Lee, S. Mishra, and L. Xie. Hawkes processes for events in social media. In S.-F. Chang, editor, Frontiers of Multimedia Research. ACM, 2018.
  • [42] J. Silva and R. Willett. Hypergraph-based anomaly detection of high-dimensional co-occurrences. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(3), 2009.
  • [43] M. Steyvers and T. Griffiths. Probabilistic topic models. Handbook of latent semantic analysis, 427(7), 2007.
  • [44] A. Terenin and E. P. Xing. Techniques for proving asynchronous convergence results for markov chain monte carlo methods. In NIPS, 2017.
  • [45] P. O. S. Vaz de Melo, C. Faloutsos, R. Assunção, R. Alves, and A. A. Loureiro. Universal and distinct properties of communication dynamics: how to generate realistic inter-event times. ACM Transactions on Knowledge Discovery from Data (TKDD), 9(3), 2015.
  • [46] P. O. S. Vaz de Melo, C. Faloutsos, R. Assunção, and A. Loureiro. The self-feeding process: A unifying model for communication dynamics in the web. In WWW, 2013.
  • [47] H. Wold. On stationary point processes and markov chains. Scandinavian Actuarial Journal, 1948(1-2), 1948.
  • [48] H. Xu, M. Farajtabar, and H. Zha. Learning granger causality for hawkes processes. In ICML, 2016.
  • [49] Y. Yang, J. Etesami, N. He, and N. Kiyavash. Online learning for multivariate hawkes processes. In NIPS, 2017.
  • [50] H. Yin, A. R. Benson, J. Leskovec, and D. F. Gleich. Local higher-order graph clustering. In KDD, 2017.
  • [51] H.-F. Yu, C.-J. Hsieh, H. Yun, S. Vishwanathan, and I. S. Dhillon. A scalable asynchronous distributed algorithm for topic modeling. In WWW, 2015.
  • [52] J. Yuan, F. Gao, Q. Ho, W. Dai, J. Wei, X. Zheng, E. P. Xing, T.-Y. Liu, and W.-Y. Ma. Lightlda: Big topic models on modest computer clusters. In WWW, 2015.
  • [53] K. Zhou, H. Zha, and L. Song. Learning social infectivity in sparse low-rank networks using multi-dimensional hawkes processes. In AISTATS, 2013.

Appendix A Simulating Granger-Busca

  Input: max time , num proc , , ,
  Output: observations
  while  do
   {Compute the rate for each process using , , . The rate depends on the previous timestamps needed to compute it (see Eq (3)). If such timestamps do not exist, fall back to a Poisson process, that is: }
     for  to  do
     end for
     {Move forward in time. That is, sample a new observation with rate }
     {Sample a process to such that }
     while  do
        if  then
        end if
     end while
  end while
Algorithm 1 Ogata’s Thinning Algorithm Adapted for Granger-Busca

In Algorithm 1 we show how Ogata’s Modified Thinning algorithm [38] is adapted for Granger-Busca. We initially point out that some care has to be taken with the first simulated timestamps. Given that the previous observation does not exist initially, the algorithm needs to either start with a synthetic initial increment or fall back to the Poisson rate. In the algorithm, the rate of each individual process is computed. Then, a new observation is generated based on the sum of such rates. Given that each process behaves like a Poisson process while no new event surfaces (Figure 1), the sum of these processes is also a Poisson process. Lastly, we employ multinomial sampling to determine the process to which the observation belongs.
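A minimal sketch of this adapted thinning loop is below. The rate function here is a stand-in for Eq. (3): we assume a Busca-style excitation term alpha/(beta + delta), where delta is the last inter-event time of the influencing process, and processes with fewer than two events fall back to their exogenous Poisson rate, as discussed above. All names are illustrative:

```python
import random

def simulate(mu, alpha, beta, t_max, seed=0):
    """Ogata-style simulation sketch for a Granger-Busca-like model.
    Assumed rate: lam[k] = mu[k] + sum_b alpha[b][k] / (beta + delta_b)."""
    rng = random.Random(seed)
    K = len(mu)
    events = [[] for _ in range(K)]
    t = 0.0
    while True:
        # Rate of each process; Poisson fallback while delta_b is undefined.
        lam = []
        for k in range(K):
            rate = mu[k]
            for b in range(K):
                if len(events[b]) >= 2:
                    delta_b = events[b][-1] - events[b][-2]
                    rate += alpha[b][k] / (beta + delta_b)
            lam.append(rate)
        total = sum(lam)
        # The superposition is Poisson between events: sample the next time.
        t += rng.expovariate(total)
        if t > t_max:
            break
        # Multinomial choice of the process generating the event.
        u, acc = rng.random() * total, 0.0
        for k, lk in enumerate(lam):
            acc += lk
            if u <= acc:
                break
        events[k].append(t)
    return events
```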

Appendix B Log Likelihood

We now derive the log likelihood of Granger-Busca for parameters . For a point process with intensity , the likelihood can be computed as [14]:


Considering the intensity of each process separately, we can write the log likelihood as:


Here, is the last event from . The expansion of the integral comes from the stepwise nature of (see Figure 1). For simplicity, let us initially assume that there is only one process. As discussed in the paper, computing has a cost. Due to summations of the form, , the cost to evaluate is . is the cost to evaluate for every observation.
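For intuition, the stepwise expansion of the integral can be sketched for a single process with piecewise-constant intensity; the compensator becomes a sum of rectangles. This is a generic helper, not the exact Granger-Busca likelihood:

```python
import math

def loglik_step_intensity(events, rates, t_end):
    """Log-likelihood of a point process on (0, t_end] whose intensity is
    piecewise constant: rates[i] is the intensity on (t_i, t_{i+1}], with
    t_0 = 0 and t_{n+1} = t_end.  log L = sum_i log(rate at event t_i)
    minus the integral of the intensity (a sum of rectangles)."""
    n = len(events)
    assert len(rates) == n + 1
    # The rate in effect at event t_{i+1} is rates[i].
    ll = sum(math.log(rates[i]) for i in range(n))
    bounds = [0.0] + list(events) + [t_end]
    ll -= sum(rates[i] * (bounds[i + 1] - bounds[i]) for i in range(n + 1))
    return ll
```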

Now, let us return to the case of multiple processes. Let N_k be the number of events for process k, and let N_max = max_k N_k be the number of events of the process with the most events. Naively, the cost of evaluating the log likelihood follows from summations over every pair of processes. To simplify the comparison with past methods, in our manuscript we did not detail our runtime cost in terms of N_max. Strictly speaking, our fitting algorithm with the MCMC sampler performs at a cost of O(N(log(N_max) + log(K))) per iteration.

Appendix C Fitting Algorithm

  Input: all observations , prior , num. iter
  Output: ,
  {Sample initial state from a random uniform . The value is reserved to indicate exogenous events. returns .}
  for  to  do
     for  to  do
     end for
  end for
  {Sample hidden labels}
  for  to  do
     for  to  do
     end for
  end for
  {Compute Output. }
  return ,
Algorithm 2 Sampling Granger-Busca
  Input: observations , current state , prior , num proc. , exogenous rate
  {The tree is populated with the probability that each process can cause . i.e., }
  for  to  do
     if not  then
     end if
     if not  then
        {See Eq (6) for the proposal and target }
        if  then
        end if
     end if
  end for
Algorithm 3 Expectation Step ()

The algorithm is shown in Algorithm 2, with the E-step detailed in Algorithm 3. The maximization step for Granger-Busca, in particular, is an MLE estimate for a Poisson process. The pseudo-code shown here is not parallel and builds the F+Tree naively. By updating the counts using a sloppy counter (see Chapter 11 of [3]) across processing cores, one only needs to iterate over the observations to compute the estimates. The counter consists of a local count for each processor. After a certain number of steps, the local counts are synced with a master parameter server.
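As a reference for the F+Tree idea, a plain Fenwick (binary indexed) tree [19] already supports O(log K) weight updates and O(log K) sampling proportional to dynamically changing weights. The sketch below is a stand-in illustration, not the exact F+Tree of our implementation:

```python
class FenwickSampler:
    """Fenwick tree over nonnegative weights: O(log K) update,
    O(log K) sampling proportional to weight (1-indexed)."""
    def __init__(self, k):
        self.n = k
        self.tree = [0.0] * (k + 1)

    def add(self, i, delta):
        """weights[i] += delta."""
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)

    def prefix(self, i):
        """Sum of weights[1..i]."""
        s = 0.0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

    def sample(self, u):
        """Smallest index i with prefix(i) >= u * total, for u in (0, 1)."""
        target = u * self.prefix(self.n)
        pos, step = 0, 1
        while step * 2 <= self.n:
            step *= 2
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] < target:
                target -= self.tree[nxt]
                pos = nxt
            step //= 2
        return pos + 1
```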

The runtime of the algorithm may be optimized by either pre-computing or caching the previous timestamps for every observation of every process. Nevertheless, this pre-computation comes at a memory cost that is likely prohibitive for larger datasets. We can, however, cache a small subset of such values to amortize the cost for cache hits. Secondly, the sampling cost can also be amortized with an AliasTable. With these two optimizations, it is possible to implement optimized versions of the sampling algorithm that execute at a lower amortized cost per iteration.
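The AliasTable amortization mentioned above refers to Walker's alias method, which trades O(K) construction time for O(1) draws from a fixed discrete distribution; a sketch:

```python
import random

def build_alias(weights):
    """Walker's alias method: O(K) construction, O(1) per draw."""
    k = len(weights)
    total = sum(weights)
    prob = [w * k / total for w in weights]   # scaled so the mean is 1
    alias = [0] * k
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                           # s's leftover mass goes to l
        prob[l] -= 1.0 - prob[s]
        (small if prob[l] < 1.0 else large).append(l)
    for i in small + large:                    # fix floating-point leftovers
        prob[i] = 1.0
    return prob, alias

def draw(prob, alias, rng):
    """O(1) sample: pick a column, then flip a biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```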