Efficient Non-parametric Bayesian Hawkes Processes

10/08/2018 ∙ by Rui Zhang, et al.

In this paper, we develop a non-parametric Bayesian estimation procedure for the Hawkes process triggering kernel. Our method is based on the cluster representation of Hawkes processes: we sample random branching structures, and thus split the Hawkes process into clusters of Poisson processes, where the intensity function of each of these processes is the non-parametric triggering kernel of the Hawkes process. We derive both a block Gibbs sampler and a maximum a posteriori estimator based on stochastic expectation maximization. On synthetic data, we show our method to be flexible and scalable, and on two large-scale Twitter diffusion datasets, we show it to outperform the parametric Hawkes model. We observe that the learned non-parametric kernel reflects the longevity of different content types.


1 Introduction

(a) Poisson Cluster Process; (b) Branching Structure.
Figure 3: Cluster representation of the Hawkes process. (a) A Hawkes process with a decaying triggering kernel φ has an intensity λ(t) which increases after each new point t_i is generated. It can be represented as a cluster of Poisson processes PP(φ(t − t_i)) associated with each t_i. For the triggering relationships shown in (a), a branching structure can be represented as (b), where an arrow t_i → t_j represents t_i triggering t_j, and p_ji is the probability of t_j being generated by t_i.

The Hawkes process (Hawkes, 1971) is a useful model for self-exciting point data, in which every point increases the likelihood of arrival of new points. More specifically, every point causes the conditional intensity function, which modulates the arrival rate of new points, to increase. An alternative understanding of the Hawkes process is the Poisson cluster process (Hawkes and Oakes, 1974), which categorizes points into immigrants and offspring. Immigrant points are generated independently at a rate μ, while offspring points are triggered by each existing point t_i at a rate φ(t − t_i). Points are thus structured into clusters, each associated with an immigrant point; this cluster assignment is called the branching structure (an example is shown in Fig. 3).
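To make the cluster representation concrete, the following is a minimal simulation sketch (Python/NumPy); it is not the paper's code, and the exponential kernel, its parameters and the function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_hawkes_cluster(mu, offspring_sampler, branching_factor, T):
    """Simulate a Hawkes process on [0, T] via the Poisson cluster
    representation: immigrants arrive as PP(mu); every point then
    spawns Poisson(branching_factor) offspring whose time offsets are
    drawn from the normalized triggering kernel."""
    n_immigrants = rng.poisson(mu * T)
    queue = list(rng.uniform(0.0, T, size=n_immigrants))
    points = []
    while queue:
        t = queue.pop()
        points.append(t)
        for _ in range(rng.poisson(branching_factor)):
            child = t + offspring_sampler()
            if child < T:
                queue.append(child)
    return np.sort(points)

# Exponential kernel phi(t) = a * b * exp(-b * t): branching factor a,
# offsets distributed Exponential(b).
a, b = 0.8, 2.0
events = simulate_hawkes_cluster(
    mu=1.0, offspring_sampler=lambda: rng.exponential(1.0 / b),
    branching_factor=a, T=100.0)
print(f"{len(events)} points")
```

Each immigrant starts its own cluster; recursing over offspring-of-offspring reproduces the branching structure of Fig. 3.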

Background & Motivations

The Hawkes process has been applied to a wide range of problems. Mishra et al. (2016) employ the branching factor of the Hawkes process to predict the popularity of tweets; Kurashima et al. (2018) predict human actions using the Hawkes process; information reliability and information source trustworthiness of online knowledge repositories have been evaluated with the Hawkes process (Tabibian et al., 2017). Here, we address two difficulties of leveraging Hawkes processes in such applications: (1) the aforementioned works employ manually designed parametric Hawkes models which may not generalize well, and (2) in particular cases, the shape of the Hawkes kernel might not be easy to deduce in advance.

A typical solution is the non-parametric estimation of the intensity function (Lewis and Mohler, 2011; Zhou et al., 2013; Eichler et al., 2017; Bacry et al., 2012; Bacry and Muzy, 2014; Kirchner, 2017). However, these are all frequentist methods, which do not quantify the uncertainty of the learned kernels. There exist several works (Blundell et al., 2012; Rasmussen, 2013; Rousseau et al., 2018) on Bayesian inference for the Hawkes process; the main difficulty in developing a non-parametric Bayesian method is the quadratic time taken to compute the log-likelihood of the Hawkes process. The related work closest to this paper is the method of Rousseau et al. (2018), who employ Markov chain Monte Carlo (MCMC) to estimate the posterior of the Hawkes model. The main difference between this approach and ours is that our work uses the stochastic branching structure to accelerate the estimation of the posterior Hawkes process.

Contributions

In this paper, we propose a general framework for the efficient non-parametric Bayesian inference of Hawkes processes. We exploit block Gibbs sampling (Ishwaran and James, 2001) to iteratively sample the latent branching structure, the background intensity μ and the triggering kernel φ. In each iteration, the point data are decomposed into a cluster of Poisson processes based on the sampled branching structure. This is exemplified in Fig. 3, in which a Hawkes process (the top temporal axis) is decomposed into several Poisson processes (the following temporal axes), and the branching structure is inferred. The posteriors of μ and φ are estimated from the resulting clustered processes. Our framework is close to the stochastic Expectation-Maximization (EM) algorithm (Celeux and Diebolt, 1985), in which the posteriors of μ and φ are estimated in the M-step and random samples of μ and φ are drawn in place of the E-step. To estimate the posterior of φ given the sampled branching structure, we adapt a recent approach to the non-parametric Bayesian estimation of Poisson process intensities (Adams et al., 2009; Lloyd et al., 2015; Samo and Roberts, 2015; Walder and Bishop, 2017; Gugushvili et al., 2018), namely the Laplace Bayesian Poisson process (LBPP) (Walder and Bishop, 2017). Furthermore, we explore the connection with the EM algorithm (Dempster et al., 1977) and develop a variant of our method as an approximate EM algorithm. On synthetic data, we show that our method can infer both exponential and cosine triggering kernels, and that its fitting time is linear in the number of points. On two large social media datasets, our method outperforms the parametric Hawkes process by as much as 22% in holdout log-likelihood.

2 Preliminaries

In this section, we introduce prerequisites: the Hawkes process and LBPP.

Figure 4: Triggering kernels estimated by the Gibbs-Hawkes method (Section 3) and the EM-Hawkes method (Section 4). The true kernel is plotted as the bold gray curve. We plot the median (red) and the [0.1, 0.9] interval (filled red) of the approximate predictive distribution, along with the triggering kernel inferred by the EM-Hawkes method (blue). The hyper-parameters a and b of the Gaussian process kernel (Eq. 17) are set to 0.002.

2.1 Hawkes Process

As introduced in Section 1, the Hawkes process can be specified by the conditional intensity function λ(t), which modulates the arrival rate of points. Mathematically, conditioned on a set of points 𝒯 = {t_1, …, t_N}, the intensity is expressed as (Hawkes, 1971)

    λ(t) = μ + ∑_{t_i < t} φ(t − t_i),   (1)

where μ and φ are the background intensity and the triggering kernel, respectively. The log-likelihood of 𝒯 given the background intensity μ and the triggering kernel φ is (Rubin, 1972)

    log p(𝒯 | μ, φ) = ∑_{t_i ∈ 𝒯} log λ(t_i) − ∫_S λ(t) dt,   (2)

where S is the sampling domain of 𝒯 with measure |S|.
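As an illustration of Eq. 2, the following sketch evaluates the Hawkes log-likelihood naively in O(N²) time; the exponential kernel and all names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def hawkes_loglik(times, mu, phi, Phi, T):
    """Naive O(N^2) evaluation of Eq. 2 for a constant background mu.
    `phi` is the triggering kernel and `Phi(t)` its antiderivative
    int_0^t phi(s) ds; `times` must be sorted ascending."""
    times = np.asarray(times)
    loglik = 0.0
    for i, t in enumerate(times):
        lam = mu + phi(t - times[:i]).sum()  # Eq. 1: intensity at t_i
        loglik += np.log(lam)
    loglik -= mu * T + Phi(T - times).sum()  # compensator: integral of lambda
    return loglik

# Example with an exponential kernel phi(t) = a * b * exp(-b * t).
a, b = 0.8, 2.0
phi = lambda dt: a * b * np.exp(-b * dt)
Phi = lambda t: a * (1.0 - np.exp(-b * t))
print(hawkes_loglik([0.5, 1.2, 1.3, 4.0], mu=1.0, phi=phi, Phi=Phi, T=5.0))
```

The quadratic cost of this direct evaluation is precisely the bottleneck that the branching-structure decomposition of Section 3 avoids.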

2.2 LBPP

LBPP (Walder and Bishop, 2017) has been proposed for the non-parametric Bayesian estimation of the intensity of a Poisson process. To satisfy the non-negativity of the intensity function, LBPP models the point process intensity as a permanental process (Shirai and Takahashi, 2003), i.e., λ(x) = f(x)², where the link function is g(f) = f² and f is Gaussian process (GP) distributed. Alternative link functions for non-negative intensities include exp(f) (Rathbun and Cressie, 1994; Møller et al., 1998; Illian et al., 2012; Diggle et al., 2013) and λ*σ(f) (Adams et al., 2009), where λ* is constant and σ is a sigmoid. The choice g(f) = f² has analytical and computational advantages (Lloyd et al., 2016; Flaxman et al., 2017; Walder and Bishop, 2017; Lloyd et al., 2015) which lead to a linear-time algorithm for LBPP.

LBPP exploits the Mercer expansion (Mercer, 1909) of the GP covariance function k(·, ·),

    k(x, y) = ∑_k λ_k e_k(x) e_k(y),   (3)

where for LBPP the eigenfunctions e_k are chosen to be orthonormal in L²(S) for some sample space S with its measure. The function f can be represented as a linear combination of the e_k (Rasmussen and Williams, 2005),

    f(x) = w^⊤ ẽ(x) = ∑_k w_k e_k(x),   (4)

and w has a Gaussian prior, i.e., w ∼ N(0, Λ), where Λ = diag(λ_1, λ_2, …) is a diagonal covariance matrix and ẽ(x) = (e_1(x), e_2(x), …)^⊤ is a vector of basis functions. The posterior intensity at some x, p(λ(x) | X), requires estimating the posterior p(w | X), which is approximated by a normal distribution (a.k.a. the Laplace approximation) (Rasmussen and Williams, 2005):

    p(w | X) ≈ N(w; ŵ, Q),   (5)

where X is a set of point data on the sampling space S. ŵ is selected as the mode of the true posterior, and Q as the negative inverse Hessian of the true log-posterior at ŵ:

    ŵ = argmax_w log p(w | X),   (6)
    Q = −( ∇∇_w log p(w | X) |_{w=ŵ} )^{−1}.   (7)

The approximate posterior distribution of f(x) = w^⊤ ẽ(x) is then a normal distribution (Rasmussen and Williams, 2005):

    f(x) | X ∼ N( ŵ^⊤ ẽ(x), ẽ(x)^⊤ Q ẽ(x) ).   (8)

Furthermore, the posterior distribution of λ(x) = f(x)² is a Gamma distribution (Walder and Bishop, 2017):

    λ(x) | X ∼ Gamma( α(x), β(x) ),   (9)

where the shape α(x) and the rate β(x) follow from the mean and variance of f(x) | X in Eq. 8.
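One natural way to obtain such Gamma parameters is by moment matching: if f(x) | X ∼ N(m, s²) as in Eq. 8, then E[f(x)²] = m² + s² and Var[f(x)²] = 2s⁴ + 4m²s². The sketch below implements this matching; note this is our reading of the construction, and the exact parameterization in Walder and Bishop (2017) may differ.

```python
import numpy as np

def gamma_from_normal(m, s2):
    """Moment-match a Gamma(shape, rate) to lambda = f^2 with f ~ N(m, s2):
    mean = m^2 + s2, var = 2*s2^2 + 4*m^2*s2, then shape = mean^2 / var
    and rate = mean / var."""
    m, s2 = np.asarray(m, dtype=float), np.asarray(s2, dtype=float)
    mean = m**2 + s2
    var = 2.0 * s2**2 + 4.0 * m**2 * s2
    return mean**2 / var, mean / var

shape, rate = gamma_from_normal(m=1.5, s2=0.25)
print(shape, rate, shape / rate)  # the last value is the posterior mean of lambda
```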

3 Inference via Sampling

We now detail our efficient non-parametric Bayesian estimation algorithm, which samples the posterior of μ (the constant background intensity) and φ (the triggering kernel). Our method starts with a random μ^(0) and φ^(0) and iterates by cycling through the following four steps (s is the iteration index):

  1. Calculate p(B | 𝒯, φ^(s), μ^(s)), the distribution of the branching structure B given the data, the triggering kernel and the background intensity (see Section 3.1).

  2. Sample a branching structure B^(s) using p(B | 𝒯, φ^(s), μ^(s)) (see Section 3.1).

  3. Estimate p(φ | 𝒯, B^(s)) (Section 3.3) and p(μ | 𝒯, B^(s)) (Section 3.2).

  4. Sample a μ^(s+1) and a φ^(s+1) from p(μ | 𝒯, B^(s)) and p(φ | 𝒯, B^(s)), respectively.

Using standard Gibbs sampling arguments, the samples of μ and φ drawn in step 4 converge to the desired posterior. As the method is based on block Gibbs sampling (Ishwaran and James, 2001), we term it Gibbs-Hawkes.
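The following sketch shows the overall sampler loop under illustrative assumptions: the conditional posterior samplers of steps 3-4 are passed in as callables, and the dummy samplers in the demo merely exercise the control flow; none of this is the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_branching(times, mu, phi):
    """Steps 1-2: for each point t_j, sample a parent index according to
    Eqs. 10-11; -1 denotes 'generated by the background'. `times` must
    be sorted ascending."""
    parents = np.empty(len(times), dtype=int)
    for j, t in enumerate(times):
        rates = np.concatenate(([mu], phi(t - times[:j])))  # mu, phi(t_j - t_i)
        parents[j] = rng.choice(len(rates), p=rates / rates.sum()) - 1
    return parents

def gibbs_hawkes(times, T, n_iters, mu0, phi0, sample_mu, sample_phi):
    """Skeleton of the Gibbs-Hawkes sampler; `sample_mu` and `sample_phi`
    draw from the conditional posteriors of Sections 3.2 and 3.3."""
    mu, phi, samples = mu0, phi0, []
    for _ in range(n_iters):
        parents = sample_branching(times, mu, phi)  # steps 1-2
        mu = sample_mu(times, parents, T)           # steps 3-4 for mu
        phi = sample_phi(times, parents, T)         # steps 3-4 for phi
        samples.append((mu, phi))
    return samples

# Demo with fixed-form placeholders for the conditional posteriors.
times = np.sort(rng.uniform(0.0, 10.0, size=20))
exp_kernel = lambda dt: 0.8 * 2.0 * np.exp(-2.0 * dt)
chain = gibbs_hawkes(times, T=10.0, n_iters=3, mu0=1.0, phi0=exp_kernel,
                     sample_mu=lambda ts, pa, T: (pa == -1).sum() / T,
                     sample_phi=lambda ts, pa, T: exp_kernel)
print(chain[-1][0])
```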

3.1 Conditional Distribution and Sampling of the Branching Structure

The probability of the branching structure is the product of the probabilities of the triggering relationships between pairs of points. Given μ and φ, the probability of point t_j being triggered by an earlier point t_i is the ratio between φ(t_j − t_i) and λ(t_j) (see e.g. Halpin and De Boeck (2013)),

    p_ji ≡ P(t_i → t_j | 𝒯, μ, φ) = φ(t_j − t_i) / λ(t_j),   (10)

where t_i → t_j represents t_j being triggered by t_i. To see this, consider an infinitesimal interval [t_j, t_j + dt): by Bayes' rule, the probability that the point in this interval was generated by t_i is the probability contributed by t_i divided by the total; taking the limit dt → 0 transforms this ratio of probabilities into a ratio of intensities, and the definition of the intensity (Eq. 1) then gives Eq. 10. The probability of the point t_j being triggered by the background intensity is similarly computed as the ratio between μ and λ(t_j),

    p_jj ≡ P(t_j is an immigrant | 𝒯, μ, φ) = μ / λ(t_j).   (11)

Note that, by Eq. 1, p_jj + ∑_{i<j} p_ji = 1.

Given these probabilities, we may sample a branching structure B by sampling a parent for each t_j according to the probabilities {p_jj, p_j1, …, p_j,j−1}. The sampled branching structure separates the set of points into immigrants and offspring (introduced in Section 1). The immigrants can be regarded as a sequence generated from PP(μ) over [0, T], where PP(μ) denotes a Poisson process with intensity μ; this sequence is used to estimate the posterior of μ.

The key property, which we exploit in the subsequent Section 3.2 and Section 3.3, is the following. Denote by D_i the offspring generated by point t_i. If such a sequence is aligned to an origin at t_i, yielding D̃_i = {t_j − t_i : t_i → t_j}, then the aligned sequence is drawn from PP(φ) over [0, T − t_i], where [0, T] is the sample domain of the Hawkes process (Halpin and De Boeck, 2013). The posterior distribution of φ is estimated on all such aligned sequences.
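The alignment step can be sketched as follows; the parent encoding (index of the parent, or -1 for immigrants) matches the sampler sketch above and is an illustrative choice.

```python
import numpy as np

def aligned_offspring(times, parents):
    """Group offspring by parent and shift each group so that the parent
    sits at the origin; each aligned group is then a draw from PP(phi)
    on [0, T - t_i]. `parents[j] == i` means t_i triggered t_j; -1 marks
    immigrants."""
    times = np.asarray(times)
    groups = {}
    for j, i in enumerate(parents):
        if i >= 0:
            groups.setdefault(i, []).append(times[j] - times[i])
    return [np.array(g) for g in groups.values()]

# Toy example: points 1 and 2 triggered by point 0, point 3 by point 2.
times = np.array([0.3, 0.9, 1.4, 2.0])
parents = np.array([-1, 0, 0, 2])
print(aligned_offspring(times, parents))  # approx. [[0.6, 1.1], [0.6]]
```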

3.2 Posterior Distribution of μ

Continuing from the observations in Section 3.1, note that if we are given a set of N points X generated by PP(μ) over [0, T], the likelihood for μ is the Poisson likelihood,

    p(X | μ) = μ^N e^{−μT}.   (12)

For simplicity we place a conjugate (Gamma) prior on μ (shown as Eq. 13); the Gamma-Poisson conjugate family conveniently gives the posterior distribution of μ, i.e.

    μ ∼ Gamma(α₀, β₀),   (13)
    p(μ | X) ∝ p(X | μ) p(μ) ∝ μ^{α₀+N−1} e^{−(β₀+T)μ}.   (14)

We choose the shape α₀ and the rate β₀ of the Gamma prior by constraining the mean and the variance of the resulting Gamma posterior, which is easily shown to correspond to α₀ = N and β₀ = 1. Finally, due to conjugacy, we obtain the posterior distribution

    μ | X ∼ Gamma(α₀ + N, β₀ + T).   (15)
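Sampling from the conjugate posterior of Eq. 15 is a one-liner; the sketch below is illustrative, with N counted from the sampled branching structure and the prior values plugged in as arguments.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_mu_posterior(n_immigrants, T, alpha0, beta0, size=1):
    """Draw from the conjugate posterior of Eq. 15:
    mu | X ~ Gamma(alpha0 + N, beta0 + T)."""
    shape = alpha0 + n_immigrants
    rate = beta0 + T
    # NumPy parameterizes the Gamma by shape and *scale* = 1 / rate.
    return rng.gamma(shape, 1.0 / rate, size=size)

# Example: 37 immigrants observed on [0, 10], prior Gamma(37, 1).
print(sample_mu_posterior(37, 10.0, alpha0=37.0, beta0=1.0, size=3))
```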

3.3 Posterior Distribution of φ

We handle the posterior distribution of the triggering kernel φ given the branching structure in a manner analogous to the LBPP method of Walder and Bishop (2017). That is, we assume that φ(t) = f(t)², where f is Gaussian process distributed as described in Section 2.2. In line with Walder and Bishop (2017), the specific kernel function we employ is the so-called cosine kernel,

    k(x, y) = ∑_j λ_j e_j(x) e_j(y),   (16)
    λ_j = ( a ‖π j‖² + b )^{−1},   (17)
    e_j(x) = ∏_{i=1}^{d} 2^{𝟙[j_i > 0]/2} cos(π j_i x_i).   (18)

Here, j = (j_1, …, j_d) is a multi-index with non-negative (integral) values, 𝟙[·] is the indicator function, a and b are parameters controlling the prior smoothness, d is the dimension of the data (d = 1 for our temporal data), and we take the sample space to be [0, 1]^d. This basis is orthonormal over [0, 1]^d with the Lebesgue measure. The expansion in Eq. 16 is an explicit kernel construction based on the Mercer expansion as per Eq. 3, but other kernels may be used, for example via a Nyström approximation (Flaxman et al., 2017) to the Mercer decomposition.
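The cosine basis of Eq. 18 (in d = 1) is simple to evaluate; this sketch also checks its orthonormality on [0, 1] numerically. The helper name and grid are illustrative.

```python
import numpy as np

def cosine_basis(x, n_basis):
    """1-D cosine basis of Eq. 18 on [0, 1]: e_0(x) = 1 and
    e_j(x) = sqrt(2) * cos(pi * j * x) for j >= 1.
    Returns an array of shape (len(x), n_basis)."""
    x = np.atleast_1d(x)
    j = np.arange(n_basis)
    scale = np.where(j > 0, np.sqrt(2.0), 1.0)
    return scale * np.cos(np.pi * np.outer(x, j))

# Orthonormality check with a hand-rolled trapezoidal rule
# (Lebesgue measure on [0, 1]).
grid = np.linspace(0.0, 1.0, 20001)
E = cosine_basis(grid, 4)
w = np.full(len(grid), 1.0 / (len(grid) - 1))
w[0] *= 0.5
w[-1] *= 0.5
gram = E.T @ (w[:, None] * E)
print(np.round(gram, 3))  # approximately the 4 x 4 identity matrix
```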

As mentioned at the end of Section 3.1, by conditioning on the branching structure we may estimate φ by considering the aligned sequences. In particular, letting D̃_i denote the aligned sequence generated by t_i, the joint distribution of {D̃_i} and w is calculated as (Walder and Bishop, 2017)

    log p({D̃_i}, w) = ∑_i ( ∑_{Δ ∈ D̃_i} log f(Δ)² − ∫_0^{T−t_i} f(t)² dt ) + log N(w; 0, Λ),   (19)

where f = w^⊤ ẽ as in Eq. 4. Note that there is a subtle but important difference between the integral term above and that of Walder and Bishop (2017), namely the limit of integration; closed-form expressions for the present case are provided in Section A.1. Putting the above equation into Eq. 6 and Eq. 7, we obtain the mean ŵ and the covariance Q of the (Laplace) approximate posterior of w. Then, according to Eqs. 9 and 8, the posterior of φ is obtained:

    φ(t) | 𝒯, B ∼ Gamma( α̃(t), β̃(t) ),   (20)

where the shape α̃(t) and the rate β̃(t) follow, as in Eq. 9, from

    f(t) | 𝒯, B ∼ N( ŵ^⊤ ẽ(t), ẽ(t)^⊤ Q ẽ(t) ).   (21)

3.4 Computational Complexity

The time complexity of evaluating the LBPP objective function is O(NM² + M³), due in particular to the inversion of the M × M matrix in Eq. 7, where N is the number of points and M the number of basis functions. Interestingly, in comparison with LBPP, while our model is in some sense more complex, it enjoys a more favorable computational complexity. For our method, estimating p(μ | 𝒯, B) given the branching structure takes constant time. For p(φ | 𝒯, B), the time complexity is O(∑_i N_i M² + M³), where N_i is the number of offspring of t_i; since the branching factor of the process is finite, the expected N_i is bounded. As a result, the time complexity for fitting φ is O(NM² + M³), that is, linear in the number of points. While the naive complexity of computing the branching probabilities is O(N²), Halpin (2012) provides methods to reduce it to O(N). In summary, we have the following complexities (N points, M basis functions):

    Operation                           Complexity
    Sampling the branching structure    O(N)
    Estimating p(μ | 𝒯, B)              O(1)
    Estimating p(φ | 𝒯, B)              O(NM² + M³)

4 M.A.P. Estimation

Figure 5: A visual summary of the Gibbs-Hawkes, EM-Hawkes and EM algorithms. The differences are (1) the number of sampled branching structures and (2) the μ and φ selected for computing the branching-structure probabilities. In contrast with Gibbs-Hawkes, the EM-Hawkes method draws multiple branching structures at once and computes the branching probabilities with the M.A.P. μ and φ. The EM algorithm is equivalent to sampling infinitely many branching structures and exploiting the M.A.P. or constrained M.L.E. μ and φ to compute the branching probabilities (see Section 4).

In this section, we explore the connection between our method and the EM algorithm, and we explain a simple variant which approximates EM.

Figure 10: L2 distance between predicted and true μ and φ for Gibbs-Hawkes, EM-Hawkes and the parametric Hawkes process (panels (a)-(d)). The two plots in the first row show the L2 distance measured on data generated by the toy Hawkes process equipped with the cosine triggering kernel, and the second row with the exponential triggering kernel. Each box shows the median (the orange line) and the mean (number near the box and the green triangle). A * before a pair of methods indicates statistical significance (threshold: 0.05).

Relationship to the EM Algorithm

In Section 1 we mentioned that our method is close to the stochastic EM algorithm (Celeux and Diebolt, 1985). The difference is in the M-step; to obtain an EM algorithm (Dempster et al., 1977) we need only make the following modifications: (1) sample infinitely many branching structures at each iteration, and (2) re-calculate the probabilities of the branching structures with the maximum-a-posteriori (M.A.P.) μ and φ. Given that infinite set of branching structures, our method would be the EM algorithm. More specifically, maximizing the expected log posterior distribution to estimate the M.A.P. μ and φ given infinite branching structures is equivalent to maximizing the EM objective in the M-step, i.e., maximizing the constrained expected log-likelihood

    Q(μ, φ) = E_{B ∼ p(B | 𝒯, μ^(s), φ^(s))} [ log p(𝒯, B | μ, φ) ] + log p(μ; α₀, β₀) + log p(φ),   (22)

where B represents the branching structure, whose triggering probabilities are given by Eqs. 10 and 11, μ^(s) and φ^(s) are the current estimates, and α₀, β₀ are the parameters of the Gamma prior of μ. This form is obtained using Bayes' rule and is common in EM-algorithm-based estimations (Lewis and Mohler, 2011; Zhou et al., 2013). Finally, note that modification (2) above is the same as the E-step of the EM algorithm.

EM-Hawkes

The above discussion suggests a variant of the Gibbs-Hawkes algorithm of Section 3, which we name EM-Hawkes, an approximate EM algorithm. Specifically, the differences from Gibbs-Hawkes are that EM-Hawkes (1) samples some finite number of cluster assignments (to approximate the expected log posterior distribution), and (2) finds the M.A.P. triggering kernel and background intensity rather than sampling them as per block Gibbs sampling (the M-step of the EM algorithm). An overview of the Gibbs-Hawkes, EM-Hawkes and EM algorithms is illustrated in Fig. 5.

Note that under our LBPP-like posterior, finding the most likely triggering kernel φ is intractable (shown in Section A.3). As an approximation, we take the element-wise mode of the Gamma marginals of φ (Eq. 20) to approximate the mode of the posterior. A natural alternative, which we found to perform similarly in practice, would be to take the most likely w (which is tractable, as w is multivariate Gaussian) and apply φ = (w^⊤ ẽ)².
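For independent Gamma marginals, the element-wise mode has a closed form, (α̃ − 1)/β̃ for α̃ > 1 (and 0 otherwise); a small sketch, with the shapes and rates assumed to come from Eq. 20 evaluated on a grid:

```python
import numpy as np

def gamma_marginal_mode(shape, rate):
    """Element-wise mode of independent Gamma(shape, rate) marginals:
    (shape - 1) / rate where shape > 1, and 0 otherwise."""
    shape = np.asarray(shape, dtype=float)
    rate = np.asarray(rate, dtype=float)
    return np.where(shape > 1.0, (shape - 1.0) / rate, 0.0)

# Example: marginal modes of phi(t) at three grid points.
print(gamma_marginal_mode(shape=[2.5, 0.8, 4.0], rate=[5.0, 2.0, 8.0]))
```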

5 Experiments

Figure 13: Holdout log-likelihoods on Twitter diffusion data for Gibbs-Hawkes, EM-Hawkes and the parametric Hawkes process: (a) ACTIVE, (b) SEISMIC. Each box shows the mean (number near the box and the green triangle) and the median (the orange line). A * before a pair of methods indicates statistical significance (threshold: 0.05).

In this section, we compare our proposed approaches, Gibbs-Hawkes and EM-Hawkes, to the parametric Hawkes process, on synthetic data and on online diffusion data.

5.1 Synthetic Data

We employ two toy Hawkes processes to generate data, both having the same constant background intensity μ and equipped with cosine (Eq. 23) and exponential (Eq. 24) triggering kernels, respectively:

(23)
(24)

Prediction

For the parametric Hawkes and EM-Hawkes, the predicted background intensity and triggering kernel are taken to be the M.A.P. values, while for Gibbs-Hawkes we use the posterior means.

Evaluation

Each toy model generates 400 point sequences, which are evenly split into 40 groups of 10 sequences each: 20 groups for training and 20 for testing. Each of the three methods is fit on each group, i.e., summing the log-likelihoods over the 10 sequences of a group (for the parametric Hawkes) or estimating the log posterior probability of the Hawkes process given the 10 sequences (for Gibbs-Hawkes and EM-Hawkes). Since the true processes are known, we evaluate based on the L2 distance between the predicted and the true μ and φ, e.g.,

    ‖φ̂ − φ‖₂ = ( ∫ ( φ̂(t) − φ(t) )² dt )^{1/2},   (25)

and analogously for μ.
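Eq. 25 can be approximated numerically; a minimal sketch with a hand-rolled trapezoidal rule (the integration domain and grid size are illustrative assumptions):

```python
import numpy as np

def l2_distance(f_hat, f_true, upper, n_grid=10_000):
    """L2 distance of Eq. 25 between two kernels on [0, upper],
    approximated with the trapezoidal rule."""
    t = np.linspace(0.0, upper, n_grid)
    d2 = (f_hat(t) - f_true(t)) ** 2
    dt = t[1] - t[0]
    return np.sqrt(np.sum((d2[:-1] + d2[1:]) * dt / 2.0))

# Example: distance between two exponential kernels on [0, 5].
phi_true = lambda t: 0.8 * 2.0 * np.exp(-2.0 * t)
phi_hat = lambda t: 0.7 * 2.2 * np.exp(-2.2 * t)
print(l2_distance(phi_hat, phi_true, upper=5.0))
```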

Experimental details

The parametric Hawkes is equipped with a constant background intensity and an exponential triggering kernel (Eq. 26, below), which has two parameters and is estimated by maximum likelihood. For Gibbs-Hawkes and EM-Hawkes, we must select the parameters of the GP kernel (Eqs. 16, 17 and 18). Having many basis functions leads to high fitting accuracy but low speed; we found that using 32 basis functions provides a suitable balance. For the kernel parameters of Eq. 17, we choose a = b = 0.002 (as in Fig. 4). 5000 iterations are run to fit each group, and the first 1000 are discarded as burn-in.

(26)

Flexibility

Fig. 10 shows the L2 distance between the learned and the true μ and φ on test data (for paired t-tests, see Table 1 in the appendix). We observe that when the synthetic dataset is generated using a cosine decaying kernel (Fig. 10(a)), both Gibbs-Hawkes and EM-Hawkes perform statistically significantly better than the exponential parametric form. For synthetic data generated using an exponential kernel (Fig. 10(c)), the exponential parametric version is only slightly better than our two non-parametric approaches. For estimating μ, Gibbs-Hawkes performs similarly to the parametric method, and both methods outperform EM-Hawkes (shown in Fig. 10(b)(d)). The learned triggering kernels are shown in Figs. 4 and 17(a). In summary, our methods achieve very good performance for data generated by kernels from several parametric classes, whereas the parametric models are only effective for data issued from their own class.

Scalability

For Gibbs-Hawkes and EM-Hawkes, we record the average time taken by each iteration for different dataset sizes, shown in the appendix (Fig. 18). The ratio between the computational time and the number of points is roughly constant, which empirically confirms that the complexity of our method is linear in the number of points. The main contributor to the scalability of our approaches is the stochastic branching structure, which decomposes a Hawkes process into a cluster of Poisson processes. This important advantage makes our methods applicable to large datasets.

5.2 Twitter Diffusion Data

Figure 17: Learned Hawkes triggering kernels using our non-parametric Bayesian approaches: (a) exponential synthetic data; (b) ACTIVE vs. SEISMIC; (c) categories Music vs. Pets & Animals. Each red or blue area shows the estimated posterior distribution of φ, and the solid lines indicate the 10th, 50th and 90th percentiles. (a) A synthetic dataset is simulated using the exponential kernel (Eq. 24, shown in gray), and we fit it using Gibbs-Hawkes (in red) and EM-Hawkes (in blue); (b) the learned triggering kernel for Twitter cascades in ACTIVE (in red) and SEISMIC (in blue); (c) the learned triggering kernel for Twitter cascades associated with two categories in the ACTIVE set: Music (in red) and Pets & Animals (in blue).

We evaluate the performance of our two proposed approaches (Gibbs-Hawkes and EM-Hawkes) on two Twitter datasets containing retweet cascades. A retweet cascade contains an original tweet, together with its direct and indirect retweets. Current state-of-the-art diffusion modeling approaches (Zhao et al., 2015; Mishra et al., 2016; Rizoiu et al., 2018) are based on the self-exciting assumption: users get in contact with online content and then diffuse it to their friends, therefore generating a cascading effect. The two datasets we use in this work have been employed in prior work and are publicly available:

  • ACTIVE (Rizoiu et al., 2018) contains 41k retweet cascades, each containing at least 20 (re)tweets with links to YouTube videos. It was collected in 2014, and each YouTube video (and therefore each cascade) is associated with a YouTube category, such as Music, News or Film.

  • SEISMIC (Zhao et al., 2015) contains 166k randomly sampled retweet cascades, collected from October 7 to November 7, 2011. Each cascade contains at least 50 tweets.

Setup

The temporal extent of each cascade is rescaled to a common interval, and each cascade is assigned to either the training or the test data with equal probability. We bundle together groups of 30 cascades of similar size, and we estimate one Hawkes process for each bundle. Unlike for the synthetic data, for the retweet cascade datasets there is no true Hawkes process to evaluate against. Instead, we measure, using the log-likelihood, how well the learned model generalizes to the test set. We use the same hyper-parameter values as for the synthetic data.

Results

For each dataset, we show in Fig. 13 the log-likelihood per event obtained by our approaches and by the exponential parametric Hawkes (Eq. 26), together with significance results (detailed in Table 2 in the appendix). Visibly, Gibbs-Hawkes and EM-Hawkes consistently outperform the parametric method on both datasets (by 22% and 12%, respectively), with EM-Hawkes obtaining slightly better performance than Gibbs-Hawkes (by 0.6% and 0.4%, respectively). This seems to indicate that online diffusion is influenced by factors not captured by the parametric kernel, therefore justifying the need to learn the Hawkes kernels non-parametrically.

Interpretation

We show in Fig. 17 the learned kernels for information diffusions. We notice that the learned kernels appear to be decaying and long-tailed, in accordance with the prior literature. Fig. 17(b) shows that the kernel learned on SEISMIC decays faster than the kernel learned on ACTIVE. This indicates that non-specific (i.e. random) cascades have a faster decay than video-related cascades, presumably because YouTube videos stay longer in human attention. This connection between the type of content and the speed of the decay is further supported by Fig. 17(c), where we show the learned kernels for two categories in ACTIVE: Music and Pets & Animals. Cascades relating to Pets & Animals have a faster-decaying kernel than Music, most likely because Music is ever-green content.

6 Conclusions

In this paper, we develop a general framework to perform non-parametric Bayesian estimation for Hawkes processes. Our method iterates two steps. First, it samples the branching structure, which transforms the Hawkes process estimation problem into that of a cluster of Poisson processes. Next, it estimates the intensity of the Hawkes process using a non-parametric Bayesian estimate of the intensity of the resulting Poisson processes. We analyze the connection between our method and the EM algorithm, and we demonstrate the flexibility and the scalability of our method on synthetic data. On two large Twitter diffusion datasets, our method outperforms the parametric Hawkes process, and the learned non-parametric kernel reflects the longevity of different types of content.

Limitations and future work Our current framework is limited to the univariate unmarked Hawkes process. Future work includes extensions to the marked multivariate Hawkes process.

References

  • Adams et al. (2009) Adams, R. P., Murray, I., and MacKay, D. J. (2009). Tractable nonparametric bayesian inference in poisson processes with gaussian process intensities. In

    Proceedings of the 26th Annual International Conference on Machine Learning

    , pages 9–16. ACM.
  • Bacry et al. (2012) Bacry, E., Dayri, K., and Muzy, J.-F. (2012). Non-parametric kernel estimation for symmetric hawkes processes. application to high frequency financial data. The European Physical Journal B, 85(5):157.
  • Bacry and Muzy (2014) Bacry, E. and Muzy, J.-F. (2014). Second order statistics characterization of hawkes processes and non-parametric estimation. arXiv preprint arXiv:1401.0903.
  • Blundell et al. (2012) Blundell, C., Beck, J., and Heller, K. A. (2012). Modelling reciprocating relationships with hawkes processes. In Advances in Neural Information Processing Systems, pages 2600–2608.
  • Celeux and Diebolt (1985) Celeux, G. and Diebolt, J. (1985). The sem algorithm: A probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Computational Statistics Quarterly, 2:73–82.
  • Dempster et al. (1977) Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the em algorithm. JOURNAL OF THE ROYAL STATISTICAL SOCIETY, SERIES B, 39(1):1–38.
  • Diggle et al. (2013) Diggle, P. J., Moraga, P., Rowlingson, B., Taylor, B. M., et al. (2013). Spatial and spatio-temporal log-gaussian cox processes: extending the geostatistical paradigm. Statistical Science, 28(4):542–563.
  • Eichler et al. (2017) Eichler, M., Dahlhaus, R., and Dueck, J. (2017). Graphical modeling for multivariate hawkes processes with nonparametric link functions. Journal of Time Series Analysis, 38(2):225–242.
  • Flaxman et al. (2017) Flaxman, S., Teh, Y. W., and Sejdinovic, D. (2017). Poisson intensity estimation with reproducing kernels. In Artificial Intelligence and Statistics, pages 270–279.
  • Gugushvili et al. (2018) Gugushvili, S., van der Meulen, F., Schauer, M., and Spreij, P. (2018). Fast and scalable non-parametric bayesian inference for poisson point processes. arXiv preprint arXiv:1804.03616.
  • Halpin (2012) Halpin, P. F. (2012). An em algorithm for hawkes process. Psychometrika, 2.
  • Halpin and De Boeck (2013) Halpin, P. F. and De Boeck, P. (2013). Modelling dyadic interaction with hawkes processes. Psychometrika, 78(4):793–814.
  • Hawkes (1971) Hawkes, A. G. (1971). Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90.
  • Hawkes and Oakes (1974) Hawkes, A. G. and Oakes, D. (1974). A cluster process representation of a self-exciting process. Journal of Applied Probability, 11(3):493–503.
  • Illian et al. (2012) Illian, J. B., Sørbye, S. H., and Rue, H. (2012). A toolbox for fitting complex spatial point process models using integrated nested laplace approximation (inla). The Annals of Applied Statistics, pages 1499–1530.
  • Ishwaran and James (2001) Ishwaran, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453):161–173.
  • Kirchner (2017) Kirchner, M. (2017). An estimation procedure for the hawkes process. Quantitative Finance, 17(4):571–595.
  • Kurashima et al. (2018) Kurashima, T., Althoff, T., and Leskovec, J. (2018). Modeling interdependent and periodic real-world action sequences. In Proceedings of the 27th International Conference on World Wide Web, volume 2018, page 803. NIH Public Access.
  • Lewis and Mohler (2011) Lewis, E. and Mohler, G. (2011). A nonparametric em algorithm for multiscale hawkes processes. Journal of Nonparametric Statistics, 1(1):1–20.
  • Lloyd et al. (2015) Lloyd, C., Gunter, T., Osborne, M., and Roberts, S. (2015). Variational inference for gaussian process modulated poisson processes. In International Conference on Machine Learning, pages 1814–1822.
  • Lloyd et al. (2016) Lloyd, C., Gunter, T., Osborne, M., Roberts, S., and Nickson, T. (2016). Latent point process allocation. In Gretton, A. and Robert, C. C., editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 389–397, Cadiz, Spain. PMLR.
  • Mercer (1909) Mercer, J. (1909). Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 209:415–446.
  • Mishra et al. (2016) Mishra, S., Rizoiu, M.-A., and Xie, L. (2016). Feature driven and point process approaches for popularity prediction. In Proceedings of the International Conference on Information and Knowledge Management (CIKM '16). ACM.
  • Møller et al. (1998) Møller, J., Syversveen, A. R., and Waagepetersen, R. P. (1998). Log gaussian cox processes. Scandinavian journal of statistics, 25(3):451–482.
  • Rasmussen and Williams (2005) Rasmussen, C. E. and Williams, C. K. I. (2005). Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.
  • Rasmussen (2013) Rasmussen, J. G. (2013). Bayesian inference for hawkes processes. Methodology and Computing in Applied Probability, 15(3):623–642.
  • Rathbun and Cressie (1994) Rathbun, S. L. and Cressie, N. (1994). Asymptotic properties of estimators for the parameters of spatial inhomogeneous poisson point processes. Advances in Applied Probability, 26(1):122–154.
  • Rizoiu et al. (2018) Rizoiu, M.-A., Mishra, S., Kong, Q., Carman, M., and Xie, L. (2018). Sir-hawkes: Linking epidemic models and hawkes processes to model diffusions in finite populations. In Proceedings of the 27th International Conference on World Wide Web, pages 419–428. International World Wide Web Conferences Steering Committee.
  • Rousseau et al. (2018) Rousseau, J., Donnet, S., and Rivoirard, V. (2018). Nonparametric bayesian estimation of multivariate hawkes processes. arXiv preprint arXiv:1802.05975.
  • Rubin (1972) Rubin, I. (1972). Regular point processes and their detection. IEEE Transactions on Information Theory, 18(5):547–557.
  • Samo and Roberts (2015) Samo, Y.-L. K. and Roberts, S. (2015). Scalable nonparametric bayesian inference on point processes with gaussian processes. In International Conference on Machine Learning, pages 2227–2236.
  • Shirai and Takahashi (2003) Shirai, T. and Takahashi, Y. (2003). Random point fields associated with certain fredholm determinants ii: Fermion shifts and their ergodic and gibbs properties. The Annals of Probability, 31(3):1533–1564.
  • Tabibian et al. (2017) Tabibian, B., Valera, I., Farajtabar, M., Song, L., Schölkopf, B., and Gomez-Rodriguez, M. (2017). Distilling information reliability and source trustworthiness from digital traces. In Proceedings of the 26th International Conference on World Wide Web, pages 847–855. International World Wide Web Conferences Steering Committee.
  • Walder and Bishop (2017) Walder, C. J. and Bishop, A. N. (2017). Fast Bayesian intensity estimation for the permanental process. In Precup, D. and Teh, Y. W., editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3579–3588, International Convention Centre, Sydney, Australia. PMLR.
  • Zhao et al. (2015) Zhao, Q., Erdogdu, M. A., He, H. Y., Rajaraman, A., and Leskovec, J. (2015). Seismic: A self-exciting point process model for predicting tweet popularity. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1513–1522. ACM.
  • Zhou et al. (2013) Zhou, K., Zha, H., and Song, L. (2013). Learning social infectivity in sparse low-rank networks using multi-dimensional hawkes processes. In Artificial Intelligence and Statistics, pages 641–649.


Appendix A Appendix

Accompanying the submission Efficient Non-parametric Bayesian Hawkes Processes.

A.1 Computing the Integral Term of the Log-likelihood

We consider φ(t) = f(t)², a constant background intensity μ, f = w^⊤ ẽ as defined in Eq. 4, and aligned data {D̃_i}. The integral term in the log-likelihood is calculated as

    ∑_i ∫_0^{T−t_i} f(t)² dt = w^⊤ ( ∑_i ∫_0^{T−t_i} ẽ(t) ẽ(t)^⊤ dt ) w ≡ w^⊤ M w.   (27)

In our case, Eq. 18 has d = 1, i.e., the basis functions are e_j(t) = 2^{𝟙[j>0]/2} cos(π j t) with a scalar index j. The entries of the matrix M are therefore integrals of products of cosines over [0, T − t_i], which are available in closed form.

Figure 18: Fitting Time per Point in One Iteration.

A.2 Paired T-tests for Synthetic and Real-Life Data

The threshold for statistical significance is 0.05, and the paired t-test results are given in Tables 1 and 2.

Fig. 10 panel   Pairs       t        p
(a)             Gibbs-EM    1.452    0.163
(a)             Gibbs-exp   -12.599  1.133
(a)             EM-exp      -12.442  1.402
(b)             Gibbs-EM    -1.866   0.077
(b)             Gibbs-exp   0.946    0.356
(b)             EM-exp      2.049    0.055
(c)             Gibbs-EM    1.714    0.103
(c)             Gibbs-exp   1.154    0.263
(c)             EM-exp      0.854    0.404
(d)             Gibbs-EM    -4.227   0.0005
(d)             Gibbs-exp   0.588    0.563
(d)             EM-exp      3.867    0.001
Table 1: Paired t-tests for synthetic data, grouped by the panels of Fig. 10.
Data      Pairs       t        p
ACTIVE    Gibbs-EM    -1.357   0.175
ACTIVE    Gibbs-exp   62.320   0.0
ACTIVE    EM-exp      44.970   0.0
SEISMIC   Gibbs-EM    -3.090   0.002
SEISMIC   Gibbs-exp   98.651   0.0
SEISMIC   EM-exp      99.524   0.0
Table 2: Paired t-tests for real-life data.

A.3 Mode-Finding the Triggering Kernel

Here we demonstrate in detail the computational challenges involved in finding the posterior mode with respect to the value of the triggering kernel at multiple point locations. Consider the triggering kernel φ = f², where f is Gaussian process distributed. For a dataset X = {x_1, …, x_N}, f(X) has a normal distribution, i.e., f(X) ∼ N(m, Σ), where m and Σ are the mean and the covariance matrix. The density of y = φ(X) = f(X)² is obtained by differentiating its cumulative distribution function, giving

    p(y) = ∑_{s ∈ {−1,+1}^N} N( s ⊙ √y ; m, Σ ) ∏_{i=1}^{N} ( 2√y_i )^{−1},   (28)

where {−1,+1}^N is the N-fold Cartesian product of {−1,+1} and ⊙ denotes the element-wise product. This density is a sum of 2^N exponential functions, whose joint mode is intractable to find for large N.