In this paper, we develop a non-parametric Bayesian estimation of Hawkes process kernel functions. Our method is based on the cluster representation of Hawkes processes. We sample random branching structures, and thus split the Hawkes process into clusters of Poisson processes, where the intensity function of each of these processes is the nonparametric triggering kernel of the Hawkes process. We derive both a block Gibbs sampler and a maximum a posteriori estimator based on stochastic expectation maximization. On synthetic data, we show our method to be flexible and scalable, and on two largescale Twitter diffusion datasets, we show our method to outperform the parametric Hawkes model. We observe that the learned non-parametric kernel reflects the longevity of different content types.READ FULL TEXT VIEW PDF
The Hawkes process has been widely applied to modeling self-exciting eve...
Multi-dimensional Hawkes process (MHP) is a class of self and mutually
We propose a statistical method to estimate simultaneously the non-param...
Generative dynamic texture models (GDTMs) are widely used for dynamic te...
I suggest an approach that helps the online marketers to target their
Background: The "proton radius puzzle" refers to an eight-year old probl...
The effectiveness of Bayesian Additive Regression Trees (BART) has been
is the probability ofbeing generated by .
The Hawkes process (Hawkes, 1971) is a useful model to describe self-exciting point data in which every point increases the likelihood of arrival of new points. More specifically, every point causes the conditional intensity function which modulates the arrival rate of new points to increase. An alternative understanding of the Hawkes process is the Poisson cluster process (Hawkes and Oakes, 1974) which categorizes points into immigrants and offspring. Immigrant points are generated independently at a rate of and offspring points are triggered by existing points at a rate of . Points are structured into clusters, associated with each immigrant point, which is called the branching structure (an example shown in Fig. 3).
The Hawkes process has been applied to a wide range of problems. Mishra et al. (2016) employ the branching factor of the Hawkes process to predict popularity of tweets; Kurashima et al. (2018) predict human actions by the Hawkes process; information reliability and information source trustworthiness of online knowledge repositories are evaluated by the Hawkes process (Tabibian et al., 2017). Here, we address two difficulties of leveraging Hawkes processes in such applications: (1) the aforementioned work employ manually designed parametric Hawkes models which may not generalize well and (2) in particular cases, the shape of the Hawkes kernel might not be easy to deduce in advance.
A typical solution is the non-parametric estimation of the intensity function (Lewis and Mohler, 2011; Zhou et al., 2013; Eichler et al., 2017; Bacry et al., 2012; Bacry and Muzy, 2014; Kirchner, 2017). However, these are all frequentist methods, which do not allow to quantify the uncertainty of the learned kernels. There exist several work (Blundell et al., 2012; Rasmussen, 2013; Rousseau et al., 2018)
on the Bayesian inference for the Hawkes process. The main difficulty in developing a non-parametric Bayesian method is the quadratic time taken to compute the log-likelihood of the Hawkes process. The related work closest to this paper is the method ofRousseau et al. (2018)
, who employ Markov chain Monte Carlo (MCMC) to estimate the posterior of the Hawkes model. The main difference between this approach and ours is that our work uses the stochastic branching structure to accelerate the estimation of the posterior Hawkes process.
In this paper, we propose a general framework for the efficient non-parametric Bayesian inference of Hawkes processes. We exploit block Gibbs sampling (Ishwaran and James, 2001) to iteratively sample the latent branching structure, the background intensity and the triggering kernel . In each iteration, the point data are decomposed as a cluster of Poisson processes based on the sampled branching structure. This is exemplified in Fig. 3, in which a Hawkes process (the top temporal axis) is decomposed into several Poisson processes (the following temporal axis), and the branching structure is inferred. Posterior and are estimated from the resulting clustered processes. Our framework is close to the stochastic Expectation-Maximization (EM) algorithm (Celeux and Diebolt, 1985) where posterior and are estimated in the M-step and random samples of and are drawn (Adams et al., 2009; Lloyd et al., 2015; Samo and Roberts, 2015; Walder and Bishop, 2017; Gugushvili et al., 2018). We adapt the approach of the recent non-parametric Bayesian estimation for Poisson process intensities, termed Laplace Bayesian Poisson process (LBPP) (Walder and Bishop, 2017), to estimate the posterior given the sampled branching structure. Furthermore, we explore the connection with the EM algorithm (Dempster et al., 1977) and develop a variant of our method as an approximate EM algorithm. On synthetic data, we show our method can infer the exponential and the cosine triggering kernels, and the fitting time of our method is linear to the number of points. On two large social media datasets, our method outperforms the parametric Hawkes process by as much as in holdout log-likelihood.
In this section, we introduce prerequisites: the Hawkes process and LBPP.
As introduced in Section 1, the Hawkes process can be specified by the conditional intensity function which modulates the arrival rate of points. Mathematically, conditioned on a set of points , the intensity is expressed as (Hawkes, 1971)
where and are the background intensity and the triggering kernel. The log-likelihood of given the background intensity and the triggering kernel is (Rubin, 1972)
where is the sampling domain of with measure .
LBPP (Walder and Bishop, 2017) has been proposed for the non-parametric Bayesian estimating the intensity of a Poisson process. To satisfy non-negativity of the intensity function, LBPP models a point process intensity as a permanental process (Shirai and Takahashi, 2003), i.e., where the link function and is Gaussian process (GP) distributed. Alternative link functions for non-negative intensities include (Rathbun and Cressie, 1994; Møller et al., 1998; Illian et al., 2012; Diggle et al., 2013) and (Adams et al., 2009) where is constant. The choice has analytical and computational advantages (Lloyd et al., 2016; Flaxman et al., 2017; Walder and Bishop, 2017; Lloyd et al., 2015) which lead to a linear time algorithm for LBPP.
LBPP exploits the Mercer expansion (Mercer, 1909) of the GP covariance function ,
where for LBPP the eigenfunctionsare chosen to be orthonormal in for some sample space with measure . can be represented as a linear combination of (Rasmussen and Williams, 2005),
and has a Gaussian prior, i.e., where is a diagonal covariance matrix and
is a vector of basis functions. The posterior intensity at some, , requires estimating the posterior
which is approximated by a normal distribution (a.k.a Laplace approximation)(Rasmussen and Williams, 2005):
where X is a set of point data, the sampling space and the Gaussian process kernel function. is selected as the mode of the true posterior and Q the negative inverse Hessian of the true posterior at :
The approximate posterior distribution of is a normal distribution (Rasmussen and Williams, 2005):
Furthermore, the posterior distribution of
is a Gamma distribution(Walder and Bishop, 2017):
We now detail our efficient non-parametric Bayesian estimation algorithm, which samples the posterior of (constant background intensity) and (the trigerring kernel). Our method starts with random and iterates by cycling through the following four steps ( is the iteration index):
Calculate , the distribution of the branching structure given the data, triggering kernel, and background intensity (see Section 3.1).
Sample a using (see Section 3.1).
Sample a and from and , respectively.
Using standard Gibbs sampling arguments, the samples of and drawn in the step (iv) converge to the desired posterior. As the method is based on block Gibbs sampling (Ishwaran and James, 2001), we term it Gibbs-Hawkes.
The probability of the branching structure is the product of the probabilities of the triggering relationships between pairs of points. Given and , the probability of point being triggered by point is the ratio between and (see e.g. (Halpin and De Boeck, 2013)),
where represents being triggered by , is the number of offspring of and . The first line of Section 3.1 is the probabilistic statement of our question and the second line is due to Bayes’ rule. The third line considers an infinitesimal interval (), and takes the limit so that the ratio of probabilities is transformed as a ratio of intensities. Finally, we use the definition of the intensity. The probability of the point being triggered from the background intensity is similarly computer as the ratio between and ,
Let us simplify the notation to
Given these probabilities we may sample a branching structure by sampling a parent for each according to probabilities . The sampled branching structure separates a set of points into immigrants and offspring (introduced in Section 1). Immigrants can be regarded as a sequence generated from PP() over , where PP is a Poisson process which has an intensity , and used to estimate the posterior .
The key property which we exploit in the subsequent Section 3.2 and Section 3.3 is the following. Denote by the offspring generated by point . If such a sequence is aligned to an origin at , yielding , then the aligned sequence is drawn from PP() over [0, T-] where is the sample domain of the Hawkes process (Halpin and De Boeck, 2013). The posterior distribution of is estimated on all such aligned sequences.
Continuing from the observations in Section 3.1, note that if we are given a set of points generated by PP() over , the likelihood for is the Poisson likelihood,
For simplicity we place a conjugate (Gamma) prior on (shown as Eq. 13); the Gamma-Poisson conjugate family conveniently gives the posterior distribution of , i.e.
We choose the scale and the rate in the Gamma prior by making the mean of the Gamma posterior equal to
and the variance, which is easily shown to correspond to = N and = 1. Finally due to conjugacy we obtain the posterior distribution,
We handle the posterior distribution of the triggering kernel given the branching structure in an analogous manner to the LBPP method of Walder and Bishop (2017). That is, we assume that where is Gaussian process distributed as described in Section 2.2. In line with (Walder and Bishop, 2017), the specific kernel function we employ is the so-called cosine kernel,
Here, is a multi-index with non-negative (integral) values, is the indicator function, and are parameters controlling the prior smoothness, we have dimensions of data, and we let . This basis is orthonormal over with Lebesgue measure. The expansion Eq. 16 is an explicit kernel construction based on the Mercer expansion as per Eq. 3, but other kernels may be used, for example by Nyström approximation (Flaxman et al., 2017) to the Mercer decomposition.
As mentioned at the end of Section 3.1, by conditioning on the branching structure we may estimate by considering the aligned sequences. In particular, letting denote the aligned sequence generated by
, the joint distribution ofand is calculated as (Walder and Bishop, 2017)
Note that there is a subtle but important difference between the integral term above and that of Walder and Bishop (2017), namely the limit of integration; closed-form expressions for the present case are provided in Section A.1. Putting the above equation into Eq. 6 and Eq. 7, and we obtain the mean and the covariance of the (Laplace) approximate log-posterior in . Then, according to Eqs. 9 and 8, the posterior is achieved:
The time complexity of evaluating the LBPP objective function is due to the inversion of an matrix in Eq. 7 where is the number of points. Interestingly, in comparison with LBPP, while our model is in some sense more complex, it enjoys a more favorable computational complexity. For our method, estimating given the branching structure takes constant time. For , the time complexity is with , where is the number of ’s offspring and is the expected . Since the branching factor for is finite, is bounded. As a result, the time complexity for fitting is , that is linear in the number of points. While the naive complexity for is , Halpin (2012) provides methods to reducing it to . In summary, we have the following complexities:
In this section, we explore the connection between our method and the EM algorithm, and we explain a simple variant which approximates EM.
In Section 1 we mentioned that our method is close to the stochastic EM algorithm (Celeux and Diebolt, 1985). The difference is in the M-step; to obtain an EM algorithm (Dempster et al., 1977) we need only make the following modifications; (1) sample infinite branching structures at each iteration and (2) re-calculate the probability of the branching structure with the maximum-a-posterior (M.A.P.) and given that infinite set of branching structures, our method would be the EM algorithm. More specifically, maximizing the expected log posterior distribution to estimate M.A.P. and given infinite branching structures is equivalent to maximizing the EM objective in the M-step, i.e., maximizing the constrained expected log-likelihood:
where represents the branching structure, the probabilities of triggering relationships shown as Section 3.1 and Section 3.1, and are parameters of the Gamma prior of . The second line is obtained using Bayes’ rule and the third line is common in EM algorithm based estimations (Lewis and Mohler, 2011; Zhou et al., 2013). Finally, note that (2) mentioned in the beginning of this section is the same as the E-step of the EM algorithm.
The above discussion is suggestive of a variant of the Gibbs-Hawkes algorithm of Section 3, which we name EM-Hawkes, an approximate EM algorithm. Specifically, the difference to Gibbs-Hawkes is that EM-Hawkes (1) samples some finite number of cluster assignments (to approximate the expected log posterior distribution), and (2) finds M.A.P triggering kernels and background intensities rather than sampling them as per block Gibbs sampling (the M-step of the EM algorithm). An overview of the Gibbs-Hawkes, EM-Hawkes and EM algorithms is illustrated in Fig. 5.
Note that under our LBPP-like posterior, finding the most likely triggering kernel is intractable (shown in Section A.3). As an approximation we take the element-wise mode of the marginals of to approximate the mode of posterior, . A natural alternative, which we found to be similar in practice, would be to take the most likely (which is tractable as it is multivariate Gaussian) and apply .
In this section, we compare our proposed approaches Gibbs-Hawkes and EM-Hawkes to the parametric Hawkes, on synthetic data and on online diffusions data.
For the parametric Hawkes and EM-Hawkes, the predictions and are taken to be the MAP values, while for Gibbs-Hawkes we use the posterior mean.
Each toy model generates 400 point sequences over , which are evenly split into 40 groups, 20 for training and 20 for test. Each of the three methods fit on each group, i.e.
, summing log-likelihoods for 10 sequences (for the parametric Hawkes) or estimating the log posterior probability of the Hawkes process given 10 sequences (for Gibbs-Hawkes and EM-Hawkes). Since the true processes are known, we evaluate based on the L2 distance between predicted and trueand ,
The parametric Hawkes is equipped with a constant background intensity and an exponential triggering kernel (Eq. 26, below), which has two parameters and is estimated by the maximum likelihood estimation. For Gibbs-Hawkes and EM-Hawkes, we must select parameters of the GP kernel (Eqs. 18, 17 and 16). Having many basis functions leads to a high fitting accuracy, but low speed. We found that using 32 basis functions provides a suitable balance. For kernel parameters of Eq. 17, we choose . 5000 iterations are run to fit each group and first 1000 are ignored (i.e. burned-in).
Fig. 10 shows the L2 distance between the learned and the true and
on test data (for a paired t-test see the appendixTable 1). We observe that when the synthetic dataset is generated using a cosine decaying kernel (Fig. 10(a)) both Gibbs-Hawkes and EM-Hawkes perform statistically significant better than the exponential parametric form. For synthetic data generated using an exponential kernel (Fig. 10(c)), the exponential parametric version is only slightly better than our two non-parametric approaches. For estimating , Gibbs-Hawkes performs similarly to the parametric method, and both methods outperform EM-Hawkes (shown in Fig. 10(b)(d)). The learned triggering kernels for and are shown in Figs. 17 and 4
(a). In summary, our methods achieve very good performances for data generated by kernels from several parametric classes, whereas the parametric models are only efficient for data issued from their own class.
For Gibbs-Hawkes and EM-Hawkes, we record the average time taken in each iteration for different dataset sizes, shown in the appendix (Fig. 18). The ratio between the computational time and the number of points is roughly constant, which empirically confirms that the complexity of our method is . The main contribution to the scalability of our approaches is the stochastic branching structure which decomposes a Hawkes process into a cluster of Poisson processes. This important advantage of scalability makes our methods applicable to large datasets.
We evaluate the performance of our two proposed approaches (Gibbs-Hawkes and EM-Hawkes) on two Twitter datasets, containing retweet cascades. A retweet cascade contains an original tweet, together with its direct and indirect retweets. Current state of the art diffusion modeling approaches (Zhao et al., 2015; Mishra et al., 2016; Rizoiu et al., 2018) are based on the self-exciting assumption: users get in contact with online content, and then diffuse it to their friends, therefore generating a cascading effect. The two datasets we use in this work have been employed in prior work and they are publicly available:
ACTIVE (Rizoiu et al., 2018) contains 41k retweet cascades, each containing at least 20 (re)tweets with links to Youtube videos. It was collected in 2014 and each Youtube video (and therefore each cascade) is associated with a Youtube category, such as Music, News or Film.
SEISMIC (Zhao et al., 2015) contains 166k randomly sampled retweet cascades, collected in from October 7 to November 7, 2011. Each cascade contains at least 50 tweets.
The temporal extent of each cascade is scaled to , and assigned to either training or test data with equal probability. We bundle together groups of 30 cascades of similar size, and we estimate one Hawkes process for each bundle. Unlike for the synthetic dataset, for the retweet cascades dataset there is no true Hawkes process to evaluate against. Instead, we measure using log-likelihood how well the learned model generalizes to the test set. We use the same hyper-parameters values as for the synthetic data.
For each dataset, we show in Fig. 13 the log-likelihood per event obtained by our approaches and by the exponential parametric Hawkes (Eq. 26), together with significance results (detailed in the appendix Table 2). Visibly, Gibbs-Hawkes and EM-Hawkes consistently outperform the parametric method on both datasets (by 22% and 12% respectively), with EM-Hawkes obtaining slightly better performances than Gibbs-Hawkes (by 0.6% and 0.4% respectively). This seems to indicate that online diffusion is influenced by factors not captured by the parametric kernel, therefore justifying the need to learn the Hawkes kernels non-parametrically.
We show in Fig. 17 the learned kernels for information diffusions. We notice that the learned kernels appear to be decaying and long-tailed, in accordance with the prior literature. Fig. 17(b) shows that the kernel learned on SEISMIC is decaying faster than the kernel learned on ACTIVE. This indicates that non-specific (i.e. random) cascades have a faster decay than video-related cascades, presumably due to the fact that Youtube videos stay longer in the human attention. This connection between the type of content and the speed of the decay seems further confirmed in Fig. 17(c), where we show the learned kernels for two categories in ACTIVE: Music and Pets & Animals. Cascades relating to Pets & Animals have a faster decaying kernel than Music, most likely because Music is an ever-green content.
In this paper, we develop a general framework to perform non-parametric Bayesian estimations for the Hawkes processes. Our method iterates two steps. First, it samples the branching structure which transforms the Hawkes process estimations into a cluster of Poisson processes. Next, it estimates the intensity of the Hawkes process using a non-parametric Bayesian estimation of the intensity of the resulted Poisson processes. We analyze the connection between our method and the EM algorithm, and we demonstrate the flexibility and the scalability of our method on synthetic data. On two large Twitter diffusion datasets, our method outperforms the parametric Hawkes process and the learned non-parametric kernel reflects the longevity of different types of content.
Limitations and future work Our current framework is limited to the univariate unmarked Hawkes process. Future work includes extensions to the marked multivariate Hawkes process.
Proceedings of the 26th Annual International Conference on Machine Learning, pages 9–16. ACM.
Proceedings of the 26th Annual International Conference on Machine Learning, pages 9–16. ACM.
Accompanying the submission Efficient Non-parametric Bayesian Hawkes Processes.
We consider , the background intensity , , defined in Eq. 4 and data , and the integral term in the log-likelihood is calculated as below
In our case, Eq. 18 has , i.e., , . The matrix is calculated as below:
Here we demonstrate in detail the computational challenges involved in finding the posterior mode with respect to the value of the triggering kernel at multiple point locations. Consider the triggering kernel where is Gaussian process distributed. For a dataset , has a normal distribution, i.e., where and are the mean and the covariance matrix. The distribution of is derived as below where is the cumulative density function and the probabilistic density function.
where is the Cartesian product. There are summations of exponential functions, which is intractable.