1 Introduction
The Hawkes process (Hawkes, 1971) is particularly useful to model self-exciting point data, i.e., when the occurrence of a point increases the likelihood of occurrence of new points. The process is parametrised by a background intensity $\mu$ and a triggering kernel $\phi$. The Hawkes process can alternatively be represented as a cluster of Poisson processes (Hawkes and Oakes, 1974), where immigrant points are generated following a Poisson process with intensity $\mu$ (denoted $\mathrm{PP}(\mu)$), and every existing point triggers new points following a Poisson process $\mathrm{PP}(\phi)$. Points can therefore be structured into clusters, also known as the branching structure, where each cluster contains a point and its direct offspring. The triggering kernel is shared among all cluster Poisson processes relating to a Hawkes process, and it determines the overall behavior of the process. Consequently, designing the kernel function is of utmost importance when applying the Hawkes process to a new application, and its study has attracted much attention.
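The cluster representation above lends itself to a simple simulation scheme. The sketch below is our own minimal illustration, not the paper's method; it assumes a constant background intensity mu and an exponential triggering kernel phi(t) = alpha * beta * exp(-beta * t), so that the branching ratio (expected offspring per point) is alpha. The function name and parametrisation are ours.

```python
import numpy as np

def simulate_hawkes_cluster(mu, alpha, beta, T, seed=0):
    """Simulate a Hawkes process on [0, T] via the cluster representation of
    Hawkes and Oakes (1974): immigrants arrive as PP(mu), and each point
    spawns offspring from PP(phi) with phi(t) = alpha * beta * exp(-beta * t)."""
    rng = np.random.default_rng(seed)
    # Immigrants: homogeneous Poisson process with rate mu on [0, T].
    queue = list(rng.uniform(0.0, T, size=rng.poisson(mu * T)))
    points = []
    while queue:
        parent = queue.pop()
        points.append(parent)
        # Each point triggers Poisson(alpha) offspring with Exp(beta) delays,
        # since the integral of phi over [0, infinity) equals alpha.
        for delay in rng.exponential(1.0 / beta, size=rng.poisson(alpha)):
            if parent + delay <= T:
                queue.append(parent + delay)
    return np.sort(np.array(points))

events = simulate_hawkes_cluster(mu=1.0, alpha=0.5, beta=2.0, T=100.0)
# With branching ratio alpha < 1, the expected count is mu * T / (1 - alpha).
print(len(events))
```

Each popped point together with its sampled offspring forms one cluster of the branching structure.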
Related Work.
When the optimal triggering kernel for a particular application is unknown, a typical solution is to express it in a nonparametric form, as in the work of Lewis and Mohler (2011); Zhou et al. (2013); Bacry and Muzy (2014); Eichler et al. (2017). These are all frequentist methods which do not quantify the uncertainty of the learned triggering kernels. Bayesian inference for Hawkes processes has also been studied (Rasmussen, 2013; Linderman and Adams, 2015), but these approaches require either constructing a parametric triggering kernel or discretising the input domain to scale with the data size; the latter leads to poor scaling with the dimension of the domain. Donnet et al. (2018) overcome these shortcomings by proposing a continuous nonparametric Bayesian Hawkes process; however, they resort to an unscalable Markov chain Monte Carlo (MCMC) estimator of the posterior distribution. More recently, a solution based on Gibbs sampling (Zhang et al., 2019) obtains linear time complexity by exploiting the Hawkes process's branching structure and the stationarity of the triggering kernel, but this approach may suffer from slow convergence.
Contributions.
In this paper, we address the shortcomings of the previous solutions and make the following contributions. (1) We introduce a new Bayesian nonparametric Hawkes process in which the triggering kernel is modulated by a sparse Gaussian process (Lloyd et al., 2015) and the background intensity is Gamma distributed. We further introduce a variational inference scheme for this model which allows fast convergence. We denote this approach Variational Bayesian Hawkes Processes (VBHP). (2) We obtain linear time complexity by utilising the stationarity of the triggering kernel. To the best of our knowledge, this is the first sparse Gaussian process modulated Hawkes process which employs variational inference, has linear time complexity, and scales to large real-world data. We additionally contribute a lower bound on the marginal likelihood that is tighter than the evidence lower bound. This tighter lower bound also holds for other graphical models, such as variational Gaussian mixture models (Attias, 1999). (3) We further show that VBHP provides more accurate predictions than state-of-the-art methods on synthetic data and on two large-scale online diffusion datasets. We also validate the efficiency of VBHP, and we demonstrate that the approximate marginal likelihood is effective for performing model selection.
2 Prerequisite
In this section, we review the Hawkes process, variational inference, and the variational Bayesian Poisson process (Lloyd et al., 2015).
2.1 Hawkes Process (HP)
The Hawkes process (Hawkes, 1971) is a self-exciting point process, in which the occurrence of a point increases the arrival rate, i.e., the (conditional) intensity, of new points. Given a set of points $\mathcal{D} = \{t_i\}_{i=1}^{N} \subset [0, T]$, the intensity at $t$ conditioned on the given points is written as:
$$\lambda(t \mid \mathcal{D}) = \mu + \sum_{t_i < t} \phi(t - t_i), \qquad (1)$$
where $\mu$, considered as a constant, is the background intensity, and $\phi$ is the triggering kernel defined over the data domain. The log-likelihood of $\mathcal{D}$ for $\mu$ and $\phi$ is given by Rubin (1972):
$$\log p(\mathcal{D} \mid \mu, \phi) = \sum_{i=1}^{N} \log \lambda(t_i \mid \mathcal{D}) - \int_{0}^{T} \lambda(t \mid \mathcal{D})\,\mathrm{d}t. \qquad (2)$$
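For a concrete instance of Eq. (2), the sketch below evaluates the log-likelihood for the common exponential kernel phi(t) = alpha * beta * exp(-beta * t), for which the compensator integral is available in closed form. The function and its parametrisation are illustrative assumptions, not the paper's model.

```python
import numpy as np

def hawkes_loglik_exp(times, mu, alpha, beta, T):
    """Eq. (2) for phi(t) = alpha*beta*exp(-beta*t):
    log L = sum_i log lambda(t_i) - mu*T - sum_i alpha*(1 - exp(-beta*(T - t_i)))."""
    times = np.asarray(times)
    # Compensator: integral of the intensity over [0, T], in closed form.
    ll = -mu * T - np.sum(alpha * (1.0 - np.exp(-beta * (T - times))))
    # Event terms: log intensity at each point, conditioned on earlier points.
    for i, t in enumerate(times):
        ll += np.log(mu + np.sum(alpha * beta * np.exp(-beta * (t - times[:i]))))
    return ll

ll = hawkes_loglik_exp([0.5, 1.0, 1.4, 3.0], mu=1.0, alpha=0.5, beta=2.0, T=4.0)
print(ll)
```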
2.2 Variational Inference (VI)
Consider the latent variable model $p(x, z)$ where $x$ and $z$ are the observed and the latent variables. The variational approach introduces a variational distribution $q(z)$ to approximate the posterior $p(z \mid x)$ and maximises a lower bound of the log-likelihood, which can be derived from the non-negative-gap perspective:
$$\log p(x) = \underbrace{\mathbb{E}_{q(z)}\!\left[\log p(x, z) - \log q(z)\right]}_{\text{ELBO}} + \underbrace{\mathrm{KL}\!\left(q(z) \,\|\, p(z \mid x)\right)}_{\geq 0}. \qquad (3)$$
Optimising the evidence lower bound (ELBO) w.r.t. $q$ balances the reconstruction error against the Kullback-Leibler (KL) divergence from the prior. Generally, the conditional $p(x \mid z)$ is known, as is the prior $p(z)$. Thus, for an appropriate choice of $q$, it is easier to work with this lower bound than with the intractable posterior $p(z \mid x)$. We also see that, due to the form of the intractable gap, if $q$ is chosen from a family of distributions containing elements close to the true unknown posterior, then the variational lower bound will be close to the true log-likelihood, and $q$ will be close to the true posterior. An alternative derivation applies Jensen's inequality.
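The non-negative-gap decomposition can be verified numerically on a conjugate model where the evidence is exact. The sketch below is our own illustration using a Beta-Bernoulli model: any variational Beta distribution gives an ELBO below the log evidence, and the exact posterior closes the gap.

```python
from scipy.special import betaln, digamma

def elbo_beta_bernoulli(a, b, a0, b0, k, n):
    """ELBO for k heads out of n Bernoulli trials, a Beta(a0, b0) prior and a
    variational posterior q = Beta(a, b):
    ELBO = E_q[log p(x | theta)] + E_q[log p(theta)] - E_q[log q(theta)]."""
    e_log_t = digamma(a) - digamma(a + b)      # E_q[log theta]
    e_log_1mt = digamma(b) - digamma(a + b)    # E_q[log (1 - theta)]
    e_loglik = k * e_log_t + (n - k) * e_log_1mt
    e_logprior = -betaln(a0, b0) + (a0 - 1) * e_log_t + (b0 - 1) * e_log_1mt
    entropy = (betaln(a, b) - (a - 1) * digamma(a) - (b - 1) * digamma(b)
               + (a + b - 2) * digamma(a + b))
    return e_loglik + e_logprior + entropy

k, n, a0, b0 = 7, 10, 1.0, 1.0
log_evidence = betaln(a0 + k, b0 + n - k) - betaln(a0, b0)
# The gap is the KL term of Eq. (3): positive for a mismatched q,
# zero when q equals the exact posterior Beta(a0 + k, b0 + n - k).
gap_bad = log_evidence - elbo_beta_bernoulli(2.0, 2.0, a0, b0, k, n)
gap_opt = log_evidence - elbo_beta_bernoulli(a0 + k, b0 + n - k, a0, b0, k, n)
print(gap_bad, gap_opt)
```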
2.3 Variational Bayesian Poisson Process (VBPP)
VBPP (Lloyd et al., 2015) applies VI to a Bayesian Poisson process which exploits a sparse Gaussian process (GP) to model the Poisson intensity. Specifically, the sparse GP employs the ARD kernel
$$k(x, x') = \gamma \prod_{r=1}^{R} \exp\!\left(-\frac{(x_r - x'_r)^2}{2\alpha_r}\right), \qquad (4)$$
where $\gamma$ and $\{\alpha_r\}$ are GP hyperparameters. Let $u = f(Z)$, where $Z$ are the inducing points and $f$ is drawn from the sparse GP. The prior and the approximate posterior distributions of $u$ are the Gaussians $p(u) = \mathcal{N}(u \mid 0, K_{zz})$ and $q(u) = \mathcal{N}(u \mid m, S)$, where $m$ and $S$ are the variational mean vector and covariance matrix respectively. Note that both the GP and the prior on $u$ employ zero means. The notations of VBPP are connected with those of Section 2.2 in Table 1.

Table 1: Corresponding notations of VBHP, VBPP and Section 2.2.

Importantly, the approximate joint distribution of $f$ and $u$ uses the exact conditional distribution $p(f \mid u)$, i.e.,

$$q(f, u) = p(f \mid u)\, q(u), \qquad (5)$$
which in turn leads to the posterior GP
$$q(f) = \mathcal{GP}\!\left(\tilde{\mu}(x),\, \tilde{\sigma}^2(x)\right), \quad \tilde{\mu}(x) = k_{xz} K_{zz}^{-1} m, \quad \tilde{\sigma}^2(x) = k_{xx} - k_{xz} K_{zz}^{-1} \left(K_{zz} - S\right) K_{zz}^{-1} k_{zx}. \qquad (6)$$
Then, the ELBO can be simplified as below
$$\mathrm{ELBO} = \mathbb{E}_{q(f)}\!\left[\log p(\mathcal{D} \mid f)\right] - \mathrm{KL}\!\left(q(u) \,\|\, p(u)\right), \qquad (7)$$
where, to make the notation concise, we omit $u$ in the conditions of the variational distributions, and the GP hyperparameters in the conditions of the priors and likelihoods. The first term on the r.h.s. is simplified due to the conditional independence between $\mathcal{D}$ and $u$ given $f$, while the second term is simplified due to the exact conditional $p(f \mid u)$. Notably, the second term is the KL divergence between two multivariate Gaussians, and so is available in closed form. The first term can be easily handled for a squared link function mapping $f$ to the Poisson intensity, and it turns out to be the expectation w.r.t. $q(f)$ of the log-likelihood. The expectation of the integral part is relatively straightforward to compute, and the expectation of the other (data-dependent) part is available in almost closed form, with a hypergeometric function making things a little tricky to implement.
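The closed-form KL divergence between the multivariate Gaussians q(u) = N(m, S) and the zero-mean prior p(u) = N(0, K_zz) can be sketched as below; the function name and variable names are ours.

```python
import numpy as np

def kl_mvn(m, S, K):
    """KL( N(m, S) || N(0, K) ) in closed form:
    0.5 * ( tr(K^-1 S) + m^T K^-1 m - M + log|K| - log|S| )."""
    M = len(m)
    trace_term = np.trace(np.linalg.solve(K, S))
    maha = float(m @ np.linalg.solve(K, m))   # Mahalanobis term m^T K^-1 m
    logdet_K = np.linalg.slogdet(K)[1]
    logdet_S = np.linalg.slogdet(S)[1]
    return 0.5 * (trace_term + maha - M + logdet_K - logdet_S)

kl_same = kl_mvn(np.zeros(3), np.eye(3), np.eye(3))  # identical distributions
kl_mean = kl_mvn(np.ones(3), np.eye(3), np.eye(3))   # only the mean differs
print(kl_same, kl_mean)
```

With identical covariances, only the Mahalanobis term survives, so the second call reduces to 0.5 * ||m||^2.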
3 Variational Bayesian Hawkes Process (VBHP)
Notations
To extend VBPP to HP, we introduce a latent branching structure $B$ and a latent background intensity $\mu$. The same priors are adopted for $f$ and $u$, and the same approximate joint posterior over $f$ and $u$ as in Eq. 5. We assume that the prior distribution of $\mu$ is a Gamma distribution, and that the posterior distribution is approximated by another Gamma distribution $q(\mu)$. The notations of VBHP are summarised in Table 1.
We further employ an approximate posterior distribution of the branching structure, $q(B)$, to complete our variational distribution, so that
$$q(f, u, \mu, B) = p(f \mid u)\, q(u)\, q(\mu)\, q(B). \qquad (8)$$
Based on this, the VBHP ELBO is available in closed form.
ELBO
First, we concisely derive the VBHP ELBO as below (see details in Section A.1 (App., 2019)):
(9) 
where the second line is obtained by simplifications analogous to those of Eq. 7, and the third line is due to merging the reconstruction term with the KL term of $q(B)$. The KL terms w.r.t. $q(\mu)$ and $q(u)$ are KL divergences between Gamma distributions and between Gaussian distributions respectively, which have closed-form expressions:
(10)  
(11) 
where $\Gamma$ and $\psi$ are the Gamma and digamma functions. Now, we are left with the problem of computing the data-dependent expectation of Eq. 9.
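The closed-form Gamma KL divergence mentioned above can be sketched as follows, using the shape/rate parametrisation; the paper's own parametrisation of the Gamma distributions may differ, and the function name is ours.

```python
from math import lgamma, log
from scipy.special import digamma

def kl_gamma(a1, b1, a2, b2):
    """KL( Gamma(shape a1, rate b1) || Gamma(shape a2, rate b2) ):
    (a1 - a2)*psi(a1) - lnGamma(a1) + lnGamma(a2)
    + a2*(ln b1 - ln b2) + a1*(b2 - b1)/b1."""
    return ((a1 - a2) * digamma(a1) - lgamma(a1) + lgamma(a2)
            + a2 * (log(b1) - log(b2)) + a1 * (b2 - b1) / b1)

kl_same = kl_gamma(2.0, 1.0, 2.0, 1.0)  # identical distributions give zero
kl_diff = kl_gamma(3.0, 2.0, 2.0, 1.0)  # strictly positive otherwise
print(kl_same, kl_diff)
```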
Data Dependent Expectation (DDE)
The derivation of the DDE has two major steps. First, we compute the expectation of the log-likelihood w.r.t. the variational distribution of the branching structure $q(B)$, which is given below (see details in Section A.2 (App., 2019)):
(12) 
where $q_{ij}$ is the approximate posterior probability that point $j$ triggered point $i$ ($j = 0$ represents the background). Interestingly, the branching structure decomposes the Hawkes log-likelihood into weighted Poisson log-likelihoods, which allows existing Bayesian Poisson process models to be applied to Hawkes processes.
Second, due to the independence among the variational distributions $q(f, u)$, $q(\mu)$ and $q(B)$, plugging Eq. 12 into the DDE term of Eq. 9 results in separate expectations over $q(f, u)$, $q(\mu)$ and $q(B)$:
(13) 
where $H(q(B))$ is the entropy of the branching structure variable, and the remaining term is identical to its counterpart in VBPP (as in Eq. 6). Maximising the VBHP ELBO balances three parts: the expected log joint likelihood of the point data and the branching structure, the diversity of the branching structure, and the KL divergences over the background intensity and over the inducing points.
As in VBPP, the data-dependent term is available in closed form involving a hypergeometric function:
(14) 
where $\tilde{\mu}$ and $\tilde{\sigma}^2$ are the posterior mean and variance defined in Eq. 6, $C$ is the Euler-Mascheroni constant, and $\tilde{G}$ denotes the partial derivative of the confluent hypergeometric function w.r.t. its first argument. The computation of $\tilde{G}$ can use the method of Ancarani and Gasaneo (2008); we implement it by linear interpolation of a lookup table (see the demo in the supplementary material).
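As a rough illustration of the lookup-table approach, the sketch below approximates the partial derivative of 1F1 w.r.t. its first argument by central finite differences (a cheap stand-in for the exact series of Ancarani and Gasaneo, 2008), tabulates it on a grid, and interpolates linearly with np.interp. The argument values and grid are arbitrary choices of ours, not the paper's.

```python
import numpy as np
from scipy.special import hyp1f1

def dhyp1f1_da(a, b, z, eps=1e-6):
    """Partial derivative of the confluent hypergeometric function 1F1(a; b; z)
    w.r.t. the first argument, via central finite differences."""
    return (hyp1f1(a + eps, b, z) - hyp1f1(a - eps, b, z)) / (2.0 * eps)

# Lookup table over z plus linear interpolation, as described in the text.
zs = np.linspace(0.0, 5.0, 501)
table = np.array([dhyp1f1_da(-0.5, 0.5, -z) for z in zs])
approx = float(np.interp(2.3, zs, table))
exact = dhyp1f1_da(-0.5, 0.5, -2.3)
print(abs(approx - exact))  # interpolation error on the grid
```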
For the integral term, Lloyd et al. (2015) give analytical expressions, which we adapt here as:
(15) 
where $\times$ denotes the Cartesian product, the integration region is the sampling region truncated by the kernel support, and each integral term has a closed form (see derivations in Sections A.5, A.4 and A.3 (App., 2019)):
(16)  
(17)  
(18) 
where .
Optimisation
To optimise the model parameters, we employ an EM-style algorithm. Specifically, the branching-structure probabilities $q_{ij}$ are optimised in the E step, and the remaining variational parameters are updated to increase the ELBO in the M step. The updating equations for the $q_{ij}$'s can be derived by the method of Lagrange multipliers, and are given directly below:
(19) 
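A point-estimate sketch of the E step: the probability that a point was triggered by the background or by an earlier point is proportional to the corresponding intensity, normalised to sum to one, which is the constraint the Lagrange multiplier enforces. Here plain values of mu and phi stand in for the variational expectations used by VBHP; the function name is ours.

```python
import numpy as np

def e_step_responsibilities(times, mu, phi):
    """For each point t_i, return the normalised probabilities that it was
    triggered by the background (index 0) or by each earlier point t_j."""
    P = []
    for i, t in enumerate(times):
        # Unnormalised weights: background intensity, then phi(t_i - t_j).
        w = np.array([mu] + [phi(t - tj) for tj in times[:i]])
        P.append(w / w.sum())
    return P

phi = lambda d: 1.0 * np.exp(-2.0 * d)  # alpha*beta*exp(-beta d), alpha=0.5, beta=2
P = e_step_responsibilities([0.5, 1.0, 3.0], mu=1.0, phi=phi)
print([np.round(p, 3) for p in P])
```

The first point has no possible parent, so its responsibility is placed entirely on the background.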
Similarly to VBPP, we fix the inducing points on a regular grid over the domain, which reduces the function variance at locations far from data. The selection of GP kernel hyperparameters is based on maximising the marginal likelihood, which is approximated by a tighter lower bound given in the following paragraph. Despite the observation that more inducing points lead to better fitting accuracy (Lloyd et al., 2015; Snelson and Ghahramani, 2006), in the case of our more complex VBHP, more inducing points may cause slow convergence (Fig. (a)a (App., 2019)) under some hyperparameters, and therefore poor performance within a limited number of iterations. Moreover, more inducing points require more fit time, so a suitable number of inducing points balances accuracy against time cost.

Marginal Likelihood
A common approximation to the log marginal likelihood is the ELBO. For our VBHP, we derive a tighter approximation to the marginal likelihood (see the derivation in Appendix B (App., 2019)):
(20) 
which is in fact the data-dependent expectation part of the ELBO. This approximation is tighter because it does not subtract the non-negative KL divergences over the background intensity and over the inducing points. However, maximising it does not guarantee minimising the intractable KL divergences between the approximate and true posteriors of the background intensity and the inducing points, that is, the intractable non-negative gap in Eq. 3. This is an important reason why tighter variational bounds are not necessarily better (Rainforth et al., 2018), and we therefore do not use it in place of the ELBO as the objective function. Some other graphical models, such as the variational Gaussian mixture model (Attias, 1999), have similar tighter lower bounds.
Time Complexity and Acceleration
In the E step of model optimisation, updating the $q_{ij}$'s requires computing the posterior mean and variance of the GP at all pairwise differences between points, which both take time quadratic in the data size. Here we omit the data dimension, since it is normally small for regular-grid-located inducing points. Similarly, in the M step, computing the hypergeometric term needs the mean and the variance at all pairwise differences, and the further computation of the integral terms adds a cost depending only on the number of inducing points. Thus, the total time complexity per iteration is quadratic in the data size.
To accelerate VBHP, we utilise the stationarity of the triggering kernel, as in Zhang et al. (2019): beyond some point, the kernel has negligible values. This is reasonable, as a HP with finite expected cluster size has a triggering kernel whose integral over $[0, \infty)$ is less than 1. Our work scales the domain without breaking this stationarity. Specifically, we assume a bounded region as the support of the triggering kernel, outside of which the kernel vanishes, which reduces the possible parents of a point from the whole history to the local points within the support. As a result, the total time complexity is reduced to be linear in the data size.
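The effect of the assumed support can be sketched by counting the candidate (child, parent) pairs before and after truncation; the numbers and the window length below are illustrative, and the function name is ours.

```python
import numpy as np

def parent_candidates(times, support):
    """Candidate (child, parent) index pairs when the triggering kernel is
    assumed to vanish beyond `support`: each point keeps only parents within
    the window, instead of its entire history."""
    times = np.asarray(times)
    pairs = []
    for i, t in enumerate(times):
        j = int(np.searchsorted(times, t - support))  # first parent in window
        pairs.extend((i, k) for k in range(j, i))
    return pairs

times = np.sort(np.random.default_rng(0).uniform(0.0, 100.0, 500))
full = 500 * 499 // 2                        # all ordered pairs: O(N^2)
local = len(parent_candidates(times, 2.0))   # pairs within the support
print(local, full)
```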
Predictive Distribution of the Triggering Kernel
The predictive distribution of the triggering kernel depends on the posterior of $f$, and we assume that the optimal variational distribution approximates the true posterior, namely $q^*(f) \approx p(f \mid \mathcal{D})$. Therefore, the approximate predictive distribution of $f$ is the posterior GP of Eq. 6. Given the relation between $f$ and the triggering kernel, it is straightforward to derive the corresponding predictive distribution of the kernel, along with its shape and scale parameters.
Extension to the Multivariate Hawkes Process
The above derivations are based on the univariate Hawkes process, which is easier to understand than the counterpart for the multivariate Hawkes process; the latter can be obtained following the same derivations.
4 Experiments
Evaluation
We employ two metrics: the $L_2$ distance (only when the ground truth is known) to measure the difference between the predictive and true models, evaluated on both the triggering kernel and the background intensity; and the held-out log-likelihood (HLL, as in Eq. 2) to describe how well the predictive model fits the data. More specifically for the HLL, points in each original sequence are randomly assigned to either the training or the test data with equal probability, and evaluations over a number of test sets are averaged. To reduce the impact of the sequence length, we divide the HLL by the number of points.
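The HLL protocol described above can be sketched as a per-sequence random split; the 50/50 assignment follows the text, while the function name and the rest of the scaffolding are ours.

```python
import numpy as np

def split_sequence(times, seed=0):
    """Assign each point of a sequence to train or test with probability 1/2
    each, as in the HLL protocol; the HLL is then the held-out log-likelihood
    of the fitted model divided by the number of test points."""
    rng = np.random.default_rng(seed)
    times = np.asarray(times)
    mask = rng.random(len(times)) < 0.5
    return times[mask], times[~mask]

train, test = split_sequence(np.linspace(0.0, 10.0, 100))
print(len(train), len(test))  # complementary subsets of the 100 points
```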
Prediction
For the two metrics above, we use the pointwise mode of the posterior triggering kernel as the prediction, because it is computationally intractable to find the posterior mode at multiple point locations jointly (Zhang et al., 2019). Similarly, we use the mode of the posterior background intensity as the predictive background intensity.
Baselines
We use the following models as baselines. (1) A parametric Hawkes process equipped with a sum-of-exponentials (SumExp) triggering kernel and a constant background intensity. (2) The ordinary differential equation (ODE) based nonparametric non-Bayesian Hawkes process (Zhou et al., 2013); the code is publicly available (Bacry et al., 2017). (3) The Wiener-Hopf (WH) equation based nonparametric non-Bayesian Hawkes process (Bacry and Muzy, 2014); the code is publicly available (Bacry et al., 2017). (4) The Gibbs sampling based Bayesian nonparametric Hawkes process (Gibbs Hawkes) (Zhang et al., 2019). For fairness, the ARD kernel is used and the corresponding eigenfunctions are approximated by the Nyström method (Williams and Rasmussen, 2006), with the same regular grid points as VBHP. Unlike Zhang et al. (2019), Gibbs Hawkes is trained on single sequences.

4.1 On Synthetic Data
Our synthetic data are generated from three Hawkes processes whose triggering kernels are sine (Sin, Eq. 21), cosine (Cos, Eq. 22) and exponential (Exp, Eq. 23) functions respectively, and whose background intensities are the same constant.
(21)  
(22)  
(23) 
Model Selection
As the marginal likelihood is a key advantage of our method over non-Bayesian approaches (Zhou et al., 2013; Bacry and Muzy, 2014), we investigate its efficacy for model selection. Fig. (b)b shows the contour plot of the approximate log marginal likelihood for a single sample. We observe that the contour plot of the approximate marginal likelihood has a shape similar to those of the contour plots of the $L_2$ error (Fig. (c)c) and of the HLL (Fig. (d)d): GP hyperparameters with relatively high marginal likelihoods have relatively low errors and relatively high HLL. Fig. (a)a plots the posterior triggering kernel corresponding to the maximal approximate marginal likelihood. This demonstrates the practical utility of both the marginal likelihood itself and our approximation to it. A similar observation holds for Gibbs Hawkes, as shown in Fig. 18 (App., 2019).
Evaluation
To evaluate VBHP on synthetic data, 20 sequences are drawn from each model, and 100 pairs of training and test sequences are drawn from each sample to compute the HLL. We select the GP hyperparameters of Gibbs Hawkes and of VBHP by maximising the approximate marginal likelihood. Table 2 shows the evaluation of the baselines and of VBHP (using 10 inducing points as a trade-off between accuracy and time, as does Gibbs Hawkes) on both the $L_2$ distance and the HLL. VBHP achieves the best performance but is two orders of magnitude slower than Gibbs Hawkes (shown in Figs. (b)b and (a)a). Interestingly, SumExp fits best in $L_2$ distance by using a single exponential function, while, due to learning on single sequences, its background intensity has relatively high errors.
4.2 Real World Data
We exploit two large-scale tweet datasets to further our conclusions. ACTIVE (Rizoiu et al., 2018) is a tweet dataset collected in 2014, containing around 41k (re)tweet temporal point processes with links to YouTube videos; each sequence contains at least 20 (re)tweets. SEISMIC (Zhao et al., 2015) is a large-scale tweet dataset collected from October 7 to November 7, 2011, containing around 166k (re)tweet temporal point processes; each sequence contains at least 50 (re)tweets.
Evaluation
Similarly to the synthetic experiments, we evaluate the fitting performance by averaging the HLL of 20 test sequences randomly drawn from each original sequence. We scale all original data to the unit domain and use 10 inducing points for VBHP and Gibbs Hawkes as a trade-off between time and accuracy. Model selection is again performed by maximising the approximate marginal likelihood. The results are shown in Table 2, and again VBHP performs best. This further demonstrates that the marginal likelihood and our approximation to it are useful in practice.
Fit Time
We further evaluate the fitting speed of VBHP and Gibbs Hawkes on sample synthetic and real-world point processes, summarised in Figs. (b)b and (a)a. (The CPU used is an Intel(R) Core(TM) i7-7700 @ 3.60GHz, and the language is Python 3.6.5.) The fit time is averaged over iterations, and we observe that the increasing trends with the number of inducing points and with the data size are similar between Gibbs Hawkes and VBHP. Although VBHP is significantly slower per iteration than Gibbs Hawkes, VBHP converges in 10-20 iterations (Fig. 14 (App., 2019)), giving an average convergence time of 549 seconds for a sequence of 1000 events, compared to 699 seconds for Gibbs Hawkes. The slope of VBHP in Fig. (b)b (log scale) is 1.04 and the correlation coefficient is 0.96, so the fit time is linear in the data size.
Table 2: $L_2$ distances between the fitted and true triggering kernels and background intensities (for the Sin, Cos and Exp synthetic models) and HLL (for Sin, Cos, Exp, ACTIVE and SEISMIC), comparing SumExp, ODE, WH, Gibbs Hawkes and VBHP.
5 Conclusions
In this paper, we proposed a VI scheme for a new Bayesian nonparametric Hawkes process whose triggering kernel is modulated by a sparse Gaussian process and whose background intensity is Gamma distributed. To address the difficulty of scaling with the data size, we utilised the stationarity of the triggering kernel to reduce the number of possible parents of each point. We additionally contributed a tighter lower bound on the marginal likelihood, which is empirically shown to be effective for model selection. On synthetic data and on two large Twitter diffusion datasets, VBHP enjoys fit time linear in the data size and a fast convergence rate, and provides more accurate predictions than state-of-the-art approaches.
References
 Ancarani and Gasaneo (2008) Ancarani, L. U. and Gasaneo, G. (2008). Derivatives of any order of the confluent hypergeometric function ${}_1F_1(a, b, z)$ with respect to the parameter a or b. Journal of Mathematical Physics, 49(6):063508.
 App. (2019) App. (2019). Appendix: Variational inference for sparse gaussian process modulated hawkes process. In the supplementary material.

Attias (1999) Attias, H. (1999). Inferring parameters and structure of latent variable models by variational Bayes. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 21–30. Morgan Kaufmann Publishers Inc.
 Bacry et al. (2017) Bacry, E., Bompaire, M., Gaïffas, S., and Poulsen, S. (2017). tick: a Python library for statistical learning, with a particular emphasis on time-dependent modeling. ArXiv e-prints.
 Bacry and Muzy (2014) Bacry, E. and Muzy, J.-F. (2014). Second order statistics characterization of Hawkes processes and non-parametric estimation. arXiv preprint arXiv:1401.0903.
 Donnet et al. (2018) Donnet, S., Rivoirard, V., and Rousseau, J. (2018). Nonparametric Bayesian estimation of multivariate Hawkes processes. arXiv preprint arXiv:1802.05975.
 Eichler et al. (2017) Eichler, M., Dahlhaus, R., and Dueck, J. (2017). Graphical modeling for multivariate Hawkes processes with nonparametric link functions. Journal of Time Series Analysis, 38(2):225–242.
 Hawkes (1971) Hawkes, A. G. (1971). Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90.
 Hawkes and Oakes (1974) Hawkes, A. G. and Oakes, D. (1974). A cluster process representation of a self-exciting process. Journal of Applied Probability, 11(3):493–503.
 Lewis and Mohler (2011) Lewis, E. and Mohler, G. (2011). A nonparametric EM algorithm for multiscale Hawkes processes. Journal of Nonparametric Statistics, 1(1):1–20.
 Linderman and Adams (2015) Linderman, S. W. and Adams, R. P. (2015). Scalable Bayesian inference for excitatory point process networks. arXiv preprint arXiv:1507.03228.
 Lloyd et al. (2015) Lloyd, C., Gunter, T., Osborne, M., and Roberts, S. (2015). Variational inference for Gaussian process modulated Poisson processes. In ICML, pages 1814–1822.
 Rainforth et al. (2018) Rainforth, T., Kosiorek, A. R., Le, T. A., Maddison, C. J., Igl, M., Wood, F., and Teh, Y. W. (2018). Tighter variational bounds are not necessarily better. arXiv preprint arXiv:1802.04537.
 Rasmussen (2013) Rasmussen, J. G. (2013). Bayesian inference for Hawkes processes. Methodology and Computing in Applied Probability, 15(3):623–642.
 Rizoiu et al. (2018) Rizoiu, M.-A., Mishra, S., Kong, Q., Carman, M., and Xie, L. (2018). SIR-Hawkes: Linking epidemic models and Hawkes processes to model diffusions in finite populations. In Proceedings of the 27th International Conference on World Wide Web, pages 419–428. International World Wide Web Conferences Steering Committee.
 Rubin (1972) Rubin, I. (1972). Regular point processes and their detection. IEEE Transactions on Information Theory, 18(5):547–557.
 Snelson and Ghahramani (2006) Snelson, E. and Ghahramani, Z. (2006). Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264.

Williams and Rasmussen (2006) Williams, C. K. and Rasmussen, C. E. (2006). Gaussian Processes for Machine Learning, volume 2. MIT Press, Cambridge, MA.
 Zhang et al. (2019) Zhang, R., Walder, C., Rizoiu, M.-A., and Xie, L. (2019). Efficient Non-parametric Bayesian Hawkes Processes. In International Joint Conference on Artificial Intelligence (IJCAI'19), Macao, China.
 Zhao et al. (2015) Zhao, Q., Erdogdu, M. A., He, H. Y., Rajaraman, A., and Leskovec, J. (2015). SEISMIC: A self-exciting point process model for predicting tweet popularity. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1513–1522. ACM.
 Zhou et al. (2013) Zhou, K., Zha, H., and Song, L. (2013). Learning triggering kernels for multi-dimensional Hawkes processes. In International Conference on Machine Learning, pages 1301–1309.
Appendix A Omitted Derivations
A.1 Deriving the ELBO
As per Eq. 3, we have
where the KL term can be simplified as
To utilise the likelihood, we merge the reconstruction term and the KL term w.r.t. $q(B)$:
Thus, the ELBO is finally written as
A.2 Deriving Eq. 12
First, note that the branching structure is a tree, i.e., a collection of triggering relationships between pairs of points. Each point has a parent ($j = 0$ representing the background), and the triggering events are independent, so the probability of the branching structure is the product of the probabilities of the triggering events, that is
(24) 
where $q_{ij}$ is the approximate posterior probability of the triggering event in which point $j$ triggers point $i$. According to the cluster representation of the Hawkes process, the branching structure decomposes the process into a cluster of Poisson processes. Thus, the log-likelihood turns out to be a sum of log Poisson likelihoods:
(25) 
where, for notational simplification, we introduce shorthand for the per-cluster Poisson terms. Furthermore, the expectation of the above log-likelihood w.r.t. $q(B)$ is derived as below.