Learning Hawkes Processes from a Handful of Events

by   Farnood Salehi, et al.

Learning the causal-interaction network of multivariate Hawkes processes is a useful task in many applications. Maximum-likelihood estimation is the most common approach to solve the problem in the presence of long observation sequences. However, when only short sequences are available, the lack of data amplifies the risk of overfitting and regularization becomes critical. Due to the challenges of hyper-parameter tuning, state-of-the-art methods only parameterize regularizers by a single shared hyper-parameter, hence limiting the power of representation of the model. To solve both issues, we develop in this work an efficient algorithm based on variational expectation-maximization. Our approach is able to optimize over an extended set of hyper-parameters. It is also able to take into account the uncertainty in the model parameters by learning a posterior distribution over them. Experimental results on both synthetic and real datasets show that our approach significantly outperforms state-of-the-art methods under short observation sequences.



There are no comments yet.


page 1

page 2

page 3

page 4


Automatic Hyper-Parameter Optimization Based on Mapping Discovery from Data to Hyper-Parameters

Machine learning algorithms have made remarkable achievements in the fie...

Shadow Estimation Method for "The Episolar Constraint: Monocular Shape from Shadow Correspondence"

Recovering shadows is an important step for many vision algorithms. Curr...

Robust and Accurate Superquadric Recovery: a Probabilistic Approach

Interpreting objects with basic geometric primitives has long been studi...

Maximum likelihood estimation of hidden Markov models for continuous longitudinal data with missing responses and dropout

We propose an inferential approach for maximum likelihood estimation of ...

Hyper-SAGNN: a self-attention based graph neural network for hypergraphs

Graph representation learning for hypergraphs can be used to extract pat...

Fast Estimation of Causal Interactions using Wold Processes

We here focus on the task of learning Granger causality matrices for mul...

A Rigorously Bayesian Beam Model and an Adaptive Full Scan Model for Range Finders in Dynamic Environments

This paper proposes and experimentally validates a Bayesian network mode...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In many real-world applications, including finance, computational biology, social-network studies, criminology, and epidemiology, it is important to gain insight from the interactions of multivariate time series of discrete events. For example, in finance, changes in the price of a stock might affect the market Bacry et al. (2015b); and in epidemiology, individuals infected by an infectious disease might spread the disease to their neighbors Garske et al. (2017). Such networks of time series often exhibit mutually exciting patterns of diffusion. Hence, a recurring issue is to learn in an unsupervised way the causal structure of interacting networks. This task is usually tackled by defining a so-called causal graph of entities where an edge from a node to a node means that events in node depend on the history of node . Such causal interactions are typically learned with either directed information Rao et al. (2008); Quinn et al. (2015), transfer entropy Schreiber (2000), or Granger causality Arnold et al. (2007); Eichler (2012).

A widely used model for capturing mutually exciting patterns in a multi-dimensional time series is the Multivariate Hawkes process (MHP), a particular type of temporal point process where an event in one dimension can affect future arrivals in other dimensions. It has been shown that learning the excitation matrix of an MHP encodes the causal structure between the processes, both in terms of Granger causality Eichler et al. (2017) and directed information Etesami et al. (2016). Most studies focus on the scalability of MHPs to large datasets. However, in many applications, data can be very expensive to collect, or simply not available. For example, in economic and public health studies, collecting survey data is usually an expensive process. Similarly, in the case of epidemic modeling, it is critical to learn as fast as possible the patterns of diffusion of a spreading disease. As a result, the amount of data available is intrinsically limited. MHPs are known to be sensitive to the amount of data used for training, and the excitation patterns learned by MHPs from short sequences can be unreliable Xu et al. (2017). In such settings, the likelihood becomes noisy and regularization becomes critical. Nevertheless, as most hyper-parameter tuning algorithms such as grid search, random search, and even Bayesian optimization become challenging when the number of hyper-parameters is large, state-of-the-art methods only parameterize regularizers by a single shared hyper-parameter, hence limiting the power of representation of the model.

In this work, we address both the small data and hyper-parameter tuning issues by considering the parameters of the model as latent variables and by developing an efficient algorithm based on variational expectation-maximization. By estimating the evidence – rather than the likelihood – the proposed approach is able to optimize, with minimal computational complexity, over an extended set of hyper-parameters. Our approach is also able to take into account the uncertainty in the model parameters by fitting a posterior distribution over them. Therefore, rather than just providing a point estimate, this approach can provide an estimation of uncertainty on the learned causal graph. Experimental results on synthetic and real datasets show that, as a result, the proposed approach significantly outperforms state-of-the-art methods under short observation sequences, and maintains the same performance in the large-data regime.

2 Related Works

The most common approaches to learning the excitation matrix of MHPs are based on variants of regularized maximum-likelihood estimation (MLE). Zhou et al. (2013a) propose regularizers that enforce sparse and low-rank structures, along with an efficient algorithm based on the alternating-direction method of multipliers. To mitigate the parametric assumption, Xu et al. (2016) represent the excitation functions as a series of basis functions, and to achieve sparsity under this representation they propose a sparse group-lasso regularizer. Such estimation methods are often referred to as non-parametric as they enable more flexibility on the shape of the excitation functions Lemonnier and Vayatis (2014); Hansen et al. (2015)

. To estimate the excitation matrix without any parametric modeling, fully non-parametric approaches were developed 

Zhou et al. (2013b); Achab et al. (2017). However, these methods focus on scalability and target settings where large-scale datasets are available.

Bayesian methods go beyond the classic approach of MLE by enabling a probabilistic interpretation of the model parameters. A few studies tackled the problem of learning MHPs from a Bayesian perspective. Linderman and Adams (2014) use a Gibbs sampling-based approach, but the convergence of the proposed algorithm is slow. To tackle this problem, Linderman and Adams (2015) discretize the time, which introduces noise in the model. In a different setting where some of the events or dimensions are hidden, Linderman et al. (2017) use an expectation maximization algorithm to marginalize over the unseen part of the network.

Bayesian probabilistic models are usually intractable and require approximate inference. To address the issue, variational inference (VI) approximates the high-dimensional posterior of the probabilistic model. It recently gained interest in many applications. VI is used, to name a few, for word embedding Bamler and Mandt (2017); Barkan (2017), paragraph embedding Ji et al. (2017)

, and knowledge-graph embedding 

Bamler et al. (2019). For more details on this topic, we refer the reader to Zhang et al. (2018) and Blei et al. (2017). Variational inference has also proven to be a successful approach to learning hyper-parameters Bernardo et al. (2003); Bamler et al. (2019). Building on recent advances in variational inference, we develop in this work a variational expectation-maximization algorithm by interpreting the parameters of an MHP as latent variables of a probabilistic model.

3 Preliminary Definitions

3.1 Multivariate Hawkes Processes

Formally, a -dimensional MHP is a collection of univariate counting processes , , whose realization over an observation period consists of a sequence of discrete events , where is the timestamp of the -th event and is its dimension. Each process has the particular form of conditional intensity function


where is the constant exogenous part of the intensity of process , and the excitation function encodes the effect of past events from dimension on future events from dimension . The larger the values of , the more likely events in dimension will trigger events in dimension . It has been shown that the excitation matrix encodes the causal structure of the MHP in terms of Granger causality, i.e., if and only if the process does not Granger-cause process  Etesami et al. (2016); Eichler et al. (2017).

Most of the literature uses a parametric form for the excitation functions. The most popular form is the exponential excitation function


However, in most applications the excitation patterns are unknown and this form might be too restrictive. Hence, to alleviate the assumption of a particular form for the excitation function, other approaches Lemonnier and Vayatis (2014); Hansen et al. (2015); Xu et al. (2016) over-parameterize the space and encode the excitation functions as a linear combination of basis functions as


where the basis functions are often exponential or Gaussian kernels Xu et al. (2016). This kind of approach is generally referred to as non-parametric. In the experiments of Section 5, we investigate the performance of both parametric and non-parametric approaches to learning MHPs from small sequences of observations. We denote the set of exogenous rates by and the excitation matrix by .

3.2 Maximum Likelihood Estimation

Suppose that we observe a sequence of discrete events over an observation period . The most common approach to learning the parameters of an MHP given is to perform a regularized maximum-likelihood estimation Zhou et al. (2013a); Bacry et al. (2015a); Xu et al. (2016), which amounts to minimizing an objective function that is the sum of the negative log-likelihood and of a penalty term that induces some desired structural properties. Specifically, the objective is to solve the optimization problem


where the log-likelihood of the parameters is given by


The particular choice of penalty , along with the single hyper-parameter controlling its influence, depends on the problem at hand. For example, a necessary condition to ensure that the learned model is stable is that and that the spectral radius of the excitation matrix is less than  Daley and Vere-Jones (2003). Hence, a common penalty used is


with or in Zhou et al. (2013b, a); Xu et al. (2016). Another common assumption is that the graph is sparse. In this case, a Group-Lasso penalty of the form


is commonly used to enforce sparsity in the excitation functions Xu et al. (2016).

Small data amplifies the danger of overfitting; hence the choice of regularizers and their hyper-parameters becomes essential. Nevertheless, to control the influence of the penalty in (4), all state-of-the-art methods are limited by the use of a single shared hyper-parameter . Ideally, we would have a different hyper-parameter to independently control the effect of the penalty on each parameter of the model. However, the number of parameters, i.e., , grows quadratically with the dimension of the problem . To make matters worse, the most common approaches used to fine-tune the choice of hyper-parameters, i.e., grid search and random search, become computationally prohibitive when the number of hyper-parameters becomes large. Indeed, the search space exponentially increases with the number of hyper-parameters. Another approach is to use Bayesian optimization of hyper-parameters, but the cost of doing this also becomes prohibitive as the number of samples required to learn the landscape of cost function exponentially increases with the number of hyper-parameters Snoek et al. (2012). We describe the details of our proposed approach in the next section.

4 Proposed Learning Approach

We now introduce the proposed approach for learning MHPs. The approach enables us to use different hyper-parameters for each model parameter and efficiently tune them all by taking into account parameter uncertainty. It is based on the variational expectation-maximization (EM) algorithm and jointly optimizes both the model parameters and , as well as the hyper-parameters .

First, we can view regularized MLE as a maximum a posteriori (MAP) estimator of the model where parameters are considered as latent variables. Under this interpretation, regularizers on the model parameters correspond to unnormalized priors on the latent variables. The optimization problem becomes


Therefore, having a better regularizer means having a better prior. In the presence of a long sequence of observations, we want the prior to be as uninformative as possible (i.e., a smaller regularization) as we have access to enough information for the MLE to accurately estimate the parameters of the model. But in the case where we only observe short sequences, we want to use more informative priors to avoid overfitting (i.e., a larger regularization).

Unfortunately, the MAP estimator cannot adjust the influence of the prior by optimizing over . Indeed, the cost function in (8) is unbounded from above and solving Equation (8) with respect to leads trivially to a divergent solution . To address this issue, we can take a Bayesian approach, integrate out parameters and optimize the evidence (or marginal likelihood) instead of the log-likelihood. Such an approach changes the optimization problem of Equation (8) into


Unlike the MAP objective function, maximizing the evidence over

does not lead to a degenerate solution because it is upper bounded by the likelihood. However, this optimization problem can be solved only for simple models where the integral has a closed form, which requires a conjugate prior to the likelihood. Therefore, we use variational inference to estimate the evidence and develop a variational EM algorithm to optimize our objective with respect to


4.1 Variational Expectation-Maximization for Multivariate Hawkes Processes

Variational inference.

The derivation of the variational objective is as follows. First postulate a variational distribution , parameterized by the variational parameters , approximating the posterior . The variational parameters

are chosen such that the Kullback–Leibler divergence between the true posterior

and the variational distribution is minimized. It is known that minimizing is equivalent to maximizing the evidence lower-bound (ELBO) Blei et al. (2017); Zhang et al. (2018) defined as


By invoking Jensen’s inequality on the integral , we obtain the desired lower bound on the evidence where, by maximizing with respect to , the bound becomes tighter.

For simplicity, we adopt the mean-field assumption by choosing a variational distribution that factorizes over the latent variables111This assumption can be relaxed using more advanced techniques at the cost of having a higher computational complexity. . As the parameters and

of an MHP are non-negative, a good candidate to approximate the posterior is a log-normal distribution. We define the variational parameters

as the mean and the standard deviation of

. We denote the standard deviation by because, we optimize its log to naturally ensure its positivity and the stability of the optimization procedure. Although we present our learning approach for the log-normal distribution, it is easily generalizable to other distributions.

Variational EM algorithm.

In order to efficiently optimize the ELBO with respect to both the variational parameters and the hyper-parameters , we use the variational EM algorithm that iterates over the two following steps: The E-step maximizes the ELBO with respect to the variational parameters in order to get a tighter lower-bound on the evidence; and the M-step updates the hyper-parameters with a closed form update. Details of the two steps are as follows.

The E-step maximizes the ELBO with respect to the variational parameters to make the variational distribution close to the exact posterior and to ensure that the ELBO is a good proxy for the evidence. To evaluate the ELBO, we use the black-box variational-inference optimization in Kingma and Welling (2014); Rezende et al. (2014). Re-parameterize the model as

where (resp. ) has the same shape as (resp. ), with each element following a normal distribution . denotes the element-wise product. This trick enables us to rewrite the first intractable expectation term of the ELBO in (10) as


The second term of the ELBO in (10) is the entropy of the log-normal distribution that can be expressed, up to a constant, as . The ELBO then can be estimated by Monte-Carlo integration as


where is the number of Monte-Carlo samples . Note that the first term of (12) is the cost function for the MAP problem (8) evaluated at for . Hence, the E-step summarizes into maximizing the right-hand side of (12) with respect to using gradient descent.

In the M-step, the ELBO is used as a proxy for the evidence and is maximized with respect to the hyper-parameters

. Again, we rely on the re-parameterization technique and compute the unbiased estimate of the ELBO in (

12). The maximum of the estimate (12) with respect to has a closed form that depends on the choice of prior. We provide the closed-form solutions in Appendix A for a few proposed priors that emulate common regularizers. To avoid fast changes in

due to the variance of the Monte-Carlo integration, we take an update similar to the one in 

Bamler et al. (2019) and take a weighted average between the current estimate and the minimizer of the current Monte-Carlo estimate of the ELBO as


where is the momentum term.

Algorithm 1 summarizes the proposed variational EM approach.222Source code is available publicly. The computational complexity of the inner-most loop of Algorithm 1 is times the complexity of an iteration of gradient descent on the log-likelihood. However, as observed by recent studies in variational inference, using is usually sufficient in many applications Kingma and Welling (2014). Hence, we use in all our experiments, leading to the same computational complexity per-iteration as MLE using gradient descent.

1: Sequence of observations . Initial values for and . Momentum term . Sample size of Monte-Carlo integrations. Number of iterations and of E-steps and EM-steps. Learning rate .
2:for  do
3:     for  do E step
4:         Sample Gaussian noise
5:         Evaluate the ELBO using Equation (12)
6:         Update .
7:         Update .
8:     end for
9:     Sample Gaussian noise . M step
10:     Update using Equation (13).
11:end for
Algorithm 1 Variational EM algorithm for Multivariate Hawkes Processes

5 Experimental Results

We carry out two sets of experiments. First, we perform a link-prediction task on synthetic data to show that our approach can accurately recover the support of the excitation matrix of the MHP under short sequences. Second, we perform an event-prediction task on real datasets of short sequences to show that our approach outperforms state-of-the-art methods in terms of predictive log-likelihood.

We run our experiments in two different settings. First, in a parametric setting where the exponential form of the excitation function is known, we compare our approach (VI-EXP) to the state-of-the-art MLE-based method (MLE-ADM4) from Zhou et al. (2013a). Second, we use a non-parametric setting where no assumption is made on the shape of the excitation function. We then set the excitation function as a mixture of Gaussian kernels defined as


where and are the known location and scale of the kernel. In this setting, we compare our approach (VI-SG) to the state-of-the-art MLE-based methods (MLE-SGLP) of Xu et al. (2016) with the same 333 We also performed the experiments with other approaches designed for large-scale datasets, but their performance was below that of the reported baselines Achab et al. (2017); Linderman and Adams (2014, 2015). . Let us stress that the parametric methods have a strong advantage over the non-parametric ones because they are given the true value of the exponential decay .

As our VI approach returns a posterior on the parameters, rather than a point estimate, we use the mode of the approximate log-normal posterior as the inferred edges . For the non-parametric setting, we use . To mimic the regularization schemes of the baselines, we use a Laplacian prior for the edge weights to enforce sparsity, and we use a Gaussian prior for the baselines . We tune the hyper-parameters of the baselines using grid search444More details are provided in Appendix E..

5.1 Synthetic Data

We first evaluate the performance of our VI approach on simulated data. We generate random Erdős–Rényi graphs with

nodes and edge probability

. Then, a sequence of observations is generated from an MHP with exponential excitation functions defined in (2) with exponential decay . The baselines are sampled independently in , and the edge weights are sampled independently in . Results are averaged over graphs with simulations each. For reproducibility, a detailed description of the experimental setup is provided in Appendix E.

To investigate if the support of the excitation matrix can be accurately recovered under small data, we evaluate the performance of each approach on three metrics Zhou et al. (2013b); Xu et al. (2016); Figueiredo et al. (2018).

  • [leftmargin=*]

  • F1-score. We zero-out small weights using a threshold and measure the F1-score of the resulting binary edge classification problem555Additional results with varying thresholds are provided in Appendix D..

  • Precision@k. Instead of thresholding, we also report the precision@k defined by the average fraction of correctly identified edges in the top largest estimated weights. Since the proposed VI approach gives an estimate of uncertainty via the variance of the posterior, we select the edges with high weights and low uncertainty, i.e., the edges with ratio of lowest standard deviation over weight .

  • Relative error. To evaluate the distance of the estimated weights to the ground truth ones, we use the averaged relative error defined as when , and when . This metric is more sensitive to errors in small weights , and therefore penalizes false positive over false negative errors.

Figure 1: Performance measured by (a) F1-Score, (b) Precision@20, and (c) Relative error with respect to the number of training samples. Our VI approaches are shown in solid lines. The non-parametric methods are highlighted with square markers. Results are averaged over random graphs with simulations each ( standard deviation).

We investigate the sensitivity of each approach to the amount of data available for training by varying the size of the training set from to events, i.e., to events per node. Results are shown in Figure 1. Our approach improves the results in both parametric and non-parametric settings for all metrics. The improvements are more substantial in the non-parametric setting. If the accuracy of the top edges is similar for both VI-SG and MLE-SGLP in terms of precision@20, VI-SG improves the F1-score by about with training events. The reason for this improvement is that MLE-SGLP has a much higher false positive rate, which is hurting the F1-score but does not affect the precision@20. VI-SG is also able to reach the same F1-score as the parametric baseline MLE-ADM4 with only training events666We present additional results with various thresholds and in Appendix D.. Note that VI-SG is optimizing hyper-parameters with minimal additional cost.

Figure 2: Analysis of the robustness of non-parametric approaches to the number of bases of excitation functions (for fixed ).
Figure 3: Analysis of the uncertainty of the parameters learned by VI-EXP (for fixed ). (a) Uncertainty of the inferred edges and (b) histogram of learned . The learned are smaller for non-edges, and false positive edges have higher uncertainty than the true positive ones.

In the next experiment, we focus on the non-parametric setting, we fix the length of observation to and study the effect of increasing on the performance of the algorithms. The results are shown in Figure 2. We see that our approach is more robust to the choice of than MLE-SGLP. A possible explanation for this behavior is that MLE-SGLP overfits due to the increasing number of model parameters.

Finally, we investigate the parameters of the model learned by our VI-EXP approach. In Figure 3a, we use the variance of the approximated posterior as a measure of confidence for edge identification, and we report the distribution of ratio of standard deviation over weight for both the true and false positive edges. Similar results hold between the true and false negative edges. The false positive edges have a higher uncertainty than the true positive ones. This is relevant when we cannot identify all edges due to lack of data, even though we still wish to identify a subset of edges with high confidence. In addition, Figure 3b confirms that, as expected, the optimized weight priors are much larger for true edges in the ground-truth excitation matrix than for non-edges.

5.2 Real Data

We also evaluate the performance of our approach on the following three small datasets:

  1. [leftmargin=*]

  2. Epidemics. This dataset contains records of infection of individuals, along with their corresponding district of residence, during the last Ebola epidemic in West Africa in 2014-2015 Garske et al. (2017). To learn the propagation network of the epidemics, we consider the 54 districts as processes and define infection records as events.

  3. Stock market. This dataset contains the stock prices of high-tech companies sampled every minutes on the New York Stock Exchange for days in April 2008 Etesami et al. (2016). We consider each stock as a process and record an event every time a stock price changes by from its current value.

  4. Enron email. This dataset contains emails between employees of Enron from the Enron corpus. We consider all employees with more than received emails as processes and record an event every time an employee receives an email.

We perform an event-prediction task to show that our approach outperforms the state-of-the-art methods in terms of predictive log-likelihood. To do so, we use the first 70% events as training set, and we compute the held-out averaged log-likelihood on the remaining 30%. We present the results in Table 1.

We first see that the non-parametric methods outperform the parametric ones on both the Epidemic dataset and the Stock market dataset. This suggests that the exponential excitation function might be too restrictive to fit their excitation patterns. In addition, our non-parametric approach VI-SG significantly outperforms MLE-SGLP on all datasets. The improvement is particularly clear for the Epidemic dataset, which has the smallest number of events per dimension. Indeed, the top edges learned by VI-SG correspond to contiguous districts as expected. This is not the case for MLE-SGLP, for which the top learned edges correspond to districts that are far from each other.

Dataset Statistics Averaged predictive log-likelihood
#dim () #events () VI-SG MLE-SGLP VI-EXP MLE-ADM4
Stock market
Enron email
Table 1: Predictive log-likelihood for the models learned on various real datasets.

6 Conclusion

We proposed a novel approach to learn the excitation matrix of a multivariate Hawkes process in the presence of short observation sequences. We observed that state-of-the-art methods are sensitive to the amount of data used for training and showed that the proposed approach outperforms these methods when only short training sequences are available. The common tool to tackle this problem is to design smarter regularization schemes. However, all maximum likelihood-based methods suffer from a common problem: all the model parameters are regularized equally with a few hyper-parameters. We developed a variational expectation maximization algorithm that is able to (1) optimize over an extended set of hyper-parameters, with almost no additional cost and (2) take into account the uncertainty of the learned model parameters by fitting a posterior distribution over them. We performed experiments on both synthetic and real datasets and showed that our approach outperforms state-of-the-art methods under small-data regimes.


We would like to thank Negar Kiyavash and Jalal Etesami for the valuable discussions and insightful feedback at the early stage of this work. The work presented in this paper was supported in part by the Swiss National Science Foundation under grant number 200021-182407.


  • M. Achab, E. Bacry, S. Gaïffas, I. Mastromatteo, and J. Muzy (2017) Uncovering causality from multivariate Hawkes integrated cumulants.

    The Journal of Machine Learning Research

    18 (1), pp. 6998–7025.
    Cited by: §2, footnote 3.
  • A. Arnold, Y. Liu, and N. Abe (2007) Temporal causal modeling with graphical granger methods. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’07, New York, NY, USA, pp. 66–75. External Links: ISBN 978-1-59593-609-7, Link, Document Cited by: §1.
  • E. Bacry, S. Gaïffas, and J. Muzy (2015a) A generalization error bound for sparse and low-rank multivariate Hawkes processes. arXiv preprint arXiv:1501.00725. External Links: 1501.00725 Cited by: §3.2.
  • E. Bacry, I. Mastromatteo, and J. Muzy (2015b) Hawkes processes in finance. Market Microstructure and Liquidity 1 (01), pp. 1550005. Cited by: §1.
  • R. Bamler and S. Mandt (2017) Dynamic word embeddings. In Proceedings of the 34th international conference on Machine learning, Cited by: §2.
  • R. Bamler, F. Salehi, and S. Mandt (2019) Augmenting and tuning knowledge graph embeddings. In

    Proceedings of the Conference on Uncertainty in Artificial Intelligence

    Cited by: §2, §4.1.
  • O. Barkan (2017) Bayesian neural word embedding.. In AAAI, pp. 3135–3143. Cited by: §2.
  • J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith, M. West, et al. (2003) The variational bayesian em algorithm for incomplete data: with application to scoring graphical model structures. Bayesian statistics 7, pp. 453–464. Cited by: §2.
  • D. M. Blei, A. Kucukelbir, and J. D. McAuliffe (2017) Variational inference: a review for statisticians. Journal of the American Statistical Association 112 (518), pp. 859–877. Cited by: §2, §4.1.
  • D. J. Daley and D. Vere-Jones (2003) An introduction to the theory of point processes. Vol. I. Second edition, Probability and its Applications (New York), Springer-Verlag, New York. Note: Elementary theory and methods External Links: ISBN 0-387-95541-0, MathReview (Volker Schmidt) Cited by: §3.2.
  • M. Eichler, R. Dahlhaus, and J. Dueck (2017) Graphical modeling for multivariate Hawkes processes with nonparametric link functions. Journal of Time Series Analysis 38 (2), pp. 225–242. Cited by: §1, §3.1.
  • M. Eichler (2012) Graphical modelling of multivariate time series. Probability Theory and Related Fields 153 (1), pp. 233–268. External Links: ISSN 1432-2064, Document, Link Cited by: §1.
  • J. Etesami, N. Kiyavash, K. Zhang, and K. Singhal (2016) Learning network of multivariate Hawkes processes: a time series approach. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI’16, Arlington, Virginia, United States, pp. 162–171. External Links: ISBN 978-0-9966431-1-5, Link Cited by: §1, §3.1, item 2.
  • F. Figueiredo, G. Borges, P. O. S. V. de Melo, and R. Assunção (2018) Fast estimation of causal interactions using Wold processes. In Proceedings of the 32Nd International Conference on Neural Information Processing Systems, NIPS’18, USA, pp. 2975–2986. External Links: Link Cited by: §5.1.
  • T. Garske, A. Cori, A. Ariyarajah, I. M. Blake, I. Dorigatti, T. Eckmanns, C. Fraser, W. Hinsley, T. Jombart, H. L. Mills, G. Nedjati-Gilani, E. Newton, P. Nouvellet, D. Perkins, S. Riley, D. Schumacher, A. Shah, M. D. V. Kerkhove, C. Dye, N. M. Ferguson, and C. A. Donnelly (2017) Heterogeneities in the case fatality ratio in the West African ebola outbreak 2013-2016. Philosophical Transactions of the Royal Society B: Biological Sciences 372 (1721), pp. 20160308. External Links: Document, Link, https://royalsocietypublishing.org/doi/pdf/10.1098/rstb.2016.0308 Cited by: §1, item 1.
  • N. R. Hansen, P. Reynaud-Bouret, and V. Rivoirard (2015) Lasso and probabilistic inequalities for multivariate point processes. Bernoulli 21 (1), pp. 83–143. Note: 61 pages External Links: Link, Document Cited by: §2, §3.1.
  • G. Ji, R. Bamler, E. B. Sudderth, and S. Mandt (2017)

    Bayesian paragraph vectors


    Symposium on Advances in Approximate Bayesian Inference

    Cited by: §2.
  • D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In International Conference on Learning Representations (ICLR), Cited by: §4.1, §4.1.
  • R. Lemonnier and N. Vayatis (2014) Nonparametric markovian learning of triggering kernels for mutually exciting and mutually inhibiting multivariate Hawkes processes. In Machine Learning and Knowledge Discovery in Databases, T. Calders, F. Esposito, E. Hüllermeier, and R. Meo (Eds.), Berlin, Heidelberg, pp. 161–176. Cited by: §2, §3.1.
  • S. Linderman and R. Adams (2014) Discovering latent network structure in point process data. In International Conference on Machine Learning, pp. 1413–1421. Cited by: §2, footnote 3.
  • S. W. Linderman and R. P. Adams (2015) Scalable Bayesian inference for excitatory point process networks. arXiv preprint arXiv:1507.03228. Cited by: §2, footnote 3.
  • S. W. Linderman, Y. Wang, and D. M. Blei (2017) Bayesian inference for latent Hawkes processes. NeurIPS Symposium on Advances in Approximate Bayesian Inference Probabilistic. Cited by: §2.
  • C. J. Quinn, N. Kiyavash, and T. P. Coleman (2015) Directed information graphs. IEEE Transactions on Information Theory 61 (12), pp. 6887–6909. External Links: Document, ISSN 0018-9448 Cited by: §1.
  • A. Rao, A. O. Hero, D. J. States, and J. D. Engel (2008) USING directed information to build biologically relevant influence networks. Journal of Bioinformatics and Computational Biology 06 (03), pp. 493–519. External Links: Document Cited by: §1.
  • D. J. Rezende, S. Mohamed, and D. Wierstra (2014)

    Stochastic backpropagation and approximate inference in deep generative models

    In Proceedings of the 31st International Conference on Machine Learning, Cited by: §4.1.
  • T. Schreiber (2000) Measuring information transfer. Phys. Rev. Lett. 85, pp. 461–464. External Links: Document, Link Cited by: §1.
  • J. Snoek, H. Larochelle, and R. P. Adams (2012) Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959. Cited by: §3.2.
  • H. Xu, M. Farajtabar, and H. Zha (2016) Learning granger causality for Hawkes processes. In Proceedings of The 33rd International Conference on Machine Learning, pp. 1717–1726. External Links: Link Cited by: 4th item, §2, §3.1, §3.2, §5.1, §5.
  • H. Xu, D. Luo, and H. Zha (2017) Learning Hawkes processes from short doubly-censored event sequences. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp. 3831–3840. External Links: Link Cited by: §1.
  • C. Zhang, J. Butepage, H. Kjellstrom, and S. Mandt (2018) Advances in variational inference. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2, §4.1.
  • K. Zhou, H. Zha, and L. Song (2013a) Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes.. In AISTATS, JMLR Workshop and Conference Proceedings, Vol. 31, pp. 641–649. Cited by: 3rd item, §2, §3.2, §5.
  • K. Zhou, H. Zha, and L. Song (2013b) Learning triggering kernels for multi-dimensional Hawkes processes. In International Conference on Machine Learning, Vol. 28, pp. 1301–1309. Cited by: §2, §3.2, §5.1.

Appendix A Different Priors

In this section, we provide the probabilistic interpretation as a prior of several regularizers commonly used in the literature.

  • [leftmargin=*]

  • -regularizer: Perhaps the most commonly used regularizer in MHPs is the -regularizer .

    -regularizer can be interpreted as a zero-mean Gaussian distribution over the weights,


    where is the variance.

  • -regularizer: This regularizer (also known as lasso regularizer) is considered as a convex surrogate for the (pseudo) norm. Hence, it promotes the sparsity of the parameters. It can be interpreted as a Laplace distribution over the weights, i.e.,

  • Low-rank regularizer: To achieve a low-rank excitation matrix , a nuclear norm penalty on is often used as a regularizer Zhou et al. [2013a], thus enabling clustering structures in . For an excitation matrix , let , then the different are independent for different and the prior over is

    where is a constant.

  • Group-lasso regularizer: This regularizer is used in Xu et al. [2016] in the non-parametric setting defined in Section 3 where the excitation function is approximated by a linear combination of basis functions, parameterized by . In this case, the -norm of is assumed to have a Laplace distribution, i.e.,

    where is a constant.

Appendix B Hyper-parameter Update

Below, we give a closed-form solution to update in (13) for two priors used in the experiments of Section 5

. The update rules for the other priors are similar. We start by rewriting the joint distribution



where the first term (the likelihood) is not a function of and only the second term (the prior) is a function of . Hence minimizing over amounts to minimizing .


For the -regularizer we have

and the closed form update hence becomes



For the -regularizer we have

and the closed form update hence becomes


Appendix C Simple Optimization of

In this section, we show that we cannot simply find by optimizing the negative log-likelihood in (4) or the MAP objective in (8) over .

Fist note that, minimizing regularized negative log-likelihood in (4) over , simply sets to infinity.

Second, we show that maximizing the MAP objective in (8) over also fails because it is unbounded from above. We show this for the case of the Gaussian prior defined by


but the same result holds for other priors. The log of the Gaussian prior (18) is


where is a constant independent of . In the MAP objective (8), if we set and , i.e., all processes are simple Poisson process with rate 1 and no interaction between them, then the conditional intensity for all and . The log-likelihood in (5) becomes , which is bounded from below. Set , then for , we get . Hence, the MAP estimator for is unbounded from above and maximizing the MAP objective simultaneously over both the hyper-parameters and the model parameters and would fail.

Appendix D Additional Experimental Results

We first carry out an additional set of experiments to evaluate the effect of zeroing-out small weights using a threshold . To do so, we first need to introduce the following two performance metrics:

  • [leftmargin=*]

  • The false positive rate (FPR) to be the fraction of errors in learnt edges

    where denotes the cardinality of a set.

  • Similarly, the false negative rate (FNR) to be the fraction of errors in learnt non-edges

Figure 4 shows the effect of number of samples on F1-score for several choice of threshold . We see that our proposed algorithm VI-EXP (resp. VI-SG) outperform its MLE counterpart MLE-ADM4 (resp. MLE-SGLP) for all values of . With increasing , we see that the F1-score of MLE-based approaches improve. This is due to the FPR decreasing faster than the FNR increases due to the sparsity of the graph. However note that, since we do not know the expected value of true edges beforehand, it is not clear a-priori what value we should set for the threshold . Ideally, we choose the threshold to be as small as possible, which is the regime in which our variational inference algorithm outperforms MLE-based methods the most.

In Figure 5, we plot Precision@ for different values of . The number of edges in the generated synthetic graphs is 195, so in Figure 5 we vary up to 195. We see that VI-EXP always has better Precision@ than its counterpart MLE-ADM4. VI-SG has the same Precision@ for , 10, and 20 as MLE-SGLP. For larger MLE-SGLP has slightly better Precision@. Note that, Precision@ focuses only on the accuracy of top

edges learnt by an algorithm and hence does not discriminate the imbalance between precision and recall for large

in sparse graphs.

Finally, to evaluate the scalability of our approach, we ran additional simulations on increasingly large-dimensional problems. As shown in Figure 7, the per-iteration running time of our approach VI-EXP (implemented in python) scales better than the one of MLE-ADM4 (implemented in C++). In addition, even if our gradient descent algorithm requires more iterations to converge, we show in Figure 7 that VI-EXP reaches the same F1-score as MLE-ADM4 faster.

Figure 4: Performance measured by F1-Score with respect to the number of training samples. The proposed variational inference approaches are shown in solid lines. The non-parametric methods are highlighted with square markers.
Figure 5: Performance measured by Precision@ with respect to the number of training samples. The proposed variational inference approaches are shown in solid lines. The non-parametric methods are highlighted with square markers.
Figure 6: Comparison of running time per-iteration.
Figure 7: Running time required for our approach VI-EXP to reach the same F1-Score as MLE-ADM4.

Appendix E Reproducibility

In this section, we provide extensive details on the experimental setup used in Section 5. We first describe the implementation details of the algorithm described in Algorithm 1. We then provide the details of the experimental setup for both the synthetic and real data experiments.

e.1 Implementation details of Algorithm 1

We used sampled Gaussian noise in line 4 of Algorithm 1. We set the momentum term in (13). In our early experiments, we observed that the performance of the algorithm is not sensitive to the momentum term for . Therefore, we decided to set it to in all experiments. We used the Adam optimizer with learning rate . We also multiply the learning rate by at each iteration. Both and were initialized by sampling from the normal distribution . We initialized for all hyper-parameters. We observed that the performance of the algorithm is not sensitive to the initialization. Both and were initialized by sampling from the normal distribution then clipping them to be in . This initialization ensures that the initial variance of the algorithm is neither small nor too big.

e.2 Synthetic experiments

To create the synthetic data, we generated random Erdős–Rényi graphs with nodes and with edge probability , leading to graphs with edges on average. Then, the sequences of observations were generated from an MHP with the exponential excitation kernel defined in (2). The baseline were sampled uniformly at random in , and the edge weights were sampled uniformly at random in . To generate the results of Figure 1, we varied the length of observations between and . The results were averaged over graphs with simulations each.

We used tick777https://github.com/X-DataInitiative/tick Python library to run the MLE-based baseline approaches. To tune the hyper-parameters of the MLE-based approaches, we first manually searched for an initial range of parameters where the algorithm performed well. Then, we fined-tuned the hyper-parameters using grid-search to find the ones giving the best results for the Precision@20 and F1-score metrics. For MLE-SGLP, we used the grid range and lasso_grouplasso_ratio . We used the default values for the optimizer, which we checked and are sure of its convergence. We finally chose and lasso_grouplasso_ratio . For MLE-ADM4, we also used the grid range and . Making overall 21 different configurations. We finally chose and that gave the best results for Precision@20 and F1-score.

e.3 Real data experiments

For our approach VI-EXP and its parametric counterpart VI-SG, the exponential decay parameter must be tuned for each dataset. As expected, both algorithms performed best with the same decay. For our approach VI-SG and its MLE counterpart MLE-SGLP, there are two parameters to tune, and cutoff time . The center of the -th Gaussian kernel, with , is defined as and its scale is defined as in (3). After manually finding an initial range of and where algorithms performed well, we then fine-tuned them using the grid-search.

Epidemic dataset.

For our VI-SG algorithm, we did a grid-search with and . We did not see a notable difference between the performance of different grids, as long as and are large enough. We chose and . For the baseline MLE-SGLP, we did a grid-search with , and , that makes overall 352 experiments. We chose , and . For our algorithm VI-EXP, we tried and we chose . For the baseline MLE-ADM4, we did a grid-search with and . We chose and .

Stock market dataset.

In the stock market dataset, our algorithm VI-SG also performed better with a larger . As for large the experiments are slow we decided to set and did grid-search for with . For the baseline MLE-SGLP, we did a grid-search with , and . The best values found were , and . For our algorithm VI-EXP, we tried and we chose . For the baseline MLE-ADM4, we did a grid search with and . We chose and .

Enron email dataset.

The Enron email dataset is a larger dataset and experiments are more computationally intensive, so we chose smaller ranges for hyper-parameter tuning. For our algorithm VI-SG we did a grid-search with and . The best value is . For the baseline MLE-SGLP, we did a grid-search with , and . The best value is , and . For our algorithm VI-EXP, we tried and we chose . For the baseline MLE-ADM4, we did a grid-search with and . We chose and .