1 Introduction
In many real-world applications, including finance, computational biology, social-network studies, criminology, and epidemiology, it is important to gain insight from the interactions of multivariate time series of discrete events. For example, in finance, changes in the price of a stock might affect the market (Bacry et al., 2015b); and in epidemiology, individuals infected by an infectious disease might spread the disease to their neighbors (Garske et al., 2017). Such networks of time series often exhibit mutually exciting patterns of diffusion. Hence, a recurring issue is to learn, in an unsupervised way, the causal structure of interacting networks. This task is usually tackled by defining a so-called causal graph of entities, where an edge from a node $j$ to a node $i$ means that events in node $i$ depend on the history of node $j$. Such causal interactions are typically learned with either directed information (Rao et al., 2008; Quinn et al., 2015), transfer entropy (Schreiber, 2000), or Granger causality (Arnold et al., 2007; Eichler, 2012).
A widely used model for capturing mutually exciting patterns in a multidimensional time series is the multivariate Hawkes process (MHP), a particular type of temporal point process in which an event in one dimension can affect future arrivals in other dimensions. It has been shown that the learned excitation matrix of an MHP encodes the causal structure between the processes, both in terms of Granger causality (Eichler et al., 2017) and of directed information (Etesami et al., 2016). Most studies focus on the scalability of MHPs to large datasets. However, in many applications, data can be very expensive to collect, or simply not available. For example, in economic and public-health studies, collecting survey data is usually an expensive process. Similarly, in the case of epidemic modeling, it is critical to learn the patterns of diffusion of a spreading disease as fast as possible. As a result, the amount of data available is intrinsically limited. MHPs are known to be sensitive to the amount of data used for training, and the excitation patterns learned by MHPs from short sequences can be unreliable (Xu et al., 2017). In such settings, the likelihood becomes noisy and regularization becomes critical. Nevertheless, as most hyperparameter-tuning algorithms, such as grid search, random search, and even Bayesian optimization, become challenging when the number of hyperparameters is large, state-of-the-art methods parameterize regularizers by only a single shared hyperparameter, hence limiting the representational power of the model.
In this work, we address both the small-data and the hyperparameter-tuning issues by considering the parameters of the model as latent variables and by developing an efficient algorithm based on variational expectation-maximization. By estimating the evidence, rather than the likelihood, the proposed approach is able to optimize, with minimal computational overhead, over an extended set of hyperparameters. Our approach also takes into account the uncertainty in the model parameters by fitting a posterior distribution over them. Therefore, rather than just providing a point estimate, this approach can provide an estimate of the uncertainty on the learned causal graph. Experimental results on synthetic and real datasets show that, as a result, the proposed approach significantly outperforms state-of-the-art methods under short observation sequences, and maintains the same performance in the large-data regime.
2 Related Work
The most common approaches to learning the excitation matrix of MHPs are based on variants of regularized maximum-likelihood estimation (MLE). Zhou et al. (2013a) propose regularizers that enforce sparse and low-rank structures, along with an efficient algorithm based on the alternating-direction method of multipliers. To mitigate the parametric assumption, Xu et al. (2016) represent the excitation functions as a series of basis functions and, to achieve sparsity under this representation, propose a sparse group-lasso regularizer. Such estimation methods are often referred to as nonparametric, as they enable more flexibility on the shape of the excitation functions (Lemonnier and Vayatis, 2014; Hansen et al., 2015). To estimate the excitation matrix without any parametric modeling, fully nonparametric approaches were also developed (Zhou et al., 2013b; Achab et al., 2017). However, these methods focus on scalability and target settings where large-scale datasets are available.

Bayesian methods go beyond the classic approach of MLE by enabling a probabilistic interpretation of the model parameters. A few studies have tackled the problem of learning MHPs from a Bayesian perspective. Linderman and Adams (2014) use a Gibbs-sampling-based approach, but the convergence of the proposed algorithm is slow. To tackle this problem, Linderman and Adams (2015) discretize time, which introduces noise into the model. In a different setting, where some of the events or dimensions are hidden, Linderman et al. (2017) use an expectation-maximization algorithm to marginalize over the unseen part of the network.
Bayesian probabilistic models are usually intractable and require approximate inference. To address this issue, variational inference (VI) approximates the high-dimensional posterior of the probabilistic model. It has recently gained interest in many applications: VI is used, to name a few, for word embeddings (Bamler and Mandt, 2017; Barkan, 2017), paragraph embeddings (Ji et al., 2017), and knowledge-graph embeddings (Bamler et al., 2019). For more details on this topic, we refer the reader to Zhang et al. (2018) and Blei et al. (2017). Variational inference has also proven to be a successful approach to learning hyperparameters (Bernardo et al., 2003; Bamler et al., 2019). Building on recent advances in variational inference, we develop in this work a variational expectation-maximization algorithm by interpreting the parameters of an MHP as latent variables of a probabilistic model.

3 Preliminary Definitions
3.1 Multivariate Hawkes Processes
Formally, a $D$-dimensional MHP is a collection of $D$ univariate counting processes $N^i(t)$, $i = 1, \ldots, D$, whose realization over an observation period $[0, T]$ consists of a sequence of discrete events $\{(t_k, d_k)\}_{k=1}^{N}$, where $t_k \in [0, T]$ is the timestamp of the $k$-th event and $d_k \in \{1, \ldots, D\}$ is its dimension. Each process $N^i(t)$ has a conditional intensity function of the particular form

\[ \lambda_i(t \mid \mathcal{H}_t) \;=\; \mu_i \;+\; \sum_{j=1}^{D} \; \sum_{k \,:\, t_k < t,\; d_k = j} \phi_{ij}(t - t_k), \tag{1} \]

where $\mu_i > 0$ is the constant exogenous part of the intensity of process $i$, and the excitation function $\phi_{ij}(t) \geq 0$ encodes the effect of past events from dimension $j$ on future events from dimension $i$. The larger the values of $\phi_{ij}$, the more likely events in dimension $j$ will trigger events in dimension $i$. It has been shown that the excitation matrix $[\phi_{ij}]_{i,j=1}^{D}$ encodes the causal structure of the MHP in terms of Granger causality, i.e., $\phi_{ij} \equiv 0$ if and only if the process $N^j$ does not Granger-cause the process $N^i$ (Etesami et al., 2016; Eichler et al., 2017).
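As an illustration, the conditional intensity in (1) with exponential kernels can be evaluated directly from an event history. The following sketch uses our own notation (`mu` for the exogenous rates, `W` for the excitation weights, `beta` for the decay) and is not the paper's implementation:

```python
import numpy as np

def intensity(t, events, mu, W, beta):
    """Conditional intensity of Eq. (1) with exponential kernels:
    lambda_i(t) = mu_i + sum_j sum_{t_k < t, d_k = j} w_ij * beta * exp(-beta (t - t_k)).

    events: list of (timestamp, dimension) pairs.
    mu: (D,) exogenous rates; W: (D, D) excitation weights, W[i, j] = w_ij.
    Returns the (D,) vector of intensities at time t.
    """
    lam = mu.copy()
    for t_k, j in events:
        if t_k < t:  # only past events contribute
            lam += W[:, j] * beta * np.exp(-beta * (t - t_k))
    return lam
```

With `W[0, 1] > 0`, an event in dimension 1 raises the intensity of dimension 0, and the effect decays exponentially with the elapsed time.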
Most of the literature uses a parametric form for the excitation functions. The most popular form is the exponential excitation function

\[ \phi_{ij}(t) \;=\; w_{ij}\, \beta e^{-\beta t}\, \mathbb{1}\{t \geq 0\}. \tag{2} \]

However, in most applications the excitation patterns are unknown, and this form might be too restrictive. Hence, to alleviate the assumption of a particular form for the excitation function, other approaches (Lemonnier and Vayatis, 2014; Hansen et al., 2015; Xu et al., 2016) over-parameterize the space and encode the excitation functions as a linear combination of basis functions, as

\[ \phi_{ij}(t) \;=\; \sum_{b=1}^{B} w_{ij}^{b}\, \kappa_b(t), \tag{3} \]
where the basis functions $\kappa_b$ are often exponential or Gaussian kernels (Xu et al., 2016). This kind of approach is generally referred to as nonparametric. In the experiments of Section 5, we investigate the performance of both parametric and nonparametric approaches to learning MHPs from small sequences of observations. We denote the set of exogenous rates by $\mu = (\mu_1, \ldots, \mu_D)$ and the excitation matrix by $W = [w_{ij}]$.
3.2 Maximum Likelihood Estimation
Suppose that we observe a sequence of discrete events $\mathcal{D} = \{(t_k, d_k)\}_{k=1}^{N}$ over an observation period $[0, T]$. The most common approach to learning the parameters $\theta = (\mu, W)$ of an MHP given $\mathcal{D}$ is to perform regularized maximum-likelihood estimation (Zhou et al., 2013a; Bacry et al., 2015a; Xu et al., 2016), which amounts to minimizing an objective function that is the sum of the negative log-likelihood and of a penalty term that induces some desired structural properties. Specifically, the objective is to solve the optimization problem

\[ \min_{\theta \,\geq\, 0} \; -\log p(\mathcal{D} \mid \theta) + \lambda\, \mathcal{R}(\theta), \tag{4} \]

where the log-likelihood of the parameters is given by

\[ \log p(\mathcal{D} \mid \theta) \;=\; \sum_{k=1}^{N} \log \lambda_{d_k}(t_k \mid \mathcal{H}_{t_k}) \;-\; \sum_{i=1}^{D} \int_{0}^{T} \lambda_i(s \mid \mathcal{H}_s)\, ds. \tag{5} \]
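For the exponential kernel, the compensator integral in the log-likelihood (5) has a closed form, which the following sketch exploits. Notation and function names are ours; this is an illustrative implementation under the exponential-kernel assumption, not the authors' code:

```python
import numpy as np

def log_likelihood(events, T, mu, W, beta):
    """Log-likelihood of Eq. (5) for an exponential MHP (illustrative sketch).

    ll = sum_k log lambda_{d_k}(t_k) - sum_i int_0^T lambda_i(s) ds,
    where the integral (compensator) of the exponential kernel is
    mu_i * T + sum_k w_{i, d_k} * (1 - exp(-beta * (T - t_k))).
    events must be sorted by timestamp.
    """
    ll = 0.0
    for idx, (t_k, d_k) in enumerate(events):
        lam = mu[d_k]
        for t_l, d_l in events[:idx]:  # contribution of strictly earlier events
            lam += W[d_k, d_l] * beta * np.exp(-beta * (t_k - t_l))
        ll += np.log(lam)
    comp = mu.sum() * T
    for t_k, d_k in events:
        comp += W[:, d_k].sum() * (1.0 - np.exp(-beta * (T - t_k)))
    return ll - comp
```

When `W` is zero, this reduces to the log-likelihood of independent homogeneous Poisson processes, which is a useful sanity check.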
The particular choice of penalty $\mathcal{R}$, along with the single hyperparameter $\lambda$ controlling its influence, depends on the problem at hand. For example, a necessary condition to ensure that the learned model is stable is that $\mu_i > 0$ for all $i$ and that the spectral radius of the excitation matrix is less than one (Daley and Vere-Jones, 2003). Hence, a commonly used penalty is

\[ \mathcal{R}(\theta) \;=\; \|W\|_p, \tag{6} \]

with $p = 1$ or $p = 2$ (Zhou et al., 2013b,a; Xu et al., 2016). Another common assumption is that the graph is sparse. In this case, a group-lasso penalty of the form

\[ \mathcal{R}(\theta) \;=\; \sum_{i,j} \|\mathbf{w}_{ij}\|_2, \qquad \mathbf{w}_{ij} = (w_{ij}^{1}, \ldots, w_{ij}^{B}), \tag{7} \]

is commonly used to enforce sparsity in the excitation functions (Xu et al., 2016).
Small data amplifies the danger of overfitting; hence the choice of regularizers and of their hyperparameters becomes essential. Nevertheless, to control the influence of the penalty in (4), all state-of-the-art methods are limited to a single shared hyperparameter $\lambda$. Ideally, we would have a different hyperparameter to independently control the effect of the penalty on each parameter of the model. However, the number of parameters grows quadratically with the dimension $D$ of the problem. To make matters worse, the most common approaches used to fine-tune the choice of hyperparameters, i.e., grid search and random search, become computationally prohibitive when the number of hyperparameters is large: the search space grows exponentially with the number of hyperparameters. Another approach is Bayesian optimization of the hyperparameters, but its cost also becomes prohibitive, as the number of samples required to learn the landscape of the cost function increases exponentially with the number of hyperparameters (Snoek et al., 2012). We describe the details of our proposed approach in the next section.
4 Proposed Learning Approach
We now introduce the proposed approach for learning MHPs. The approach enables us to use a different hyperparameter for each model parameter and to efficiently tune them all while taking parameter uncertainty into account. It is based on the variational expectation-maximization (EM) algorithm and jointly optimizes both the model parameters $\mu$ and $W$ and the hyperparameters $\eta$.
First, we can view regularized MLE as maximum a posteriori (MAP) estimation in a model where the parameters $\theta = (\mu, W)$ are considered as latent variables. Under this interpretation, regularizers on the model parameters correspond to (possibly unnormalized) priors $p(\theta \mid \eta)$ on the latent variables, parameterized by hyperparameters $\eta$. The optimization problem becomes

\[ \max_{\theta} \; \log p(\mathcal{D} \mid \theta) + \log p(\theta \mid \eta). \tag{8} \]
Therefore, having a better regularizer means having a better prior. In the presence of a long sequence of observations, we want the prior to be as uninformative as possible (i.e., a smaller regularization) as we have access to enough information for the MLE to accurately estimate the parameters of the model. But in the case where we only observe short sequences, we want to use more informative priors to avoid overfitting (i.e., a larger regularization).
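To make this correspondence concrete, consider, as an illustration in our notation, a zero-mean Gaussian prior on a single weight $w$ with variance $\eta$; its log recovers the classic $\ell_2$ penalty, with $\eta$ playing the role of an inverse regularization strength (small $\eta$ means an informative prior and heavy regularization):

```latex
\log p(w \mid \eta) = -\frac{w^2}{2\eta} - \frac{1}{2}\log(2\pi\eta)
\quad\Longrightarrow\quad
\arg\max_{w}\, \big[\log p(\mathcal{D} \mid w) + \log p(w \mid \eta)\big]
= \arg\min_{w}\, \Big[-\log p(\mathcal{D} \mid w) + \tfrac{1}{2\eta}\, w^2\Big].
```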
Unfortunately, the MAP estimator cannot adjust the influence of the prior by optimizing over $\eta$. Indeed, the cost function in (8) is unbounded from above, and solving Equation (8) with respect to $\eta$ trivially leads to a degenerate solution (see Appendix C). To address this issue, we can take a Bayesian approach: integrate out the parameters and optimize the evidence (or marginal likelihood) instead of the log-likelihood. Such an approach changes the optimization problem of Equation (8) into

\[ \max_{\eta} \; \log p(\mathcal{D} \mid \eta) \;=\; \log \int p(\mathcal{D} \mid \theta)\, p(\theta \mid \eta)\, d\theta. \tag{9} \]
Unlike the MAP objective function, maximizing the evidence over $\eta$ does not lead to a degenerate solution, because the evidence is upper bounded by the maximum of the likelihood. However, this optimization problem can be solved exactly only for simple models where the integral has a closed form, which requires a prior conjugate to the likelihood. Therefore, we use variational inference to estimate the evidence and develop a variational EM algorithm to optimize our objective with respect to $\eta$.

4.1 Variational Expectation-Maximization for Multivariate Hawkes Processes
Variational inference.
The derivation of the variational objective is as follows. We first postulate a variational distribution $q(\theta; \phi)$, parameterized by the variational parameters $\phi$, that approximates the posterior $p(\theta \mid \mathcal{D}, \eta)$. The variational parameters $\phi$ are chosen such that the Kullback–Leibler divergence between the true posterior and the variational distribution is minimized. It is known that minimizing $\mathrm{KL}\big(q(\theta; \phi)\,\|\,p(\theta \mid \mathcal{D}, \eta)\big)$ is equivalent to maximizing the evidence lower bound (ELBO) (Blei et al., 2017; Zhang et al., 2018), defined as

\[ \mathcal{L}(\phi, \eta) \;=\; \mathbb{E}_{q}\!\left[ \log p(\mathcal{D}, \theta \mid \eta) \right] \;-\; \mathbb{E}_{q}\!\left[ \log q(\theta; \phi) \right]. \tag{10} \]

By invoking Jensen's inequality on the integral in (9), we obtain the desired lower bound on the evidence, $\log p(\mathcal{D} \mid \eta) \geq \mathcal{L}(\phi, \eta)$, where, by maximizing with respect to $\phi$, the bound becomes tighter.
For simplicity, we adopt the mean-field assumption by choosing a variational distribution that factorizes over the latent variables. (This assumption can be relaxed using more advanced techniques, at the cost of a higher computational complexity.) As the parameters $\mu$ and $W$ of an MHP are non-negative, a good candidate to approximate the posterior is a log-normal distribution. We define the variational parameters $\phi = (m, \log \sigma)$ as the mean and the standard deviation of the underlying normal distribution. We parameterize the standard deviation through its logarithm because optimizing $\log \sigma$ naturally ensures its positivity and the stability of the optimization procedure. Although we present our learning approach for the log-normal distribution, it is easily generalizable to other distributions.

Variational EM algorithm.
In order to efficiently optimize the ELBO with respect to both the variational parameters $\phi$ and the hyperparameters $\eta$, we use the variational EM algorithm, which iterates over the two following steps: the E-step maximizes the ELBO with respect to the variational parameters in order to obtain a tighter lower bound on the evidence, and the M-step updates the hyperparameters with a closed-form update. The details of the two steps are as follows.
The E-step maximizes the ELBO with respect to the variational parameters $\phi$ to make the variational distribution close to the exact posterior and to ensure that the ELBO is a good proxy for the evidence. To evaluate the ELBO, we use the black-box variational-inference optimization of Kingma and Welling (2014) and Rezende et al. (2014). We reparameterize the model as $\theta = \exp(m + \sigma \odot \epsilon)$, where $\epsilon$ has the same shape as $m$ and $\sigma$, with each element following a standard normal distribution $\mathcal{N}(0, 1)$, and where $\odot$ denotes the element-wise product. This trick enables us to rewrite the first, intractable, expectation term of the ELBO in (10) as

\[ \mathbb{E}_{q}\!\left[ \log p(\mathcal{D}, \theta \mid \eta) \right] \;=\; \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\!\left[ \log p\big(\mathcal{D}, \exp(m + \sigma \odot \epsilon) \mid \eta\big) \right]. \tag{11} \]
The second term of the ELBO in (10) is the entropy of the log-normal distribution, which can be expressed, up to a constant, as $\sum_i (m_i + \log \sigma_i)$. The ELBO can then be estimated by Monte-Carlo integration as

\[ \mathcal{L}(\phi, \eta) \;\approx\; \frac{1}{S} \sum_{s=1}^{S} \Big[ \log p\big(\mathcal{D} \mid \theta^{(s)}\big) + \log p\big(\theta^{(s)} \mid \eta\big) \Big] \;+\; \sum_i \big( m_i + \log \sigma_i \big), \tag{12} \]

where $S$ is the number of Monte-Carlo samples $\theta^{(s)} = \exp(m + \sigma \odot \epsilon^{(s)})$. Note that the first term of (12) is the cost function of the MAP problem (8) evaluated at $\theta^{(s)}$, for $s = 1, \ldots, S$. Hence, the E-step boils down to maximizing the right-hand side of (12) with respect to $\phi$ using gradient ascent.
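The E-step estimate can be sketched as follows, with the reparameterization making the Monte-Carlo estimate of (12) differentiable with respect to the variational parameters. The function names and the generic `log_joint` callback (standing for the joint log-density of data and parameters) are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(log_joint, m, log_s, n_samples=1):
    """Monte-Carlo estimate of the ELBO in Eq. (12) under a mean-field
    log-normal variational posterior (illustrative sketch).

    theta^(s) = exp(m + sigma * eps), eps ~ N(0, I)   [reparameterization trick]
    ELBO ~ (1/S) sum_s log_joint(theta^(s)) + sum_i (m_i + log sigma_i) + const.
    """
    sigma = np.exp(log_s)
    total = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(m.shape)
        theta = np.exp(m + sigma * eps)   # log-normal sample of the parameters
        total += log_joint(theta)
    entropy = np.sum(m + log_s)           # log-normal entropy, up to a constant
    return total / n_samples + entropy
```

In an automatic-differentiation framework, gradients with respect to `m` and `log_s` flow through the sampled `theta`, which is exactly what makes the E-step a standard gradient-based optimization.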
In the M-step, the ELBO is used as a proxy for the evidence and is maximized with respect to the hyperparameters $\eta$. Again, we rely on the reparameterization technique and compute the unbiased estimate of the ELBO in (12). The maximizer $\hat{\eta}^{*}$ of the estimate (12) with respect to $\eta$ has a closed form that depends on the choice of prior. We provide the closed-form solutions in Appendix A for a few proposed priors that emulate common regularizers. To avoid fast changes in $\eta$ due to the variance of the Monte-Carlo integration, we take an update similar to the one in Bamler et al. (2019) and use a weighted average between the current estimate and the maximizer of the current Monte-Carlo estimate of the ELBO:

\[ \eta^{(\mathrm{new})} \;=\; (1 - \kappa)\, \eta^{(\mathrm{old})} + \kappa\, \hat{\eta}^{*}, \tag{13} \]

where $\kappa \in (0, 1]$ is the momentum term.
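The damped M-step of (13) is then a one-liner (again a sketch in our notation):

```python
def m_step_update(eta_old, eta_star, kappa=0.1):
    """Damped M-step of Eq. (13): eta <- (1 - kappa) * eta_old + kappa * eta_star.

    kappa in (0, 1] trades off responsiveness against the Monte-Carlo noise
    of eta_star; kappa = 1 recovers the undamped closed-form update.
    """
    return (1.0 - kappa) * eta_old + kappa * eta_star
```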
Algorithm 1 summarizes the proposed variational EM approach. (Source code is available publicly.) The computational complexity of the innermost loop of Algorithm 1 is $S$ times the complexity of one iteration of gradient descent on the log-likelihood. However, as observed in recent studies on variational inference, using $S = 1$ is usually sufficient in many applications (Kingma and Welling, 2014). Hence, we use $S = 1$ in all our experiments, leading to the same per-iteration computational complexity as MLE using gradient descent.
5 Experimental Results
We carry out two sets of experiments. First, we perform a link-prediction task on synthetic data to show that our approach can accurately recover the support of the excitation matrix of the MHP from short sequences. Second, we perform an event-prediction task on real datasets of short sequences to show that our approach outperforms state-of-the-art methods in terms of predictive log-likelihood.
We run our experiments in two different settings. First, in a parametric setting where the exponential form of the excitation function is known, we compare our approach (VI-EXP) to the state-of-the-art MLE-based method (MLE-ADM4) of Zhou et al. (2013a). Second, we use a nonparametric setting where no assumption is made on the shape of the excitation function. We then set the excitation function to be a mixture of Gaussian kernels, defined as

\[ \phi_{ij}(t) \;=\; \sum_{b=1}^{B} w_{ij}^{b} \exp\!\left( -\frac{(t - l_b)^2}{2 s_b^2} \right), \tag{14} \]

where $l_b$ and $s_b$ are the known location and scale of the $b$-th kernel. In this setting, we compare our approach (VI-SG) to the state-of-the-art MLE-based method (MLE-SGLP) of Xu et al. (2016) with the same basis kernels. (We also performed the experiments with other approaches designed for large-scale datasets, but their performance was below that of the reported baselines (Achab et al., 2017; Linderman and Adams, 2014, 2015).) Let us stress that the parametric methods have a strong advantage over the nonparametric ones, because they are given the true value of the exponential decay $\beta$.
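The Gaussian-mixture excitation of (14) is straightforward to evaluate; the following sketch uses our own argument names (`w` for the basis weights, `locs` and `scales` for the kernel locations and scales):

```python
import numpy as np

def excitation(t, w, locs, scales):
    """Nonparametric excitation of Eq. (14): a weighted mixture of Gaussian
    basis kernels, phi(t) = sum_b w_b * exp(-(t - l_b)^2 / (2 s_b^2))."""
    t = np.asarray(t, dtype=float)
    return sum(w_b * np.exp(-(t - l) ** 2 / (2.0 * s ** 2))
               for w_b, l, s in zip(w, locs, scales))
```

Only the weights `w` are learned; the locations and scales are fixed in advance, which keeps the problem linear in the parameters.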
As our VI approach returns a posterior over the parameters, rather than a point estimate, we use the mode of the approximate log-normal posterior as the inferred edge weights (aggregated over the basis functions in the nonparametric setting). To mimic the regularization schemes of the baselines, we use a Laplace prior on the edge weights to enforce sparsity, and a Gaussian prior on the exogenous rates $\mu$. We tune the hyperparameters of the baselines using grid search. (More details are provided in Appendix E.)
5.1 Synthetic Data
We first evaluate the performance of our VI approach on simulated data. We generate random Erdős–Rényi graphs, and a sequence of observations is then generated from an MHP with the exponential excitation functions defined in (2). The exogenous rates and the edge weights are sampled independently and uniformly at random. Results are averaged over several random graphs with multiple simulations each. For reproducibility, a detailed description of the experimental setup is provided in Appendix E.

To investigate whether the support of the excitation matrix can be accurately recovered under small data, we evaluate the performance of each approach on three metrics (Zhou et al., 2013b; Xu et al., 2016; Figueiredo et al., 2018).
F1-score. We zero out small weights using a threshold and measure the F1-score of the resulting binary edge-classification problem. (Additional results with varying thresholds are provided in Appendix D.)

Precision@k. Instead of thresholding, we also report the precision@k, defined as the average fraction of correctly identified edges among the top $k$ largest estimated weights. Since the proposed VI approach gives an estimate of uncertainty via the variance of the posterior, we select the edges with high weights and low uncertainty, i.e., the $k$ edges with the lowest ratio of standard deviation over weight.

Relative error. To evaluate the distance of the estimated weights $\hat{w}$ from the ground-truth ones $w$, we use the averaged relative error, defined as $|\hat{w} - w| / |w|$ when $w \neq 0$, and as $|\hat{w}|$ when $w = 0$. This metric is more sensitive to errors in small weights, and therefore penalizes false-positive over false-negative errors.
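The ranking metric above can be sketched as follows; the ranking by the ratio of posterior standard deviation to weight (lowest first) is the uncertainty-aware variant used by our VI approach, while ranking by weight alone matches the MLE baselines (function and argument names are ours):

```python
import numpy as np

def precision_at_k(w_est, w_true, k, std=None):
    """Precision@k for edge recovery (illustrative sketch of the metric).

    Ranks candidate edges by estimated weight or, when a posterior standard
    deviation `std` is available, by the ratio std/weight (most confident first),
    and returns the fraction of the top-k edges that are true edges.
    """
    true_edges = set(zip(*np.nonzero(w_true)))
    if std is None:
        scores = -w_est                              # largest weights first
    else:
        scores = std / np.maximum(w_est, 1e-12)      # lowest uncertainty ratio first
    order = np.dstack(np.unravel_index(np.argsort(scores, axis=None),
                                       w_est.shape))[0]
    top = [tuple(e) for e in order[:k]]
    return sum(e in true_edges for e in top) / k
```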
We investigate the sensitivity of each approach to the amount of data available for training by varying the size of the training set, i.e., the number of events per node. Results are shown in Figure 1. Our approach improves the results in both parametric and nonparametric settings for all metrics, and the improvements are more substantial in the nonparametric setting. While the accuracy on the top edges is similar for VI-SG and MLE-SGLP in terms of precision@20, VI-SG substantially improves the F1-score on short training sequences. The reason for this improvement is that MLE-SGLP has a much higher false-positive rate, which hurts the F1-score but does not affect the precision@20. VI-SG is even able to reach the same F1-score as the parametric baseline MLE-ADM4 with far fewer training events. (We present additional results with various thresholds in Appendix D.) Note that VI-SG optimizes the hyperparameters with minimal additional cost.
In the next experiment, we focus on the nonparametric setting: we fix the length of the observation window and study the effect of increasing the number of basis functions $B$ on the performance of the algorithms. The results are shown in Figure 2. We see that our approach is more robust to the choice of $B$ than MLE-SGLP. A possible explanation for this behavior is that MLE-SGLP overfits due to the increasing number of model parameters.
Finally, we investigate the parameters of the model learned by our VI-EXP approach. In Figure 3a, we use the variance of the approximate posterior as a measure of confidence for edge identification, and we report the distribution of the ratio of standard deviation over weight for both the true-positive and the false-positive edges. (Similar results hold for the true- and false-negative edges.) The false-positive edges have a higher uncertainty than the true-positive ones. This is relevant when we cannot identify all edges due to lack of data but still wish to identify a subset of edges with high confidence. In addition, Figure 3b confirms that, as expected, the optimized weight priors are much larger for true edges of the ground-truth excitation matrix than for non-edges.
5.2 Real Data
We also evaluate the performance of our approach on the following three small datasets:

Epidemics. This dataset contains records of infections of individuals, along with their corresponding district of residence, during the Ebola epidemic in West Africa in 2014–2015 (Garske et al., 2017). To learn the propagation network of the epidemic, we consider the 54 districts as processes and define infection records as events.

Stock market. This dataset contains the stock prices of high-tech companies sampled at regular intervals on the New York Stock Exchange over several days in April 2008 (Etesami et al., 2016). We consider each stock as a process and record an event every time a stock price changes by a fixed fraction of its current value.

Enron email. This dataset contains emails between employees of Enron, from the Enron corpus. We consider all employees with more than a threshold number of received emails as processes and record an event every time an employee receives an email.
We perform an event-prediction task to show that our approach outperforms the state-of-the-art methods in terms of predictive log-likelihood. To do so, we use the first 70% of the events as the training set, and we compute the held-out average log-likelihood on the remaining 30%. We present the results in Table 1.
We first see that the nonparametric methods outperform the parametric ones on both the Epidemics dataset and the Stock market dataset. This suggests that the exponential excitation function might be too restrictive to fit their excitation patterns. In addition, our nonparametric approach VI-SG significantly outperforms MLE-SGLP on all datasets. The improvement is particularly clear for the Epidemics dataset, which has the smallest number of events per dimension. Indeed, the top edges learned by VI-SG correspond to contiguous districts, as expected. This is not the case for MLE-SGLP, for which the top learned edges correspond to districts that are far from each other.
Table 1: Dataset statistics and average predictive log-likelihood of each method.

Dataset      | #dim ($D$) | #events ($N$) | VI-SG | MLE-SGLP | VI-EXP | MLE-ADM4
Epidemics    |            |               |       |          |        |
Stock market |            |               |       |          |        |
Enron email  |            |               |       |          |        |
6 Conclusion
We proposed a novel approach to learning the excitation matrix of a multivariate Hawkes process in the presence of short observation sequences. We observed that state-of-the-art methods are sensitive to the amount of data used for training. The common tool to tackle this problem is to design smarter regularization schemes; however, all maximum-likelihood-based methods suffer from a common limitation: all the model parameters are regularized equally, with only a few hyperparameters. We developed a variational expectation-maximization algorithm that is able to (1) optimize over an extended set of hyperparameters with almost no additional cost and (2) take into account the uncertainty of the learned model parameters by fitting a posterior distribution over them. We performed experiments on both synthetic and real datasets and showed that our approach outperforms state-of-the-art methods in small-data regimes, while maintaining the same performance in the large-data regime.
Acknowledgments
We would like to thank Negar Kiyavash and Jalal Etesami for the valuable discussions and insightful feedback at the early stage of this work. The work presented in this paper was supported in part by the Swiss National Science Foundation under grant number 200021182407.
References

Achab et al. (2017). Uncovering causality from multivariate Hawkes integrated cumulants. The Journal of Machine Learning Research 18(1), pp. 6998–7025.

Arnold et al. (2007). Temporal causal modeling with graphical Granger methods. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '07), New York, NY, USA, pp. 66–75.

Bacry et al. (2015a). A generalization error bound for sparse and low-rank multivariate Hawkes processes. arXiv preprint arXiv:1501.00725.

Bacry et al. (2015b). Hawkes processes in finance. Market Microstructure and Liquidity 1(01), 1550005.

Bamler and Mandt (2017). Dynamic word embeddings. In Proceedings of the 34th International Conference on Machine Learning.

Bamler et al. (2019). Augmenting and tuning knowledge graph embeddings. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.

Barkan (2017). Bayesian neural word embedding. In AAAI, pp. 3135–3143.

Bernardo et al. (2003). The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. Bayesian Statistics 7, pp. 453–464.

Blei et al. (2017). Variational inference: a review for statisticians. Journal of the American Statistical Association 112(518), pp. 859–877.

Daley and Vere-Jones (2003). An Introduction to the Theory of Point Processes, Vol. I: Elementary Theory and Methods. Second edition, Probability and its Applications, Springer-Verlag, New York.

Eichler et al. (2017). Graphical modeling for multivariate Hawkes processes with nonparametric link functions. Journal of Time Series Analysis 38(2), pp. 225–242.

Eichler (2012). Graphical modelling of multivariate time series. Probability Theory and Related Fields 153(1), pp. 233–268.

Etesami et al. (2016). Learning network of multivariate Hawkes processes: a time series approach. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence (UAI '16), pp. 162–171.

Figueiredo et al. (2018). Fast estimation of causal interactions using Wold processes. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS '18), pp. 2975–2986.

Garske et al. (2017). Heterogeneities in the case fatality ratio in the West African Ebola outbreak 2013–2016. Philosophical Transactions of the Royal Society B: Biological Sciences 372(1721), 20160308.

Hansen et al. (2015). Lasso and probabilistic inequalities for multivariate point processes. Bernoulli 21(1), pp. 83–143.

Ji et al. (2017). Bayesian paragraph vectors. Symposium on Advances in Approximate Bayesian Inference.

Kingma and Welling (2014). Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR).

Lemonnier and Vayatis (2014). Nonparametric Markovian learning of triggering kernels for mutually exciting and mutually inhibiting multivariate Hawkes processes. In Machine Learning and Knowledge Discovery in Databases, Berlin, Heidelberg, pp. 161–176.

Linderman and Adams (2014). Discovering latent network structure in point process data. In International Conference on Machine Learning, pp. 1413–1421.

Linderman and Adams (2015). Scalable Bayesian inference for excitatory point process networks. arXiv preprint arXiv:1507.03228.

Linderman et al. (2017). Bayesian inference for latent Hawkes processes. NeurIPS Symposium on Advances in Approximate Bayesian Inference.

Quinn et al. (2015). Directed information graphs. IEEE Transactions on Information Theory 61(12), pp. 6887–6909.

Rao et al. (2008). Using directed information to build biologically relevant influence networks. Journal of Bioinformatics and Computational Biology 6(03), pp. 493–519.

Rezende et al. (2014). Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning.

Schreiber (2000). Measuring information transfer. Physical Review Letters 85, pp. 461–464.

Snoek et al. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959.

Xu et al. (2016). Learning Granger causality for Hawkes processes. In Proceedings of the 33rd International Conference on Machine Learning, pp. 1717–1726.

Xu et al. (2017). Learning Hawkes processes from short doubly-censored event sequences. In Proceedings of the 34th International Conference on Machine Learning (ICML '17), pp. 3831–3840.

Zhang et al. (2018). Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Zhou et al. (2013a). Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. In AISTATS, JMLR Workshop and Conference Proceedings, Vol. 31, pp. 641–649.

Zhou et al. (2013b). Learning triggering kernels for multi-dimensional Hawkes processes. In International Conference on Machine Learning, Vol. 28, pp. 1301–1309.
Appendix A Different Priors
In this section, we provide the probabilistic interpretation as a prior of several regularizers commonly used in the literature.

[leftmargin=*]

regularizer: Perhaps the most commonly used regularizer in MHPs is the regularizer .
regularizer can be interpreted as a zeromean Gaussian distribution over the weights,
i.e.,where is the variance.

regularizer: This regularizer (also known as lasso regularizer) is considered as a convex surrogate for the (pseudo) norm. Hence, it promotes the sparsity of the parameters. It can be interpreted as a Laplace distribution over the weights, i.e.,

Low-rank regularizer: To achieve a low-rank excitation matrix $W$, a nuclear-norm penalty $\|W\|_*$ on $W$ is often used as a regularizer Zhou et al. [2013a], thus enabling clustering structures in $W$. For an excitation matrix $W$, let $\sigma_1(W), \dots, \sigma_d(W)$ denote its singular values; then the different $\sigma_i(W)$ are independent for different $i$, and the prior over each $\sigma_i(W)$ is $p(\sigma_i(W) \mid \lambda) = C \exp\left(-\lambda \sigma_i(W)\right)$,
where $C$ is a normalization constant.

Group-lasso regularizer: This regularizer is used in Xu et al. [2016] in the nonparametric setting defined in Section 3, where the excitation function is approximated by a linear combination of $M$ basis functions, parameterized by the coefficient vectors $a_{ij} \in \mathbb{R}^M$. In this case, the $\ell_2$ norm of each $a_{ij}$ is assumed to have a Laplace distribution, i.e., $p(a_{ij} \mid \lambda) = C \exp\left(-\lambda \|a_{ij}\|_2\right)$,
where $C$ is a normalization constant.
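As an illustration, each of the penalties above is the negative log-density of its prior, up to an additive constant. The following minimal NumPy sketch (function names and hyperparameter symbols $\sigma^2$, $b$, $\lambda$ are ours, chosen generically) makes the correspondence explicit:

```python
import numpy as np

def l2_neg_log_prior(w, sigma2):
    """Negative log of an isotropic zero-mean Gaussian prior,
    dropping constants: ||w||_2^2 / (2 * sigma^2)."""
    return np.sum(w ** 2) / (2.0 * sigma2)

def l1_neg_log_prior(w, b):
    """Negative log of an i.i.d. Laplace prior with scale b,
    dropping constants: ||w||_1 / b."""
    return np.sum(np.abs(w)) / b

def nuclear_neg_log_prior(A, lam):
    """Negative log of the low-rank prior, dropping constants:
    lam * ||A||_*, the nuclear norm (sum of singular values)."""
    return lam * np.sum(np.linalg.svd(A, compute_uv=False))

def group_lasso_neg_log_prior(a, lam):
    """a has shape (d, d, M): one length-M basis-coefficient vector
    per edge (i, j). Penalty: lam * sum_ij ||a_ij||_2."""
    return lam * np.sum(np.linalg.norm(a, axis=-1))
```

Minimizing a regularized objective with one of these penalties is then the same as maximizing the corresponding log-posterior.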
Appendix B Hyperparameter Update
Below, we give a closed-form solution to the hyperparameter update in (13) for two priors used in the experiments of Section 5. The update rules for the other priors are similar. We start by rewriting the joint distribution as
(15) $\log p(x, \theta \mid \alpha) = \log p(x \mid \theta) + \log p(\theta \mid \alpha),$
where the first term (the likelihood) is not a function of the hyperparameters $\alpha$ and only the second term (the prior) is a function of $\alpha$. Hence, minimizing $-\log p(x, \theta \mid \alpha)$ over $\alpha$ amounts to minimizing $-\log p(\theta \mid \alpha)$.
$\ell_2$ regularizer:
For the $\ell_2$ regularizer, we have
$-\log p(\theta \mid \sigma^2) = \frac{\|\theta\|_2^2}{2\sigma^2} + \frac{d}{2} \log \sigma^2 + \text{const},$
where $d$ is the dimension of $\theta$, and the closed-form update hence becomes
(16) $\sigma^2 = \frac{\|\theta\|_2^2}{d}.$
$\ell_1$ regularizer:
For the $\ell_1$ regularizer, we have
$-\log p(\theta \mid b) = \frac{\|\theta\|_1}{b} + d \log b + \text{const},$
and the closed-form update hence becomes
(17) $b = \frac{\|\theta\|_1}{d}.$
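These two closed forms are the standard maximum-likelihood estimators of a Gaussian variance and a Laplace scale. A short sketch (function names are ours) that also checks numerically that the Gaussian update (16) beats nearby values of $\sigma^2$:

```python
import numpy as np

def gaussian_var_update(w):
    """Closed-form maximizer of the Gaussian log-prior over sigma^2:
    sigma^2* = ||w||_2^2 / d, as in (16)."""
    return np.sum(w ** 2) / w.size

def laplace_scale_update(w):
    """Closed-form maximizer of the Laplace log-prior over the scale b:
    b* = ||w||_1 / d, as in (17)."""
    return np.sum(np.abs(w)) / w.size

def gaussian_log_prior(w, sigma2):
    """Full Gaussian log-density, including the normalizing term."""
    d = w.size
    return -0.5 * d * np.log(2 * np.pi * sigma2) - np.sum(w ** 2) / (2 * sigma2)

rng = np.random.default_rng(0)
w = rng.normal(size=100)
s_star = gaussian_var_update(w)
# sanity check: the closed form dominates nearby candidate variances
for s in (0.5 * s_star, 2.0 * s_star):
    assert gaussian_log_prior(w, s_star) > gaussian_log_prior(w, s)
```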
Appendix C Simple Optimization of the Hyperparameters
In this section, we show that we cannot simply find the hyperparameters by optimizing the negative log-likelihood in (4) or the MAP objective in (8) over them.
First, note that minimizing the regularized negative log-likelihood in (4) over the hyperparameter simply sets the prior variance $\sigma^2$ to infinity, i.e., removes the regularization.
Second, we show that maximizing the MAP objective in (8) over the hyperparameters also fails because it is unbounded from above. We show this for the case of the Gaussian prior defined by
(18) $p(w \mid \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{w^2}{2\sigma^2}\right),$
but the same result holds for other priors. The log of the Gaussian prior (18) is
(19) $\log p(w \mid \sigma^2) = -\frac{w^2}{2\sigma^2} - \frac{1}{2}\log \sigma^2 + c,$
where $c$ is a constant independent of $\sigma^2$. In the MAP objective (8), if we set the excitation weights $W = 0$ and the baselines $\mu_i = 1$, i.e., all processes are simple Poisson processes with rate 1 and no interaction between them, then the conditional intensity is $\lambda_i(t) = 1$ for all $i$ and $t$. The log-likelihood in (5) then becomes a constant, which is bounded from below. With $W = 0$, the log-prior (19) reduces to $-\frac{1}{2}\log \sigma^2 + c$; hence, for $\sigma^2 \to 0$, we get $\log p(W \mid \sigma^2) \to \infty$. The MAP objective is therefore unbounded from above, and maximizing it simultaneously over both the hyperparameters and the model parameters $\mu$ and $W$ would fail.
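The divergence is easy to check numerically: with $W = 0$ the likelihood term is fixed, while the log-prior at zero grows without bound as the variance shrinks. A tiny sketch (function name is ours):

```python
import numpy as np

def log_gaussian_prior_at_zero(sigma2):
    """Log-density of N(0, sigma2) evaluated at w = 0, i.e. (19) with w = 0:
    -0.5 * log(2 * pi * sigma2)."""
    return -0.5 * np.log(2 * np.pi * sigma2)

# With W = 0 the likelihood is a constant, so the MAP objective grows
# without bound as the prior variance shrinks:
vals = [log_gaussian_prior_at_zero(s) for s in (1e-2, 1e-4, 1e-8)]
assert vals[0] < vals[1] < vals[2]  # strictly increases as sigma^2 -> 0
```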
Appendix D Additional Experimental Results
We first carry out an additional set of experiments to evaluate the effect of zeroing out small weights using a threshold. To do so, we first introduce the following two performance metrics:


The false positive rate (FPR) is defined as the fraction of errors among the learnt edges,
$\mathrm{FPR} = \frac{|\hat{E} \setminus E|}{|\hat{E}|},$
where $E$ denotes the set of true edges, $\hat{E}$ the set of learnt edges, and $|\cdot|$ denotes the cardinality of a set.

Similarly, the false negative rate (FNR) is defined as the fraction of errors among the learnt non-edges,
$\mathrm{FNR} = \frac{|E \setminus \hat{E}|}{|\hat{E}^c|},$
where $\hat{E}^c$ denotes the set of learnt non-edges.
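The two metrics above can be computed directly from boolean adjacency matrices; a minimal sketch (function name is ours), where the learnt graph is obtained by thresholding the learnt weights:

```python
import numpy as np

def fpr_fnr(true_adj, learnt_adj):
    """FPR: fraction of learnt edges that are not true edges.
       FNR: fraction of learnt non-edges that are in fact true edges.
    Both inputs are boolean (d, d) adjacency matrices."""
    learnt_edges = learnt_adj
    learnt_non_edges = ~learnt_adj
    fpr = np.sum(learnt_edges & ~true_adj) / max(np.sum(learnt_edges), 1)
    fnr = np.sum(learnt_non_edges & true_adj) / max(np.sum(learnt_non_edges), 1)
    return fpr, fnr
```

For example, given a learnt weight matrix `W_hat` and a threshold `thr`, the learnt adjacency would be `np.abs(W_hat) > thr`.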
Figure 4 shows the effect of the number of samples on the F1-score for several choices of the threshold. We see that our proposed algorithm VI-EXP (resp. VI-SG) outperforms its MLE counterpart MLE-ADM4 (resp. MLE-SGLP) for all values of the threshold. With an increasing threshold, the F1-score of the MLE-based approaches improves, because the FPR decreases faster than the FNR increases, due to the sparsity of the graph. Note, however, that since we do not know the expected value of the true edge weights beforehand, it is not clear a priori what value to set for the threshold. Ideally, we choose the threshold to be as small as possible, which is the regime in which our variational inference algorithm outperforms the MLE-based methods the most.
In Figure 5, we plot Precision@$k$ for different values of $k$. The number of edges in the generated synthetic graphs is 195, so in Figure 5 we vary $k$ up to 195. We see that VI-EXP always has a better Precision@$k$ than its counterpart MLE-ADM4. VI-SG has the same Precision@$k$ as MLE-SGLP for small $k$ (up to 20); for larger $k$, MLE-SGLP has a slightly better Precision@$k$. Note that Precision@$k$ focuses only on the accuracy of the top $k$ edges learnt by an algorithm and hence does not capture the imbalance between precision and recall for large $k$ in sparse graphs.

Finally, to evaluate the scalability of our approach, we ran additional simulations on problems of increasing dimension. As shown in Figure 7, the per-iteration running time of our approach VI-EXP (implemented in Python) scales better than that of MLE-ADM4 (implemented in C++). In addition, even though our gradient descent algorithm requires more iterations to converge, we show in Figure 7 that VI-EXP reaches the same F1-score as MLE-ADM4 faster.
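The Precision@$k$ metric can be computed by ranking the entries of the learnt weight matrix; a short sketch (function name is ours):

```python
import numpy as np

def precision_at_k(true_adj, weights, k):
    """Fraction of the k largest learnt weights (in absolute value)
    that correspond to true edges. true_adj is a (d, d) 0/1 matrix,
    weights is the learnt (d, d) excitation matrix."""
    flat = np.abs(weights).ravel()
    top_k = np.argsort(flat)[::-1][:k]   # indices of the k largest weights
    return float(np.mean(true_adj.ravel()[top_k]))
```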
Appendix E Reproducibility
In this section, we provide extensive details on the experimental setup used in Section 5. We first describe the implementation details of Algorithm 1. We then provide the details of the experimental setup for both the synthetic and real-data experiments.
E.1 Implementation details of Algorithm 1
We sampled Gaussian noise in line 4 of Algorithm 1. We set the momentum term in (13) to a fixed value: in our early experiments, we observed that the performance of the algorithm is not sensitive to the momentum term over a wide range of values, so we used the same value in all experiments. We used the Adam optimizer and multiplied the learning rate by a decay factor at each iteration. The model parameters were initialized by sampling from a normal distribution. We initialized all hyperparameters to the same value and observed that the performance of the algorithm is not sensitive to this initialization. The variational parameters were initialized by sampling from a normal distribution and then clipping them, which ensures that the initial variance of the algorithm is neither too small nor too big.
E.2 Synthetic experiments
To create the synthetic data, we generated random Erdős–Rényi graphs with a fixed number of nodes and a fixed edge probability. The sequences of observations were then generated from an MHP with the exponential excitation kernel defined in (2). The baseline rates and the edge weights were sampled uniformly at random. To generate the results of Figure 1, we varied the length of the observation window. The results were averaged over several graphs with multiple simulations each.
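For illustration, simulation from an MHP with exponential kernels can be sketched with Ogata's thinning algorithm in pure NumPy (the function name and the convention that an event in dimension $j$ raises $\lambda_i$ by $W_{ij}\beta$ are ours, not the paper's; the paper itself uses the tick library for simulation):

```python
import numpy as np

def simulate_hawkes_exp(mu, W, beta, T, seed=0):
    """Ogata's thinning for a d-dimensional Hawkes process with
    exponential kernels phi_ij(t) = W[i, j] * beta * exp(-beta * t)."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    d = mu.size
    g = np.zeros(d)              # excitation contributed by past events
    t = 0.0
    events = [[] for _ in range(d)]
    while True:
        lam_bar = mu.sum() + g.sum()      # valid upper bound: g only decays
        w = rng.exponential(1.0 / lam_bar)
        t += w
        if t >= T:
            return events
        g *= np.exp(-beta * w)            # decay excitation to time t
        lam = mu + g
        if rng.uniform() * lam_bar <= lam.sum():   # thinning (accept/reject)
            i = rng.choice(d, p=lam / lam.sum())   # attribute the event
            events[i].append(t)
            g += beta * W[:, i]           # event in dim i excites all dims
```

The process is stable (events do not explode) when the spectral radius of $W$ is below 1, which is why the edge weights are sampled from a small range.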
We used the tick Python library (https://github.com/XDataInitiative/tick) to run the MLE-based baseline approaches.
To tune the hyperparameters of the MLE-based approaches, we first manually searched for an initial range of parameters where the algorithm performed well. Then, we fine-tuned the hyperparameters using grid search to find the values giving the best results for the Precision@20 and F1-score metrics.
For MLE-SGLP, we used a grid over the regularization strength and the lasso_grouplasso_ratio parameter. We used the default values for the optimizer and verified that it converged. We finally chose the values of the regularization strength and of lasso_grouplasso_ratio that gave the best results.
For MLE-ADM4, we also used a grid over its two hyperparameters, making 21 different configurations overall. We finally chose the configuration that gave the best results for Precision@20 and F1-score.
E.3 Real-data experiments
For our approach VI-EXP and its MLE counterpart MLE-ADM4, the exponential decay parameter must be tuned for each dataset. As expected, both algorithms performed best with the same decay. For our approach VI-SG and its MLE counterpart MLE-SGLP, there are two parameters to tune: the number of Gaussian basis kernels $M$ and the cutoff time. The center of the $m$-th Gaussian kernel, with $m = 1, \dots, M$, is determined by the cutoff time, and its scale is defined as in (3). After manually finding an initial range of both parameters where the algorithms performed well, we fine-tuned them using grid search.
Epidemic dataset.
For our VI-SG algorithm, we did a grid search over $M$ and the cutoff time. We did not see a notable difference in performance between the grid points, as long as $M$ and the cutoff time were large enough. For the baseline MLE-SGLP, we did a grid search over $M$, the cutoff time, and the regularization hyperparameters, making 352 experiments overall, and kept the best-performing configuration. For our algorithm VI-EXP, we tried several values of the decay and chose the best one. For the baseline MLE-ADM4, we did a grid search over the decay and the regularization hyperparameters, and chose the best-performing configuration.
Stock market dataset.
In the stock market dataset, our algorithm VI-SG also performed better with a larger $M$. Since the experiments are slow for large $M$, we fixed $M$ and did a grid search over the cutoff time. For the baseline MLE-SGLP, we did a grid search over $M$, the cutoff time, and the regularization hyperparameters, and kept the best-performing values. For our algorithm VI-EXP, we tried several decays and chose the best one. For the baseline MLE-ADM4, we did a grid search over the decay and the regularization hyperparameters, and chose the best-performing configuration.
Enron email dataset.
The Enron email dataset is larger and its experiments are more computationally intensive, so we chose smaller ranges for hyperparameter tuning. For our algorithm VI-SG, we did a grid search over $M$ and the cutoff time, and kept the best-performing values. For the baseline MLE-SGLP, we did a grid search over $M$, the cutoff time, and the regularization hyperparameters, and kept the best-performing values. For our algorithm VI-EXP, we tried several decays and chose the best one. For the baseline MLE-ADM4, we did a grid search over the decay and the regularization hyperparameters, and chose the best-performing configuration.