 # Nonparametric Bayesian estimation of multivariate Hawkes processes

This paper studies nonparametric estimation of the parameters of multivariate Hawkes processes. We consider the Bayesian setting and derive posterior concentration rates. Rates are first derived for $L_1$-metrics on the stochastic intensities of the Hawkes process. We then deduce rates for the $L_1$-norm of the interaction functions of the process. Our results are exemplified using priors based on piecewise constant functions, with regular or random partitions, and priors based on mixtures of Beta distributions. Numerical illustrations are then proposed, with applications to inferring functional connectivity graphs of neurons in mind.


## 1 Introduction

In this paper we study the properties of Bayesian nonparametric procedures in the context of multivariate Hawkes processes. The aim of this paper is to give some general results on posterior concentration rates for such models and to study some families of nonparametric priors.

### 1.1 Hawkes processes

Hawkes processes, introduced by Hawkes (1971), are specific point processes which are extensively used to model data whose occurrences depend on previous occurrences of the same process. To describe them, we first consider a point process $N$ on $\mathbb{R}$. For any Borel set $A$, we denote by $N(A)$ the number of occurrences of $N$ in $A$. For short, for any $s < t$, $N[s,t]$ denotes the number of occurrences in $[s,t]$. We assume that, for any $t$, $N(\{t\}) \le 1$ almost surely. If $G_t$ denotes the history of $N$ until $t$, then $\lambda_t$, the predictable intensity of $N$ at time $t$, which represents the probability of observing a new occurrence at time $t$ given previous occurrences, is defined by

$$\lambda_t\,dt = P(dN_t = 1 \mid G_{t^-}),$$

where $dN_t$ denotes an arbitrarily small increment of $N$ at time $t$. For the case of univariate Hawkes processes, we have

$$\lambda_t = \phi\Big(\int_{-\infty}^{t^-} h(t-s)\,dN_s\Big),$$

for $\phi : \mathbb{R} \to \mathbb{R}^+$ and $h : \mathbb{R}^+ \to \mathbb{R}$. We recall that the last integral means

$$\int_{-\infty}^{t^-} h(t-s)\,dN_s = \sum_{T_i \in N:\ T_i < t} h(t - T_i).$$

The case of linear Hawkes processes corresponds to $\phi(x) = \nu + x$, with $\nu > 0$ and $h \ge 0$. The parameter $\nu$ is the spontaneous rate and $h$ is the self-exciting function. We now assume that $N$ is a marked point process, meaning that each occurrence of $N$ is associated to a mark $k \in \{1,\dots,K\}$, see MR1950431. In this case, we can identify $N$ with a multivariate point process $(N^1,\dots,N^K)$, and for any Borel set $A$, $N^k(A)$ denotes the number of occurrences of $N$ in $A$ with mark $k$. In the sequel, we only consider linear multivariate Hawkes processes, so we assume that $\lambda^k_t$, the intensity of $N^k$, is

$$\lambda^k_t = \nu_k + \sum_{\ell=1}^{K} \int_{-\infty}^{t^-} h_{\ell,k}(t-u)\,dN^\ell_u, \tag{1.1}$$

where $\nu_k > 0$ and $h_{\ell,k}$, which is assumed to be non-negative and supported by $\mathbb{R}^+$, is the interaction function of $N^\ell$ on $N^k$. Theorem 7 of BM1 shows that if the matrix $\rho = (\rho_{\ell,k})_{\ell,k}$, with

$$\rho_{\ell,k} = \int_0^{+\infty} h_{\ell,k}(t)\,dt, \qquad \ell,k = 1,\dots,K, \tag{1.2}$$

has a spectral radius strictly smaller than 1, then there exists a unique stationary distribution for the multivariate process with the previous dynamics and finite average intensity.
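This stationarity condition is straightforward to check numerically. Below is a minimal sketch in Python; the exponential interaction functions and all numerical values are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Stationarity check: the matrix rho of L1 norms of the interaction
# functions must have spectral radius strictly smaller than 1. The
# exponential kernels h_{l,k}(t) = a[l,k] * exp(-b*t) and all values
# below are illustrative assumptions, not taken from the paper.

def spectral_radius(rho):
    """Largest modulus among the eigenvalues of the interaction matrix."""
    return np.max(np.abs(np.linalg.eigvals(rho)))

# K = 2 processes; for exponential kernels, rho_{l,k} = a[l,k] / b.
a = np.array([[0.4, 0.3],
              [0.2, 0.5]])
b = 1.0
rho = a / b

r = spectral_radius(rho)
print(f"spectral radius = {r:.3f}, stationary: {r < 1}")
```
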

Hawkes processes have been extensively used in a wide range of applications. They are used to model earthquakes VJ-O; ogata88; MR1941459, interactions in social networks arXiv:1203.3516; icml2013_zhou13; Li:2014:LPM:2893873.2893891; arXiv:1501.00725; crane2008robust; mitchell2009hawkes; yang2013mixture, financial data embrechts2011multivariate; bacry2015hawkes; bacry2016estimation; bacry2013modelling; ait2015modeling, violence rates mohler2011self; porter2012self, genomes gusto2005fado; CSWH; MR2722456 or neuronal activities Brillinger1988; chornoboy1988maximum; okatan2005analyzing; paninski2007statistical; pillow2008spatio; HRR; reynaud2014goodness; reynaud2013inference, to name but a few.

Parametric inference for Hawkes models based on the likelihood is the most common in the literature and we refer the reader to ogata88; CSWH for instance. Nonparametric estimation was first considered by Reynaud-Bouret and Schbath MR2722456, who proposed a procedure based on the minimization of a penalized least-squares criterion for univariate Hawkes processes. Their results have been extended to the multivariate setting by Hansen, Reynaud-Bouret and Rivoirard HRR, where the penalty is replaced with an $\ell_1$-penalty. The resulting Lasso-type estimate leads to an easily implementable procedure providing sparse estimation of the structure of the underlying connectivity graph. To generalize this procedure to the high-dimensional setting, Chen, Witten and Shojaie MR3634334 proposed a simple and computationally inexpensive edge screening approach, whereas Bacry, Gaïffas and Muzy arXiv:1501.00725 combine $\ell_1$ and trace norm penalizations to take into account the low rank property of their self-excitement matrix. Very recently, to deal with non-positive interaction functions, Chen, Shojaie, Shea-Brown and Witten chen:Shoj:shea:witten combine the thinning process representation and a coupling construction to bound the dependence coefficient of the Hawkes process. Other alternatives based on spectral methods Bacry2012 or on estimation through the resolution of a Wiener-Hopf system MR3480107 can also be found in the literature. These are all frequentist methods; Bayesian approaches for Hawkes models have received much less attention. To the best of our knowledge, the only contributions to Bayesian inference are due to Rasmussen MR3085883 and Blundell, Beck and Heller NIPS2012_4834, who explored parametric approaches and used MCMC to approximate the posterior distribution of the parameters.

### 1.2 Our contribution

In this paper, we study nonparametric posterior concentration rates, when $T \to \infty$, for estimating the parameter $f = ((\nu_k)_k, (h_{\ell,k})_{\ell,k})$ by using realizations of the multivariate process $(N^k_t)_{k=1,\dots,K}$ for $t \in [0,T]$. Analyzing asymptotic properties in the setting where $T \to \infty$ means that the observation time becomes very large, hence providing a large number of observations. Note that throughout the paper, $K$, the number of observed processes, is assumed to be fixed and can be viewed as a constant. Considering $K \to \infty$ is a very challenging problem beyond the scope of this paper. Using the general theory of ghosal:vdv:07, we express the posterior concentration rates in terms of simple and usual quantities associated to the prior on $f$, and under mild conditions on the true parameter $f_0$. Two types of posterior concentration rates are provided: the first one is in terms of the $L_1$-distance on the stochastic intensity functions and the second one is in terms of the $L_1$-distance on the parameter $f$ (see precise notations below). To the best of our knowledge, these are the first theoretical results on Bayesian nonparametric inference in Hawkes models. Moreover, these are the first results on $L_1$-convergence rates for the interaction functions $h_{\ell,k}$. In the frequentist literature, theoretical results are given either in terms of the error on the stochastic intensity, as in arXiv:1501.00725 and MR3480107, or in terms of the error on the interaction functions themselves, the latter being much more involved, as in MR2722456 and HRR. In MR2722456, the estimator is constructed using a frequentist model selection procedure with a specific family of models based on piecewise constant functions. In the multivariate setting of HRR, more generic families of approximation models are considered (wavelet or Fourier dictionaries) and then combined with a Lasso procedure, but under a somewhat restrictive assumption on the size of the models that can be used to construct the estimators (see Section 5.2 of HRR).
Our general results do not involve such strong conditions and therefore allow us to work with approximating families of models that are quite general. In particular, we can apply them to two families of prior models on the interaction functions $h_{\ell,k}$: priors based on piecewise constant functions, with regular or random partitions, and priors based on mixtures of Beta distributions. From the posterior concentration rates, we also deduce a frequentist convergence rate for the posterior mean, seen as a point estimator. We finally propose an MCMC algorithm to simulate from the posterior distribution for the priors constructed from piecewise constant functions, and a simulation study is conducted to illustrate our results.

### 1.3 Overview of the paper

In Section 2, Theorem 1 first states the posterior convergence rates obtained for stochastic intensities. Theorem 2 constitutes a variation of this first result. From these results, we derive $L_1$-rates for the parameter $f$ (see Theorem 3) and for the posterior mean (see Corollary 1). Examples of prior models satisfying the conditions of these theorems are given in Section 2.3. In Section 3, numerical results are provided.

### 1.4 Notations and assumptions

We denote by $f_0 = ((\nu^0_k)_k, (h^0_{\ell,k})_{\ell,k})$ the true parameter and assume that the interaction functions $h^0_{\ell,k}$ are supported by a compact interval $[0,A]$, with $A$ assumed to be known. Given a parameter $f$, we denote by $\|\rho\|$ the spectral norm of the matrix $\rho$ associated with $f$ and defined in (1.2). We recall that $\|\rho\|$ provides an upper bound of the spectral radius of $\rho$ and we set

$$\mathcal{H} = \Big\{(h_{\ell,k})_{k,\ell=1,\dots,K};\ h_{\ell,k} \ge 0,\ \mathrm{supp}(h_{\ell,k}) \subset [0,A],\ \rho_{\ell,k} < \infty,\ \forall k,\ell = 1,\dots,K,\ \|\rho\| < 1\Big\}$$

and

$$\mathcal{F} = \Big\{f = \big((\nu_k)_{k=1,\dots,K}, (h_{\ell,k})_{k,\ell=1,\dots,K}\big);\ 0 < \nu_k < \infty,\ \forall k = 1,\dots,K,\ (h_{\ell,k})_{k,\ell=1,\dots,K} \in \mathcal{H}\Big\}.$$

We assume that $f_0 \in \mathcal{F}$ and denote by $\rho^0$ the matrix associated with $f_0$ in (1.2). For any function $h$ supported by $[0,A]$, we denote by $\|h\|_1 = \int_0^A |h(t)|\,dt$ the $L_1$-norm of $h$. With a slight abuse of notation, we also use $\|\cdot\|_1$ for $f$ and $f'$ belonging to $\mathcal{F}$:

$$\|f - f'\|_1 = \sum_{k=1}^K |\nu_k - \nu'_k| + \sum_{k=1}^K \sum_{\ell=1}^K \|h_{\ell,k} - h'_{\ell,k}\|_1. \tag{1.3}$$
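For piecewise constant interaction functions, the distance (1.3) reduces to finite sums and can be computed exactly. A minimal sketch (the grid representation and all numerical values are illustrative assumptions):

```python
import numpy as np

# Numerical evaluation of the L1 distance (1.3) between two parameters
# f = ((nu_k), (h_{l,k})) whose interaction functions are piecewise
# constant on a regular grid of [0, A]. All values are illustrative.

def l1_distance(nu, nu_p, H, H_p, A):
    """||f - f'||_1 = sum_k |nu_k - nu'_k| + sum_{k,l} ||h_{l,k} - h'_{l,k}||_1.

    H, H_p: arrays of shape (K, K, J) holding the step heights of the
    interaction functions on the regular partition of [0, A] into J bins.
    """
    J = H.shape[-1]
    bin_width = A / J
    return np.sum(np.abs(nu - nu_p)) + np.sum(np.abs(H - H_p)) * bin_width

K, J, A = 2, 4, 1.0
nu = np.array([1.0, 0.5]); nu_p = np.array([1.2, 0.5])
H = np.zeros((K, K, J)); H_p = np.zeros((K, K, J))
H[0, 0] = [1.0, 0.5, 0.25, 0.0]    # h_{1,1}: step heights
H_p[0, 0] = [0.8, 0.5, 0.25, 0.0]  # h'_{1,1} differs on the first bin
d = l1_distance(nu, nu_p, H, H_p, A)
print(d)   # |1.0 - 1.2| + |1.0 - 0.8| * 0.25 = 0.25
```
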

Finally, we consider $d_{1,T}$, the following stochastic distance on $\mathcal{F}$:

$$d_{1,T}(f, f') = \frac{1}{T} \sum_{k=1}^K \int_0^T |\lambda^k_t(f) - \lambda^k_t(f')|\,dt,$$

where $\lambda^k_t(f)$ and $\lambda^k_t(f')$ denote the stochastic intensities (introduced in (1.1)) associated with $f$ and $f'$ respectively. We denote by $N(\epsilon, \mathcal{H}, d)$ the covering number of a set $\mathcal{H}$ by balls of radius $\epsilon$ with respect to the metric $d$. We set, for any $\ell = 1,\dots,K$, $\mu^0_\ell$, the mean intensity under $P_0$:

$$\mu^0_\ell = E_0[\lambda^\ell_t(f_0)],$$

where $P_0$ denotes the stationary distribution associated with $f_0$ and $E_0$ is the corresponding expectation. We also write $a_T \lesssim b_T$ if $a_T/b_T$ is bounded when $T \to \infty$, and similarly $a_T \gtrsim b_T$ if $b_T/a_T$ is bounded.

## 2 Main results

This section contains the main results of the paper. We first provide an expression for the posterior distribution.

### 2.1 Posterior distribution

Using Proposition 7.3.III of MR1950431, and identifying a multivariate Hawkes process with a specific marked Hawkes process, we can write the log-likelihood function of the process observed on the interval $[0,T]$, conditional on $G_{0^-}$, as

$$L_T(f) := \sum_{k=1}^K \Big[\int_0^T \log\big(\lambda^k_t(f)\big)\,dN^k_t - \int_0^T \lambda^k_t(f)\,dt\Big]. \tag{2.1}$$

With a slight abuse of notation, we shall also denote instead of .

Recall that we restrict ourselves to the setup where, for all $\ell, k$, $h_{\ell,k}$ has support included in $[0,A]$ for some fixed $A > 0$. This hypothesis is very common in the context of Hawkes processes, see HRR. Note that, in this case, the conditional distribution of $N$ observed on the interval $[0,T]$ given $G_{0^-}$ is equal to its conditional distribution given the occurrences on $[-A, 0)$.

Hence, in the following, we assume that we observe the process on $[-A, T]$, but we base our inference on the log-likelihood (2.1), which is associated to the observation of $N$ on $[0,T]$. We consider a Bayesian nonparametric approach and denote by $\Pi$ the prior distribution on the parameter $f$. The posterior distribution is then formally equal to

$$\Pi(B \mid N, G_{0^-}) = \frac{\int_B \exp(L_T(f))\,d\Pi(f \mid G_{0^-})}{\int_{\mathcal{F}} \exp(L_T(f))\,d\Pi(f \mid G_{0^-})}.$$

We approximate it by the following pseudo-posterior distribution, which we write

$$\Pi(B \mid N) = \frac{\int_B \exp(L_T(f))\,d\Pi(f)}{\int_{\mathcal{F}} \exp(L_T(f))\,d\Pi(f)}, \tag{2.2}$$

which thus corresponds to choosing $\Pi(\cdot \mid G_{0^-}) = \Pi(\cdot)$.
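For concreteness, the log-likelihood (2.1) can be evaluated directly for a candidate parameter $f$. The sketch below uses exponential interaction functions and a crude trapezoidal approximation of the compensator; all parameter values and event times are illustrative assumptions, not from the paper:

```python
import numpy as np

# Direct numerical evaluation of the log-likelihood (2.1) for a linear
# multivariate Hawkes process. The exponential interaction functions
# h_{l,k}(t) = a[l,k] * exp(-b*t) for t > 0, the parameter values and
# the event times are all illustrative assumptions.

def intensity(t, k, nu, a, b, events):
    """lambda^k_t(f) = nu_k + sum_l sum_{T_i in N^l, T_i < t} h_{l,k}(t - T_i)."""
    lam = nu[k]
    for l, ev in enumerate(events):
        past = ev[ev < t]
        lam += np.sum(a[l, k] * np.exp(-b * (t - past)))
    return lam

def log_likelihood(T, nu, a, b, events, n_grid=5000):
    """L_T(f): log-intensity summed over events minus the compensator,
    the latter approximated by the trapezoidal rule on a regular grid."""
    grid = np.linspace(0.0, T, n_grid)
    dt = grid[1] - grid[0]
    total = 0.0
    for k in range(len(nu)):
        total += sum(np.log(intensity(t, k, nu, a, b, events))
                     for t in events[k])
        lam = np.array([intensity(t, k, nu, a, b, events) for t in grid])
        total -= np.sum(lam[1:] + lam[:-1]) * dt / 2.0
    return total

nu = np.array([0.5, 0.5])
a = np.array([[0.4, 0.1],
              [0.2, 0.3]])
b = 2.0
events = [np.array([0.3, 1.1, 2.0]), np.array([0.7, 1.9])]
ll = log_likelihood(3.0, nu, a, b, events)
print(ll)
```

This quantity is exactly what an MCMC sampler targeting the pseudo-posterior (2.2) would evaluate at each iteration, up to the prior density.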

### 2.2 Posterior convergence rates for d1,T and L1-metrics

In this section we give two results on posterior concentration rates, one in terms of the stochastic distance $d_{1,T}$ and another one in terms of the $L_1$-distance, which constitutes the main result of this paper. We define

$$\Omega_T = \Big\{\max_{\ell \in \{1,\dots,K\}} \sup_{t \in [0,T]} N^\ell[t-A, t) \le C_\alpha \log T\Big\} \cap \Big\{\sum_{\ell=1}^K \Big|\frac{N^\ell[-A,T]}{T} - \mu^0_\ell\Big| \le \delta_T\Big\}$$

with $\delta_T \to 0$ and $C_\alpha$ a positive constant not depending on $T$. From Lemmas 3 and 4 in Section 4.7, we have that for all $\alpha > 0$ there exist $C_\alpha$ and $\delta_T$, depending only on $\alpha$ and $f_0$, such that

$$P_0(\Omega_T^c) \le T^{-\alpha}, \tag{2.3}$$

when $T$ is large enough. In the sequel, we take $\alpha$ and $C_\alpha$ accordingly. Note in particular that, on $\Omega_T$,

$$\sum_{\ell=1}^K N^\ell[-A,T] \le N_0 T,$$

with $N_0$ a fixed constant, when $T$ is large enough. We then have the following theorem.

###### Theorem 1.

Consider the multivariate Hawkes process $N$ observed on $[-A, T]$, with likelihood given by (2.1). Let $\Pi$ be a prior distribution on $\mathcal{F}$. Let $\epsilon_T$ be a positive sequence such that $\epsilon_T \to 0$ and

$$\log\log T\, \log^3 T = o(T\epsilon_T^2).$$

For $B > 0$, we consider

$$B(\epsilon_T, B) := \Big\{(\nu_k, (h_{\ell,k})_\ell)_k : \max_k |\nu_k - \nu^0_k| \le \epsilon_T,\ \max_{\ell,k} \|h_{\ell,k} - h^0_{\ell,k}\|_2 \le \epsilon_T,\ \max_{\ell,k} \|h_{\ell,k}\|_\infty \le B\Big\}$$

and assume that the following conditions are satisfied for $T$ large enough.

• There exist $c_1 > 0$ and $B > 0$ such that

$$\Pi\big(B(\epsilon_T, B)\big) \ge e^{-c_1 T\epsilon_T^2}.$$

• There exists a subset $\mathcal{H}_T \subset \mathcal{H}$ such that

$$\frac{\Pi(\mathcal{H}_T^c)}{\Pi(B(\epsilon_T, B))} \le e^{-(2\kappa_T + 3) T\epsilon_T^2},$$

where $\kappa_T$ is defined in terms of the quantities introduced in (4.11) and (4.9).

• There exist $\zeta_0 > 0$ and $x_0 > 0$ such that

$$\log N\big(\zeta_0 \epsilon_T, \mathcal{H}_T, \|\cdot\|_1\big) \le x_0 T\epsilon_T^2.$$

Then, there exist $M > 0$ and $C > 0$ such that

$$E_0\Big[\Pi\big(d_{1,T}(f_0, f) > M\sqrt{\log\log T}\,\epsilon_T \mid N\big)\Big] \le C\,\frac{\log\log T\, \log^3 T}{T\epsilon_T^2} + P_0(\Omega_T^c) + o(1) = o(1).$$

Assumptions (i), (ii) and (iii) are very common in the literature on posterior convergence rates. As expressed by Assumption (ii), some conditions are required on the prior on the interaction functions $h_{\ell,k}$ but not on the spontaneous rates $\nu_k$. Apart from the usual concentration property of the prior around $\nu^0_k$ expressed in the definition of $B(\epsilon_T, B)$, which is in particular satisfied if the prior on $\nu_k$ has a positive continuous density with respect to Lebesgue measure, we have no further condition on the tails of the distribution of $\nu_k$.

###### Remark 1.

As appears in the proof of Theorem 1, the term $\sqrt{\log\log T}$ appearing in the posterior concentration rate can be dropped if $B(\epsilon_T, B)$ is replaced by

$$B_\infty(\epsilon_T, B) = \Big\{(\nu_k, (h_{\ell,k})_\ell)_k : \max_k |\nu_k - \nu^0_k| \le \epsilon_T,\ \max_{\ell,k} \|h_{\ell,k} - h^0_{\ell,k}\|_\infty \le \epsilon_T\Big\},$$

in Assumption (i). In this case, $\kappa_T$ in Assumption (ii) does not depend on $T$. This is used for instance in Section 2.3.1 to study random histogram priors, whereas mixtures of Beta priors are controlled using the $L_2$-norm.

Similarly to other general theorems on posterior concentration rates, we can consider some variants. Since the metric $d_{1,T}$ is stochastic, we cannot use slices of the form considered in Theorem 1 of ghosal:vdv:07; however, we can consider other forms of slices, using a similar idea as in Theorem 5 of ghosal:vdv:mixture:07. This is presented in the following theorem.

###### Theorem 2.

Consider the setting and assumptions of Theorem 1, except that Assumption (iii) is replaced by the following one: there exists a sequence of sets $(\mathcal{H}_{T,i})_{i \ge 1}$ covering $\mathcal{H}_T$ such that

$$\sum_{i=1}^\infty N\big(\zeta_0 \epsilon_T, \mathcal{H}_{T,i}, \|\cdot\|_1\big)\sqrt{\Pi(\mathcal{H}_{T,i})}\, e^{-x_0 T\epsilon_T^2} = o(1), \tag{2.4}$$

for some positive constants $\zeta_0$ and $x_0$. Then, there exists $M > 0$ such that

$$E_0\Big[\Pi\big(d_{1,T}(f_0, f) > M\sqrt{\log\log T}\,\epsilon_T \mid N\big)\Big] = o(1).$$

The posterior concentration rates of Theorems 1 and 2 are expressed in terms of the metric $d_{1,T}$ on the intensity functions, which is data dependent and therefore not completely satisfying for understanding concentration around the objects of interest, namely $f_0$. We now use Theorem 1 to provide a general result deriving a posterior concentration rate in terms of the $L_1$-norm.

###### Theorem 3.

Assume that the prior satisfies the following assumptions.

• There exists $\varepsilon_T \to 0$ such that

$$E_0\big[\Pi(A^c_{\varepsilon_T} \mid N)\big] = o(1) \quad \text{and} \quad P_0(D_T < \dots) = o(1),$$

where $A_{\varepsilon_T} = \{f : d_{1,T}(f_0, f) \le \varepsilon_T\}$ and $D_T$ is defined in the proofs.

• The prior on $\rho$ satisfies: for all $u_0 > 0$, when $T$ is large enough,

$$\Pi\Big(\|\rho\| > 1 - u_0 (\log T)^{1/6}\varepsilon_T^{1/3}\Big) \le e^{-2c_1 T\varepsilon_T^2}. \tag{2.5}$$

Then, for any $w_T \to \infty$,

$$E_0\big[\Pi(\|f - f_0\|_1 > w_T \varepsilon_T \mid N)\big] = o(1). \tag{2.6}$$
###### Remark 2.

Condition (i) of Theorem 3 is in particular verified under the assumptions of Theorem 1, with $\varepsilon_T = M\sqrt{\log\log T}\,\epsilon_T$ for $M$ a constant.

###### Remark 3.

Compared to Theorem 1, we also assume (ii), i.e. that the prior distribution puts very little mass near the boundary of the space $\{\|\rho\| < 1\}$. In particular, if under $\Pi$, $\|\rho\|$ has its support included in $[0, 1-t_0]$ for a fixed small $t_0 > 0$, then (2.5) is verified.

A consequence of the previous theorems is that the posterior mean converges to $f_0$ at the rate $w_T\varepsilon_T$, as described in the following corollary.

###### Corollary 1.

Under the assumptions of Theorem 1 or Theorem 2, together with (2.5), and denoting $\hat f$ the posterior mean, then for any $w_T \to \infty$,

$$P_0\big(\|\hat f - f_0\|_1 > w_T\varepsilon_T\big) = o(1).$$

The proof of Corollary 1 is given in Section 4.6. We now illustrate these general results on specific prior models.

### 2.3 Examples of prior models

The advantage of Theorems 1 and 3 is that the conditions required on the priors on the functions $h_{k,\ell}$ are quite standard, in particular if the functions are parameterized in the following way:

$$h_{k,\ell} = \rho_{k,\ell}\,\bar h_{k,\ell}, \qquad \int_0^A \bar h_{k,\ell}(u)\,du = 1.$$

We thus consider priors on $f$ following the scheme

$$\nu_\ell \overset{iid}{\sim} \Pi_\nu, \qquad \rho = (\rho_{k,\ell})_{k,\ell \le K} \sim \Pi_\rho, \qquad \bar h_{k,\ell} \overset{iid}{\sim} \Pi_h. \tag{2.7}$$

We consider $\Pi_\nu$ absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^+$, with positive and continuous density, and $\Pi_\rho$ a probability distribution on the set of $K \times K$ matrices with positive entries and spectral norm smaller than 1, with positive density with respect to Lebesgue measure and satisfying (2.5). We now concentrate on the nonparametric part, namely the prior distribution $\Pi_h$. Then, from Theorems 1 and 3, it is enough that $\Pi_h$ satisfies, for each $(k,\ell)$,

$$\Pi_h\big(\|\bar h - \bar h^0_{k,\ell}\|_2 \le \epsilon_T,\ \|\bar h\|_\infty \le B\big) \ge e^{-cT\epsilon_T^2},$$

for some $c > 0$ and $B > 0$, and such that there exists

$$\mathcal{F}_{1,T} \subset \Big\{h : [0,A] \to \mathbb{R}^+,\ \int_0^A h(x)\,dx = 1\Big\}$$

satisfying

$$\Pi_h(\mathcal{F}^c_{1,T}) \le e^{-CT\epsilon_T^2\log\log T}, \qquad N\big(\zeta\epsilon_T; \mathcal{F}_{1,T}; \|\cdot\|_1\big) \le x_0 T\epsilon_T^2, \tag{2.8}$$

for $\zeta$, $x_0$, $C > 0$ and $T$ large enough. Note that, from Remark 1, if we have that for all $(k,\ell)$

$$\Pi_h\big(\|\bar h - \bar h^0_{k,\ell}\|_\infty \le \epsilon_T,\ \|\bar h\|_\infty \le B\big) \ge e^{-cT\epsilon_T^2},$$

then it is enough to verify

$$\Pi_h(\mathcal{F}^c_{1,T}) \le e^{-CT\epsilon_T^2}, \qquad N\big(\zeta\epsilon_T; \mathcal{F}_{1,T}; \|\cdot\|_1\big) \le x_0 T\epsilon_T^2, \tag{2.9}$$

in place of (2.8).
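A draw from a prior of the form (2.7) can be sketched as follows; the concrete choices (an exponential prior for the $\nu_k$, rejection sampling of $\rho$ under the constraint $\|\rho\| < 1$, a regular histogram prior for the normalized $\bar h_{k,\ell}$) are illustrative placeholders, not the priors analyzed below:

```python
import numpy as np

rng = np.random.default_rng(0)

# One draw from a prior of the form (2.7). The choices below
# (Exp(1) for nu_k, rejection sampling of rho under ||rho|| < 1,
# a regular histogram for hbar_{k,l}) are illustrative placeholders.

K, A = 2, 1.0

# nu_k iid ~ Pi_nu: here Exp(1), a positive continuous density on R+
nu = rng.exponential(1.0, size=K)

# rho ~ Pi_rho: positive entries, resampled until the spectral norm is < 1
while True:
    rho = rng.uniform(0.0, 1.0, size=(K, K))
    if np.linalg.norm(rho, 2) < 1.0:
        break

# hbar_{k,l} iid ~ Pi_h: regular histogram, random J, Dirichlet weights
def draw_hbar():
    J = 1 + rng.poisson(3)
    w = rng.dirichlet(np.ones(J))
    return J * w / A                 # step heights; integrates to 1 on [0, A]

hbar = [[draw_hbar() for _ in range(K)] for _ in range(K)]
h_11 = rho[0, 0] * hbar[0][0]        # h_{1,1} = rho_{1,1} * hbar_{1,1}
print(nu, rho, h_11)
```
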

These conditions have been checked for a large selection of types of priors on the set of densities. We discuss here two cases: one based on random histograms, which make sense in particular in the context of modeling neuronal interactions, and a second based on mixtures of Betas, which leads to adaptive posterior concentration rates over a large collection of functional classes. To simplify the presentation, we assume that $A = 1$, but the generalization to any $A > 0$ is straightforward.

#### 2.3.1 Random histogram prior

These priors are motivated by the neuronal application, where one is interested in characterizing time zones when neurons are or are not interacting (see Section 3). Random histograms have been studied quite a lot recently for density estimation, in both semi- and nonparametric problems. We consider two types of random histograms: regular partition and random partition histograms. Random histogram priors are defined by

$$\bar h_{w,t,J} = \delta \sum_{j=1}^J \frac{w_j}{t_j - t_{j-1}}\,1_{I_j}, \qquad I_j = (t_{j-1}, t_j), \qquad \sum_{j=1}^J w_j = 1, \qquad \delta \sim \mathrm{Bern}(p), \tag{2.10}$$

with $t_0 = 0 < t_1 < \dots < t_J = 1$.

In both cases, the prior is constructed in the following hierarchical manner:

$$J \sim \Pi_J, \qquad e^{-c_1 x L_1(x)} \lesssim \Pi_J(J = x), \qquad \Pi_J(J > x) \lesssim e^{-c_2 x L_1(x)}, \qquad L_1(x) = 1 \ \text{or}\ L_1(x) = \log x,$$
$$(w_1, \dots, w_J) \mid J \sim \Pi_w, \tag{2.11}$$

where $c_1$ and $c_2$ are two positive constants. Denoting $\mathcal{S}_J$ the $J$-dimensional simplex, we assume that the prior $\Pi_w$ on $\mathcal{S}_J$ satisfies: for all $J$, for all $w^0 = (w^0_1, \dots, w^0_J) \in \mathcal{S}_J$, and all $u$ small enough, there exists $c > 0$ such that

$$\Pi_w\Big(\big(w^0_1 - u/J^2,\, w^0_1 + u/J^2\big) \times \dots \times \big(w^0_J - u/J^2,\, w^0_J + u/J^2\big)\Big) > e^{-cJ\log J}. \tag{2.12}$$

Many probability distributions on $\mathcal{S}_J$ satisfy (2.12). For instance, if $\Pi_w$ is the Dirichlet distribution $\mathcal{D}(\alpha_{1,J}, \dots, \alpha_{J,J})$ with $c_3 J^{-a} \le \alpha_{j,J} \le c_4$, for $a$, $c_3$ and $c_4$ three positive constants, then (2.12) holds, see for instance castillo:rousseau:2015. Also, consider the following hierarchical prior allowing some of the $w_j$'s to be equal to 0. Set

$$Z_j \overset{iid}{\sim} \mathrm{Be}(p), \quad j \le J, \qquad s_z = \sum_{j=1}^J Z_j,$$

and $j_1 < \dots < j_{s_z}$ the indices corresponding to $Z_j = 1$. Then,

$$(w_{j_1}, \dots, w_{j_{s_z}}) \sim \mathcal{D}(\alpha_{1,J}, \dots, \alpha_{s_z,J}), \qquad c_3 J^{-a} \le \alpha_{i,J} \le c_4, \qquad w_j = 0 \ \text{if}\ Z_j = 0.$$
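A draw of the weights under this hierarchical prior can be sketched as follows; the values of $p$, $J$ and the Dirichlet parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# One draw of the histogram weights under the hierarchical prior above:
# Z_j ~ Bernoulli(p) selects the active bins, the active weights follow
# a Dirichlet distribution, and the inactive weights are exactly 0. The
# values of p, J and the Dirichlet parameters are illustrative.

J, p = 8, 0.5
Z = rng.binomial(1, p, size=J)
s_z = int(Z.sum())

w = np.zeros(J)
if s_z > 0:
    # alpha_{i,J} should lie in [c3 * J^{-a}, c4]; a constant is used here
    w[Z == 1] = rng.dirichlet(np.full(s_z, 0.5))

# Resulting histogram on the regular partition of [0, 1]: heights J * w_j
hbar = J * w
print(Z, np.round(hbar, 3))
```
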

Regular partition histograms correspond to $t_j = j/J$ for $j = 0, \dots, J$, in which case we write $\bar h_{w,J}$ instead of $\bar h_{w,t,J}$; in random partition histograms we put a prior on $t = (t_1, \dots, t_{J-1})$. We now consider Hölder balls of smoothness $\beta$ and radius $L_0$, denoted $\mathcal{H}(\beta, L_0)$, and prove that the posterior concentration rate associated with both types of histogram priors is bounded by $w_T (T/\log T)^{-\beta/(2\beta+1)}$ for $\beta \le 1$, where $w_T$ is a sequence going to infinity. From Remark 1, we use the version of Assumption (i) based on

$$B_\infty(\epsilon_T, B) = \Big\{(\nu_k, (h_{\ell,k})_\ell)_k : \max_k |\nu_k - \nu^0_k| \le \epsilon_T,\ \max_{\ell,k} \|h_{\ell,k} - h^0_{\ell,k}\|_\infty \le \epsilon_T\Big\},$$

and need to verify (2.9). Then, applying Lemma 4 of the supplementary material of castillo:rousseau:2015, we obtain, for all $J$, if $\bar h^0$ is not the null function,

$$\Pi\big(\|\bar h_{w,J} - \bar h^0\|_\infty \le 2L_0 J^{-\beta} \mid J\big) \gtrsim p\, e^{-cJ\log T}$$

for some $c > 0$, if $p$ is a constant. If $\bar h^0$ is the null function, then

$$\Pi\big(\|\bar h_{w,J} - \bar h^0\|_\infty = 0\big) = 1 - p.$$

This thus implies that $\Pi(B_\infty(\epsilon_T, B)) \ge e^{-c_1 T\epsilon_T^2}$ for some $c_1 > 0$. This result holds both for the regular grid and random grid histograms, with a prior on the grid points as described below. Then condition (2.5) is verified if the prior on $\|\rho\|$ concentrates sufficiently away from 1; in particular, this condition holds for any $\beta$ if there exist $a', \tau > 0$ such that, when $u$ is small enough,

$$\Pi(\|\rho\| > 1 - u) \lesssim e^{-a' e^{1/u^\tau}}. \tag{2.13}$$

Moreover, setting $\mathcal{F}_{1,T} = \{\bar h_{w,J} : J \le J_1 (T/\log T)^{1/(2\beta+1)}\}$ for $J_1$ a constant large enough, then for all $\zeta > 0$,

$$N\big(\zeta\epsilon_T, \mathcal{F}_{1,T}, \|\cdot\|_1\big) \lesssim J_1 (T/\log T)^{1/(2\beta+1)}\log T.$$

Therefore, (2.9) is checked. We finally obtain the following corollary.

###### Corollary 2 (regular partition).

Under the random histogram prior (2.10) based on a regular partition, verifying (2.11) and (2.12), and if (2.13) is satisfied, then if for any $(k,\ell)$, $\bar h^0_{k,\ell}$ belongs to $\mathcal{H}(\beta, L_0)$ for $\beta \le 1$, then for any $w_T \to \infty$,

$$E_0\Big[\Pi\big(\|f - f_0\|_1 > w_T (T/\log T)^{-\beta/(2\beta+1)} \mid N\big)\Big] = o(1).$$

To extend this result to the case of random partition histogram priors, we consider the same prior on $J$ as in (2.11) and the following condition on the prior on $t = (t_1, \dots, t_{J-1})$. Writing $u_j = t_{j+1} - t_j$, $j = 0, \dots, J-1$, we have that $u = (u_0, \dots, u_{J-1})$ belongs to the $J$-dimensional simplex, and we consider a Dirichlet distribution on $u$.

###### Corollary 3.

Consider the random histogram prior (2.10) based on a random partition, with a prior on $J$ satisfying (2.11) and (2.12), and with a Dirichlet prior on $u$. If (2.13) is satisfied, then if for any $(k,\ell)$, $\bar h^0_{k,\ell}$ belongs to $\mathcal{H}(\beta, L_0)$ for $\beta \le 1$, then for any $w_T \to \infty$,

$$E_0\Big[\Pi\big(\|f - f_0\|_1 > w_T (T/\log T)^{-\beta/(2\beta+1)} \mid N\big)\Big] = o(1).$$

The proof of this corollary is given in Section 4.8. In the following section, we consider another family of priors suited for smooth functions and based on mixtures of Beta distributions.

#### 2.3.2 Mixtures of Betas

The following family of prior distributions is inspired by rousseau:09. Consider functions

$$h_{k,\ell} = \rho_{k,\ell}\Big(\int_0^1 g_{\alpha_{k,\ell},\epsilon}\,dM_{k,\ell}(\epsilon)\Big)_+, \qquad g_{\alpha,\epsilon}(x) = \frac{\Gamma(\alpha/(\epsilon(1-\epsilon)))}{\Gamma(\alpha/\epsilon)\,\Gamma(\alpha/(1-\epsilon))}\, x^{\alpha/(1-\epsilon)-1}(1-x)^{\alpha/\epsilon-1},$$

where the $M_{k,\ell}$ are bounded signed measures on $[0,1]$. In other words, the above functions are the positive parts of mixtures of Beta densities, parameterized so that $\epsilon$ is the mean parameter. The mixing random measures $M_{k,\ell}$ are allowed to be negative; the reason is that $h_{k,\ell}$ is then allowed to be null on sets with positive Lebesgue measure. The prior is then constructed in the following way. We define a prior on $h_{k,\ell}$ via a prior on $\alpha_{k,\ell}$ and on $M_{k,\ell}$. As in rousseau:09, we consider a prior on $\alpha$ absolutely continuous with respect to Lebesgue measure, with density $\pi_\alpha$ satisfying: there exist $b_1, c_1, c_3, C > 0$ such that for all $u$ large enough,

$$\pi_\alpha(c_1 u < \alpha < c_3 u) \le C e^{-b_1 u^{1/2}}. \tag{2.14}$$

There are many ways to construct discrete signed measures on $[0,1]$; for instance, writing

$$M = \sum_{j=1}^J r_j\, p_j\, \delta_{\epsilon_j}, \tag{2.15}$$

the prior on $M$ is then defined by a prior on $J$ and, conditionally on $J$,

$$r_j \overset{iid}{\sim} \mathrm{Ra}(1/2), \qquad \epsilon_j \overset{iid}{\sim} G_\epsilon, \qquad (p_1, \dots, p_J) \sim \mathcal{D}(a_1, \dots, a_J),$$

where $\mathrm{Ra}(1/2)$ denotes the Rademacher distribution, taking values $-1$ and $1$ each with probability $1/2$. Assume that $G_\epsilon$ has a positive continuous density $g_\epsilon$ on $[0,1]$ and that the Dirichlet parameters $a_j$ are uniformly bounded. We have the following corollary.
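A draw from this mixture-of-Betas prior can be sketched as follows; the hyperparameter choices (fixed $J$, a uniform stand-in for $\pi_\alpha$, unit Dirichlet weights) are illustrative assumptions, not the paper's choices:

```python
import numpy as np
from math import gamma

rng = np.random.default_rng(2)

# One draw from the mixture-of-Betas prior built on (2.15), evaluated on
# a grid. Hyperparameters (fixed J, a uniform stand-in for pi_alpha,
# unit Dirichlet weights) are illustrative assumptions.

def g(x, alpha, eps):
    """Beta density g_{alpha,eps}, parameterized so that eps is the mean."""
    a_, b_ = alpha / (1.0 - eps), alpha / eps
    const = gamma(a_ + b_) / (gamma(a_) * gamma(b_))
    return const * x ** (a_ - 1.0) * (1.0 - x) ** (b_ - 1.0)

J = 3
alpha = rng.uniform(0.5, 3.0)            # stand-in for a draw from pi_alpha
r = rng.choice([-1.0, 1.0], size=J)      # r_j: Rademacher(1/2)
eps = rng.uniform(0.05, 0.95, size=J)    # eps_j ~ G_eps (illustrative)
p = rng.dirichlet(np.ones(J))            # (p_1,...,p_J) ~ Dirichlet(1,...,1)
rho = 0.8                                # one entry of the rho matrix

x = np.linspace(0.01, 0.99, 99)
mixture = sum(r[j] * p[j] * g(x, alpha, eps[j]) for j in range(J))
h = rho * np.maximum(mixture, 0.0)       # positive part, as in the prior
print(np.round(h[:5], 3))
```

Because the Rademacher signs can make the mixture negative, the positive part produces interaction functions that vanish on sets of positive Lebesgue measure, as discussed above.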

###### Corollary 4.

Consider a prior as described above. Assume that, for all $(k,\ell)$, $h^0_{k,\ell} = (H^0_{k,\ell})_+$ for some smooth functions $H^0_{k,\ell}$. If condition (2.13) holds and if $G_\epsilon$ has density $g_\epsilon$ with respect to Lebesgue measure verifying

$$x^{A_1}(1-x)^{A_1} \lesssim g_\epsilon(x) \lesssim x^3(1-x)^3, \quad \text{for some } A_1 \ge 3,$$

then, for any $w_T \to \infty$,

 E0