In this paper we study the properties of Bayesian nonparametric procedures in the context of multivariate Hawkes processes. The aim of this paper is to give some general results on posterior concentration rates for such models and to study some families of nonparametric priors.
1.1 Hawkes processes
Hawkes processes, introduced by Hawkes (1971), are specific point processes which are extensively used to model data whose occurrences depend on previous occurrences of the same process. To describe them, we first consider a point process on . We denote by the Borel -algebra on and for any Borel set , we denote by the number of occurrences of in . For short, for any , denotes the number of occurrences in . We assume that for any , almost surely. If is the history of until , then, , the predictable intensity of at time
, which represents the probability to observe a new occurrence at timegiven previous occurrences, is defined by
where denotes an arbitrary small increment of and For the case of univariate Hawkes processes, we have
for and . We recall that the last integral means
The case of linear Hawkes processes corresponds to and for any . The parameter is the spontaneous rate and is the self-exciting function. We now assume that is a marked point process, meaning that each occurrence of is associated to a mark , see MR1950431. In this case, we can identify with a multivariate point process and for any denotes the number of occurrences of in with mark . In the sequel, we only consider linear multivariate Hawkes processes, so we assume that , the intensity of , is
where and , which is assumed to be non-negative and supported by , is the interaction function of on . Theorem 7 of BM1 shows that if the matrix , with
has a spectral radius strictly smaller than 1, then there exists a unique stationary distribution for the multivariate process with the previous dynamics and finite average intensity.
Hawkes processes have been extensively used in a wide range of applications. They are used to model earthquakes VJ-O; ogata88; MR1941459, interactions in social networks arXiv:1203.3516; icml2013_zhou13; Li:2014:LPM:2893873.2893891; arXiv:1501.00725; crane2008robust; mitchell2009hawkes; yang2013mixture, financial data embrechts2011multivariate; bacry2015hawkes; bacry2016estimation; bacry2013modelling; ait2015modeling, violence rates mohler2011self; porter2012self, genomes gusto2005fado; CSWH; MR2722456 or neuronal activities Brillinger1988; chornoboy1988maximum; okatan2005analyzing; paninski2007statistical; pillow2008spatio; HRR; reynaud2014goodness; reynaud2013inference, to name but a few.
Parametric inference for Hawkes models based on the likelihood is the most common in the literature and we refer the reader to ogata88; CSWH for instance. Non-parametric estimation has first been considered by Reynaud-Bouret and Schbath MR2722456 who proposed a procedure based on minimization of an -criterion penalized by an -penalty for univariate Hawkes processes. Their results have been extended to the multivariate setting by Hansen, Reynaud-Bouret and Rivoirard HRR where the -penalty is replaced with an -penalty. The resulting Lasso-type estimate leads to an easily implementable procedure providing sparse estimation of the structure of the underlying connectivity graph. To generalize this procedure to the high-dimensional setting, Chen, Witten and Shojaie MR3634334 proposed a simple and computationally inexpensive edge screening approach, whereas Bacry, Gaïffas and Muzy arXiv:1501.00725 combine and trace norm penalizations to take into account the low rank property of their self-excitement matrix. Very recently, to deal with non-positive interaction functions, Chen, Shojaie, Shea-Brown and Witten chen:Shoj:shea:witten combine the thinning process representation and a coupling construction to bound the dependence coefficient of the Hawkes process. Other alternatives based on spectral methods Bacry2012 or estimation through the resolution of a Wiener-Hopf system MR3480107
can also been found in the literature. These are all frequentist methods; Bayesian approaches for Hawkes models have received much less attention. To the best of our knowledge, the only contributions for the Bayesian inference are due to RasmussenMR3085883 and Blundell, Beck and Heller NIPS2012_4834 who explored parametric approaches and used MCMC to approximate the posterior distribution of the parameters.
1.2 Our contribution
In this paper, we study nonparametric posterior concentration rates when , for estimating the parameter by using realizations of the multivariate process for . Analyzing asymptotic properties in the setting where means that the observation time becomes very large hence providing a large number of observations. Note that along the paper, , the number of observed processes, is assumed to be fixed and can be viewed as a constant. Considering is a very challenging problem beyond the scope of this paper. Using the general theory of ghosal:vdv:07, we express the posterior concentration rates in terms of simple and usual quantities associated to the prior on and under mild conditions on the true parameter. Two types of posterior concentration rates are provided: the first one is in terms of the -distance on the stochastic intensity functions and the second one is in terms of the -distance on the parameter (see precise notations below). To the best of our knowledge, these are the first theoretical results on Bayesian nonparametric inference in Hawkes models. Moreover, these are the first results on -convergence rates for the interaction functions . In the frequentist literature, theoretical results are given in terms of either the -error of the stochastic intensity, as in arXiv:1501.00725 and MR3480107, or in terms of the -error on the interaction functions themselves, the latter being much more involved, as in MR2722456 and HRR. In MR2722456, the estimator is constructed using a frequentist model selection procedure with a specific family of models based on piecewise constant functions. In the multivariate setting of HRR, more generic families of approximation models are considered (wavelets of Fourier dictionaries) and then combined with a Lasso procedure, but under a somewhat restrictive assumption on the size of models that can be used to construct the estimators (see Section 5.2 of HRR). Our general results do not involve such strong conditions and therefore allow us to work with approximating families of models that are quite general. In particular, we can apply them to two families of prior models on the interaction functions : priors based on piecewise constant functions, with regular or random partitions and priors based on mixtures of Betas distributions. From the posterior concentration rates, we also deduce a frequentist convergence rate for the posterior mean, seen as a point estimator. We finally propose an MCMC algorithm to simulate from the posterior distribution for the priors constructed from piecewise constant functions and a simulation study is conducted to illustrate our results.
1.3 Overview of the paper
In Section 2, Theorem 1 first states the posterior convergence rates obtained for stochastic intensities. Theorem 2 constitues a variation of this first result. From these results, we derive -rates for the parameter (see Theorem 3) and for the posterior mean (see Corollary 1). Examples of prior models satisfying conditions of these theorems are given in Section 2.3. In Section 3, numerical results are provided.
1.4 Notations and assumptions
We denote by the true parameter and assume that the interaction functions are supported by a compact interval , with assumed to be known. Given a parameter , we denote by the spectral norm of the matrix associated with and defined in (1.2). We recall that provides an upper bound of the spectral radius of and we set
We assume that and denote by the matrix such that
For any function , we denote by the -norm of . With a slight abuse of notations, we also use for and belonging to
Finally, we consider , the following stochastic distance on :
where and denote the stochastic intensity (introduced in (1.1)) associated with and respectively. We denote by the covering number of a set by balls with respect to the metric with radius . We set for any , the mean of under
where denotes the stationary distribution associated with and is the expectation associated with . We also write if is bounded when and similarly if is bounded.
2 Main results
This section contains main results of the paper. We first provide an expression for the posterior distribution.
2.1 Posterior distribution
Using Proposition 7.3.III of MR1950431, and identifying a multivariate Hawkes process as a specific marked Hawkes process, we can write the log-likelihood function of the process observed on the interval , conditional on , as
With a slight abuse of notation, we shall also denote instead of .
Recall that we restrict ourselves to the setup where for all , has support included in for some fixed . This hypothesis is very common in the context of Hawkes processes, see HRR. Note that, in this case, the conditional distribution of observed on the interval given is equal to its conditional distribution given .
Hence, in the following, we assume that we observe the process on , but we base our inference on the log-likelihood (2.1), which is associated to the observation of on . We consider a Bayesian nonparametric approach and denote by the prior distribution on the parameter . The posterior distribution is then formally equal to
We approximate it by the following pseudo-posterior distribution, which we write
which thus corresponds to choosing .
2.2 Posterior convergence rates for and -metrics
In this section we give two results of posterior concentration rates, one in terms of the stochastic distance and another one in terms of the -distance, which constitutes the main result of this paper. We define
when is large enough. In the sequel, we take and accordingly. Note in particular that, on ,
with , when is large enough. We then have the following theorem.
Consider the multivariate Hawkes process observed on , with likelihood given by (2.1). Let be a prior distribution on Let be a positive sequence such that and
For , we consider
and assume following conditions are satisfied for large enough.
There exists and such that
There exist and such that
Then, there exist and such that
Assumptions (i), (ii) and (iii) are very common in the literature about posterior convergence rates. As expressed by Assumption (ii), some conditions are required on the prior on but not on . Except the usual concentration property of around expressed in the definition of , which is in particular satisfied if has a positive continuous density with respect to Lebesgue measure, we have no further condition on the tails of the distribution of .
As appears in the proof of Theorem 1, the term appearing in the posterior concentration rate can be dropped if is replaced by
in Assumption (i). In this case, in Assumption (ii) and does not depend on . This is used for instance in Section 2.3.1 to study random histograms priors whereas mixtures of Beta priors are controlled using the -norm.
Similarly to other general theorems on posterior concentration rates, we can consider some variants. Since the metric is stochastic, we cannot use slices in the form as in Theorem 1 of ghosal:vdv:07, however we can consider other forms of slices, using a similar idea as in Theorem 5 of ghosal:vdv:mixture:07. This is presented in the following theorem.
Consider the setting and assumptions of Theorem 1 except that assumption (iii) is replaced by the following one: There exists a sequence of sets with and such that
for some positive constant . Then, there exists such that
The posterior concentration rates of Theorems 1 and 2 are in terms of the metric on the intensity functions, which are data dependent and therefore not completely satisfying to understand concentration around the objects of interest namely . We now use Theorem 1 to provide a general result to derive a posterior concentration rate in terms of the -norm.
Assume that the prior satisfies following assumptions.
There exists such that (see the definition of ) and such that
where and .
The prior on satisfies : for all , when is large enough,
Then, for any ,
A consequence of previous theorems is that the posterior mean is converging to at the rate , which is described by the following corollary.
2.3 Examples of prior models
We thus consider priors on following the scheme
We consider absolutely continuous with respect to the Lebesgue measure on with positive and continuous density ,
a probability distribution on the set of matrices with positive entries and spectral norm, with positive density with respect to Lebesgue measures and satisfying (2.5). We now concentrate on the nonparametric part, namely the prior distribution . Then, from Theorems 1 and 3 it is enough that satisfies for each ,
for some and such that there exists with
for , and large enough. Note that from remark 1, if we have that for all
then it is enough to verify
in place of (2.8).
These conditions have been checked for a large selection of types of priors on the set of densities. We discuss here two cases: one based on random histograms, these priors make sense in particular in the context of modeling neuronal interactions and the second based on mixtures of Betas, because it leads to adaptive posterior concentration rates over a large collection of functional classes. To simplify the presentation we assume that but generalization to any is straightforward.
2.3.1 Random histogram prior
These priors are motivated by the neuronal application, where one is interested in characterizing time zones when neurons are or are not interacting (see Section 3). Random histograms have been studied quite a lot recently for density estimation, both in semi and non parametric problems. We consider two types of random histograms: regular partitions and random partitions histograms. Random histogram priors are defined by: for ,
In both cases, the prior is constructed in the following hierarchical manner:
where and are two positive constants. Denoting the -dimensional simplex, we assume that the prior on satisfies : for all , for all with for any , and all small enough, there exists such that
Many probability distributions on satisfy (2.12). For instance, if is the Dirichlet distribution with , for , and three positive constants, then (2.12) holds, see for instance castillo:rousseau:2015. Also, consider the following hierarchical prior allowing some the of ’s to be equal to 0. Set
and the indices corresponding to . Then,
Regular partition histograms correspond to for , in which case we write instead of ; while in random partition histograms we put a prior on . We now consider Hölder balls of smoothness and radius , denoted , and prove that the posterior concentration rate associated with both types of histogram priors is bounded by for , where is a constant large enough. From Remark 1, we use the version of assumption (i) based on
and need to verify (2.9). Then applying Lemma 4 of the supplementary material of castillo:rousseau:2015, we obtain for all and if is not the null function
for some and if is a constant. If then
This thus implies that for some . This result holds both for the regular grid and random grid histograms with a prior on the grid points given by with . Then condition (2.5) is verified if with and , for small enough. This condition holds for any if there exist such that when is small enough
Moreover, set for a constant, then for all ,
Therefore, (2.9) is checked. We finally obtain the following corollary.
Corollary 2 (regular partition).
To extend this result to the case of random partition histogram priors we consider the same prior on as in (2.11) and the following condition on the prior on . Writing , , we have that belongs to the -dimensional simplex and we consider a Dirichlet distribution on , with .
The proof of this corollary is given in Section 4.8. In the following section, we consider another family of priors suited for smooth functions and based on mixtures of Beta distributions.
2.3.2 Mixtures of Betas
The following family of prior distributions is inspired by rousseau:09. Consider functions
where are bounded signed measures on such that . In other words the above functions are the positive parts of mixtures of Betas distributions with parameterization so that is the mean parameter. The mixing random measures are allowed to be negative. The reason for allowing to be negative is that is then allowed to be null on sets with positive Lebesgue measure. The prior is then constructed in the following way. Writing we define a prior on via a prior on and on . In particular we assume that and . As in rousseau:09 we consider a prior on absolutely continuous with respect to Lebesgue measure and with density satisfying: there exists such that for all large enough,
There are many ways to construct discrete signed measures on , for instance, writing
the prior on is then defined by and conditionally on ,
where Ra denotes the Rademacher distribution taking values each with probability . Assume that has positive continuous density on and that there exists such that . We have the following corollary.
Consider a prior as described above. Assume that for all for some functions with . If condition (2.13) holds and if has density with respect to Lebesgue measure verifying
then, for any ,