# Tensor Kernel Recovery for Spatio-Temporal Hawkes Processes

We estimate general influence functions for spatio-temporal Hawkes processes using a tensor recovery approach: the location-dependent influence function that captures the influence of historical events is formulated as a tensor kernel. We assume a low-rank structure for the tensor kernel and cast the estimation problem as a convex optimization problem using the Fourier-transformed tensor nuclear norm (TNN). We provide theoretical performance guarantees for our approach, present an algorithm to solve the optimization problem, and demonstrate the efficiency of our estimator with numerical simulations.


## 1 Introduction

Hawkes processes, a class of self- (and mutually) exciting point processes, have gained substantial attention in machine learning and statistics due to their wide applicability in capturing complex interactions of discrete events over space, time, and possibly networks. Such problems arise in many applications, including seismology [21], criminology [26], finance [14, 22], genomics [23], and social networks [19, 25]. One advantage of Hawkes process modeling is that interactions between the history and a current event can be represented easily in the model structure, as the intensity function of a Hawkes process generally consists of two parts: a baseline intensity and a triggering effect.

A central problem in Hawkes process modeling is estimating the triggering effect through the so-called influence functions, which capture how different locations interact with each other. Estimating triggering effects with Hawkes process models has been studied in several prior works [17, 18, 13, 1, 25]. Bacry et al. [1] and Zhou et al. [25] proposed convex optimization approaches with sparse and low-rank assumptions on the interaction adjacency matrix or tensor to estimate influence in a social network. In particular, they assumed that the triggering function takes the form of a product between the interaction coefficients and fixed kernel functions that decay exponentially in continuous time.

Low-dimensional structures, such as low-rank matrices and low-rank tensors, are very common in high-dimensional data. A recent motivation for the study of low-rank matrices is the matrix completion problem [7, 6, 8, 20, 10]. One popular approach is convex relaxation with the matrix nuclear norm to estimate a low-rank matrix. There has been much effort in modeling with low-rank tensors by extending the results on low-rank matrices, including [11, 12, 2, 3, 5]. However, unlike matrices, the rank of a tensor is not uniquely defined; a tensor admits multiple notions of rank, such as the CP rank, Tucker rank, tubal rank, and multi-rank. The tubal rank and multi-rank were proposed by Kilmer et al. [16], together with an algebra for tensors and the corresponding Fourier-transformed tensor nuclear norm (TNN). In this work, we consider a low-rank structure for the tensor kernel capturing the interaction, which can be viewed as a low-rank approximation capturing the dominant modes of the influence functions, and we adopt the tubal and multi-rank for the low-rank kernel tensor.

In this paper, we present a new approach to estimating the general influence functions for spatio-temporal Hawkes processes by formulating the estimation as a tensor recovery problem. The location-dependent influence function is parameterized as a tensor kernel. We further assume a low-rank structure for the tensor kernel as a low-rank approximation to the true influence functions. We estimate the tensor kernel by maximum likelihood with a constraint on the Fourier-transformed tensor nuclear norm (TNN), which leads to a convex optimization problem. Theoretical performance guarantees are presented for the squared recovery error. We also present a computationally efficient algorithm based on the alternating direction method of multipliers (ADMM) to solve the optimization problem, and we demonstrate the efficiency of our estimation procedure with numerical simulations. Note that our approach differs from the previous works [1, 25], since we consider the influence function as a tensor kernel in a discrete-time, discrete-location set-up.

The rest of the paper is organized as follows. Section 2 presents the problem setup. Section 3 contains the main theoretical performance upper bound. Section 4 proposes an ADMM-based algorithm to solve the optimization problem. Section 5 contains the numerical study, and Section 6 concludes the paper. The proofs are deferred to the Appendix.

## 2 Spatio-temporal interaction as low-rank tensor

We first introduce some preliminaries. Assume that we have a sequence of discrete event data in the form of $\{(x_i, y_i, t_i)\}$, where $(x_i, y_i)$ denotes the location and $t_i$ denotes the time of the $i$th event. Consider a spatio-temporal point process whose events occur at time $t \in [0, T]$ and location $(x, y) \in B \subset \mathbb{R}^2$. Define a counting process $N: \mathcal{B}(B) \times \mathcal{B}([0,T]) \to \mathbb{Z}_+$, where $\mathbb{Z}_+$ is the set of nonnegative integers: $N(S \times [0, t])$ is the number of events in the region $S$ and the time window $[0, t]$, where $\mathcal{B}(B)$ and $\mathcal{B}([0,T])$ are the Borel $\sigma$-algebras of $B$ and $[0,T]$. Define $\mathcal{H}_t$ as the $\sigma$-algebra of data before time $t$, which is a filtration. In this section, we will first derive an approximation of the Hawkes process in discrete time and then formulate our problem.

### 2.1 Discrete Hawkes processes

The conditional intensity function of a point process is defined as

$$\lambda(x, y, t) := \lim_{\Delta x, \Delta y, \Delta t \to 0} \frac{\mathbb{E}\big[N([x, x+\Delta x] \times [y, y+\Delta y] \times [t, t+\Delta t]) \mid \mathcal{H}_t\big]}{\Delta x\, \Delta y\, \Delta t}. \tag{1}$$

For Hawkes processes, we can define the conditional intensity function with the following form:

$$\lambda(x, y, t) = \mu(x, y) + \int_0^t \iint_B g(x - u_1, y - u_2, t - u_3)\, N(d(u_1 \times u_2) \times du_3), \tag{2}$$

where $\mu(x, y)$ is the base intensity at location $(x, y)$ and $g$ is the kernel function. To discretize the process in both space and time, we define the “bin counts” over a discrete time horizon $k = 1, \ldots, K$:

$$Z_{ijk} = N\big([(i-1)\Delta x, i\Delta x) \times [(j-1)\Delta y, j\Delta y) \times [(k-1)\Delta t, k\Delta t)\big). \tag{3}$$
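The bin counts in Eq. (3) are just a 3-d histogram of the raw events. A minimal sketch in numpy (the function name `bin_counts` and the argument order are our own):

```python
import numpy as np

def bin_counts(events, n1, n2, K, dx, dy, dt):
    """Discretize a list of events (x, y, t) into the bin-count tensor
    Z[i, j, k] = number of events in the (i, j, k)-th space-time bin of
    Eq. (3). Events outside the grid are discarded."""
    events = np.asarray(events, dtype=float)
    edges = (np.arange(n1 + 1) * dx,   # spatial x-edges [0, dx, ..., n1*dx]
             np.arange(n2 + 1) * dy,   # spatial y-edges
             np.arange(K + 1) * dt)    # temporal edges
    Z, _ = np.histogramdd(events, bins=edges)
    return Z.astype(int)
```

Here the grid is anchored at the origin for simplicity; a real domain would shift the edges accordingly.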

Let $\Delta x$, $\Delta y$, and $\Delta t$ be small and set $\Delta := \Delta x\, \Delta y\, \Delta t$. Suppose $B$ and $[0, T]$ are bounded, so there exist $n_1, n_2 \in \mathbb{N}$ such that $B \subseteq [0, n_1 \Delta x] \times [0, n_2 \Delta y]$, and we assume $T = K \Delta t$ for some $K \in \mathbb{N}$, where $\mathbb{N}$ is the set of positive integers. Let

$$\mathcal{Z}_t := \{Z_{ijk},\ 1 \le i \le n_1,\ 1 \le j \le n_2,\ -p+1 \le k \le t\} \in \mathbb{R}^{n_1 \times n_2 \times (t+p)}, \tag{4}$$
$$\mathcal{Z}_t^q := \{Z_{ijk},\ 1 \le i \le n_1,\ 1 \le j \le n_2,\ q \le k \le t\} \in \mathbb{R}^{n_1 \times n_2 \times (t-q+1)}. \tag{5}$$

For the discrete Hawkes process, we have

$$Z_{ijk} \mid \mathcal{H}_{(k-1)} \sim \mathrm{Poisson}(\Delta \lambda_{ijk}). \tag{6}$$

We will derive the discrete form of the intensity function $\lambda$. In general, the kernel $g$ has the following form:

$$g(x, y, t) = h(x, y) f(t), \tag{7}$$

where $f$ is a non-negative monotone decreasing function, and $f(t)$ goes to zero for large $t$. A standard choice of $f$ is the class of exponential kernels, e.g., $f(t) = e^{-\beta t}$ for some $\beta > 0$. Thus, according to (1), we have the following approximation for the discretized intensity function with a memory depth $p$:

$$\begin{aligned} \mathbb{E}[Z_{ijk} \mid \mathcal{H}_{(k-1)}] &\approx \Delta \lambda\big((i-1)\Delta x, (j-1)\Delta y, (k-1)\Delta t\big) \\ &= \Delta \mu\big((i-1)\Delta x, (j-1)\Delta y\big) + \Delta \sum_{k'=k-p}^{k-1} \sum_{i'=1}^{n_1} \sum_{j'=1}^{n_2} g\big((i-i')\Delta x, (j-j')\Delta y, (k-k')\Delta t\big) Z_{i'j'k'} \\ &:= \Delta \Big( \mu_{ij} + \sum_{k'=k-p}^{k-1} \sum_{i'=1}^{n_1} \sum_{j'=1}^{n_2} G_{i-i'+n_1,\, j-j'+n_2,\, k-k'} Z_{i'j'k'} \Big), \end{aligned} \tag{8}$$

which can be viewed as a vector autoregressive model with lag $p$. Thus, the conditional intensity function for the discrete Hawkes process can be defined as follows:

$$\lambda_{ijk}(\mu, \mathcal{G}) := \lambda\big((i-1)\Delta x, (j-1)\Delta y, (k-1)\Delta t\big) \tag{9}$$
$$= \mu_{ij} + \sum_{k'=k-p}^{k-1} \sum_{i'=1}^{n_1} \sum_{j'=1}^{n_2} G_{i-i'+n_1,\, j-j'+n_2,\, k-k'} Z_{i'j'k'} \tag{10}$$

for $1 \le i \le n_1$, $1 \le j \le n_2$, $1 \le k \le K$, where $\mu_{ij}$ and $G_{ijk}$ are the discretized versions of the base intensity $\mu$ and the kernel $g$, respectively. Our goal is to estimate the true parameters $\mu$ and $\mathcal{G}$.
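The autoregressive form of Eq. (10) can be evaluated directly. A minimal sketch, assuming 0-based array indexing (so the paper's 1-based offset $i - i' + n_1$ becomes `i - ip + n1 - 1`) and a kernel tensor of shape $(2n_1 - 1) \times (2n_2 - 1) \times p$:

```python
import numpy as np

def intensity(mu, G, Z, k, p):
    """Evaluate lam[i, j] = lambda_{ijk} of Eq. (10) for one time bin k
    (0-based), given the base-rate matrix mu (n1 x n2), the kernel tensor
    G ((2*n1-1) x (2*n2-1) x p), and the bin-count tensor Z (n1 x n2 x T)."""
    n1, n2 = mu.shape
    lam = mu.astype(float).copy()
    for kp in range(max(0, k - p), k):       # memory window: k-p <= k' < k
        for ip in range(n1):
            for jp in range(n2):
                if Z[ip, jp, kp] == 0:       # skip empty bins
                    continue
                for i in range(n1):
                    for j in range(n2):
                        lam[i, j] += G[i - ip + n1 - 1,
                                       j - jp + n2 - 1,
                                       k - kp - 1] * Z[ip, jp, kp]
    return lam
```

The quadruple loop is for clarity only; the sum over $(i', j')$ is a 2-d correlation, so an FFT-based evaluation would be used at scale.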

Note that this approach is also considered in [17] to estimate Hawkes processes in discrete time. The difference is that we deal with a more complex setting in higher dimensions (two dimensions in location and one in time), and we discretize the location space and the time axis simultaneously, leading to the tensor $\mathcal{G}$. Our interpretation of the discretized kernel function enables the presence of space-time interactions. Most importantly, we will impose constraints on the rank of the tensor and entrywise bounds on the estimators. This strategy yields a better estimate of the Hawkes process when its coefficients are sparse and low-rank, which is a common case. Thus, unlike [17], which employed a projection method on the approximated time series model INAR($p$), we construct a convex optimization problem based on the maximum likelihood formulation and our corresponding regularization.

To estimate the base intensity matrix $\mu$ and the underlying tensor $\mathcal{G}$, we make the following assumptions on the candidate set. First, we assume that each entry $\mu_{ij}$ of $\mu$ and $G_{ijk}$ of $\mathcal{G}$ has an upper bound, say $b_1$ and $b_2$, respectively. This assumption is common in the real world, and it is necessary to ensure that our problem is well-posed. Second, we assume that there exist nonnegative constants $a_1$ and $a_2$ such that $\mu_{ij} \ge a_1$ and $G_{ijk} \ge a_2$ for all $i, j, k$. This condition supports the analysis in the following theoretical part on bounding the estimation errors. Finally, we assume that the tensor $\mathcal{G}$ has low transformed multi-rank, with $r_\ell = \operatorname{rank}(\widetilde{G}^{(\ell)})$, $\widetilde{G}^{(\ell)}$ being the $\ell$th frontal face of $\widetilde{\mathcal{G}}$, and we define $\gamma := \sum_\ell r_\ell$. This assumption is based on the fact that high correlations exist within locations and time. A similar approach has been applied to analyze multivariate spatio-temporal data in [2], where a general low-rank tensor completion model was employed with no additional matrix parameter.

### 2.2 Low-rank tensor regularization

In this section, we introduce the tensor nuclear norm (TNN), which we use to guarantee the low-rankness of $\mathcal{G}$. Before moving on, we first summarize the notation. For a matrix $A$, let $\|A\|_{\mathrm{spec}}$ be the spectral norm (the largest singular value), $\|A\|_*$ the nuclear norm (the sum of the singular values), $\|A\|_F$ the Frobenius norm, and $\|A\|_\infty$ the infinity norm. For a 3-way tensor $\mathcal{G} \in \mathbb{R}^{N_1 \times N_2 \times N_3}$, we denote the $i$th frontal face of $\mathcal{G}$ by $G^{(i)}$. Let $\widetilde{\mathcal{G}}$ be the tensor obtained by applying the Fast Fourier Transform (FFT) to the mode-3 fibers of $\mathcal{G}$. The tensor Frobenius norm is $\|\mathcal{G}\|_F = (\sum_{i,j,k} G_{ijk}^2)^{1/2}$, and the tensor infinity norm is $\|\mathcal{G}\|_\infty = \max_{i,j,k} |G_{ijk}|$. We introduce three operators for a tensor: fold, unfold, and bcirc.

$$\operatorname{unfold}(\mathcal{G}) = \begin{pmatrix} G^{(1)} \\ G^{(2)} \\ \vdots \\ G^{(N_3)} \end{pmatrix}, \qquad \operatorname{fold}(\operatorname{unfold}(\mathcal{G})) = \mathcal{G}, \tag{11, 12}$$
$$\operatorname{bcirc}(\mathcal{G}) = \begin{pmatrix} G^{(1)} & G^{(N_3)} & G^{(N_3-1)} & \cdots & G^{(2)} \\ G^{(2)} & G^{(1)} & G^{(N_3)} & \cdots & G^{(3)} \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ G^{(N_3)} & G^{(N_3-1)} & G^{(N_3-2)} & \cdots & G^{(1)} \end{pmatrix}. \tag{13}$$

For two tensors $\mathcal{G}_1$ and $\mathcal{G}_2$, the t-product is defined as

$$\mathcal{G}_1 * \mathcal{G}_2 = \operatorname{fold}\big(\operatorname{bcirc}(\mathcal{G}_1)\, \operatorname{unfold}(\mathcal{G}_2)\big). \tag{14}$$
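The operators in Eqs. (11)–(14) are mechanical to implement. A minimal numpy sketch (function names are our own); the last assertion in the usage below relies on the standard fact that the FFT along mode 3 turns the t-product into slice-wise matrix products:

```python
import numpy as np

def unfold(G):
    # Stack the frontal slices G[:, :, 0], ..., G[:, :, N3-1] vertically (Eq. 11).
    return np.concatenate([G[:, :, k] for k in range(G.shape[2])], axis=0)

def fold(M, N1, N3):
    # Inverse of unfold: split M back into N3 frontal slices (Eq. 12).
    return np.stack([M[k * N1:(k + 1) * N1, :] for k in range(N3)], axis=2)

def bcirc(G):
    # Block-circulant matrix whose first block column is unfold(G) (Eq. 13).
    N1, N2, N3 = G.shape
    B = np.zeros((N1 * N3, N2 * N3))
    for r in range(N3):
        for c in range(N3):
            B[r * N1:(r + 1) * N1, c * N2:(c + 1) * N2] = G[:, :, (r - c) % N3]
    return B

def t_product(G1, G2):
    # t-product G1 * G2 = fold(bcirc(G1) @ unfold(G2)) (Eq. 14).
    N1, _, N3 = G1.shape
    return fold(bcirc(G1) @ unfold(G2), N1, N3)
```

A quick sanity check of the Fourier-domain identity: if `C = t_product(A, B)`, then each frontal slice of `np.fft.fft(C, axis=2)` equals the product of the corresponding slices of the FFTs of `A` and `B`.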

Note that Kilmer et al. [16] proposed a singular value decomposition (SVD) method for three-way tensors. Based on this tensor SVD, the TNN was proposed by Semerci et al. [24]. We review some background material on the tensor SVD, mostly following [16]. Consider a tensor $\mathcal{G} \in \mathbb{R}^{N_1 \times N_2 \times N_3}$; the following lemma describes the block diagonalization property of block circulant matrices.

###### Lemma 1.

For $\mathcal{G} \in \mathbb{R}^{N_1 \times N_2 \times N_3}$, let $\widetilde{\mathcal{G}}$ be the FFT of $\mathcal{G}$ along the third mode. We have

$$(F_{N_3} \otimes I_{N_1})\, \operatorname{bcirc}(\mathcal{G})\, (F^*_{N_3} \otimes I_{N_2}) = \begin{pmatrix} \widetilde{G}^{(1)} & & & \\ & \widetilde{G}^{(2)} & & \\ & & \ddots & \\ & & & \widetilde{G}^{(N_3)} \end{pmatrix},$$

where $\otimes$ is the Kronecker product, $I_{N_1}$ and $I_{N_2}$ are the identity matrices in $\mathbb{R}^{N_1 \times N_1}$ and $\mathbb{R}^{N_2 \times N_2}$, respectively, $F_{N_3}$ is the normalized discrete Fourier transform matrix, $F^*_{N_3}$ denotes the conjugate transpose of $F_{N_3}$, and the $\widetilde{G}^{(i)}$s are the frontal faces of the tensor $\widetilde{\mathcal{G}}$.

The following theorem provides the tensor-SVD for three-way tensors:

###### Theorem 2.

[16, Theorem 4.1] Any tensor $\mathcal{G} \in \mathbb{R}^{N_1 \times N_2 \times N_3}$ can be factored as

$$\mathcal{G} = \mathcal{U} * \mathcal{S} * \mathcal{V}^\top = \sum_{i=1}^{N_1 \wedge N_2} \mathcal{U}(:, i, :) * \mathcal{S}(i, i, :) * \mathcal{V}(:, i, :)^\top, \tag{15}$$

where $\mathcal{U} \in \mathbb{R}^{N_1 \times N_1 \times N_3}$ and $\mathcal{V} \in \mathbb{R}^{N_2 \times N_2 \times N_3}$ are orthogonal tensors, $\mathcal{S} \in \mathbb{R}^{N_1 \times N_2 \times N_3}$ is f-diagonal, and $\mathcal{U}(:, i, :)$, $\mathcal{S}(i, i, :)$, and $\mathcal{V}(:, i, :)$ extract the corresponding tensor slices.

From the construction of the proof of Theorem 2 and by Lemma 1, we have the following relationship in the Fourier domain:

$$\begin{pmatrix} \widetilde{G}^{(1)} & & \\ & \ddots & \\ & & \widetilde{G}^{(N_3)} \end{pmatrix} = \begin{pmatrix} \widetilde{U}^{(1)} & & \\ & \ddots & \\ & & \widetilde{U}^{(N_3)} \end{pmatrix} \begin{pmatrix} \widetilde{S}^{(1)} & & \\ & \ddots & \\ & & \widetilde{S}^{(N_3)} \end{pmatrix} \begin{pmatrix} \widetilde{V}^{(1)} & & \\ & \ddots & \\ & & \widetilde{V}^{(N_3)} \end{pmatrix}^{*}, \tag{16, 17}$$

where $\widetilde{G}^{(i)} = \widetilde{U}^{(i)} \widetilde{S}^{(i)} (\widetilde{V}^{(i)})^*$ is the SVD. With the above properties, the TNN can be defined as follows.

###### Definition 1.

[11, Section 4.2] The tensor nuclear norm for $\mathcal{G} \in \mathbb{R}^{N_1 \times N_2 \times N_3}$ is defined as

$$\|\mathcal{G}\|_{\mathrm{TNN}} = \sum_{i=1}^{N_1 \wedge N_2} \sum_{j=1}^{N_3} \widetilde{S}(i, i, j) = \|\operatorname{bcirc}(\mathcal{G})\|_*. \tag{18}$$

Note that the dual norm of the tensor nuclear norm is the tensor spectral norm $\|\mathcal{G}\|_{\mathrm{spec}} := \|\operatorname{bcirc}(\mathcal{G})\|_{\mathrm{spec}}$.
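Definition 1 suggests a direct way to compute the TNN without forming the large block-circulant matrix: sum the singular values of the frontal slices in the Fourier domain. A minimal sketch (the function name `tnn` is our own):

```python
import numpy as np

def tnn(G):
    """Tensor nuclear norm of Definition 1, computed in the Fourier domain:
    the sum of the singular values of every frontal slice of fft(G, axis=2).
    By the block diagonalization of Lemma 1, this equals the matrix nuclear
    norm of bcirc(G)."""
    Gf = np.fft.fft(G, axis=2)
    return sum(np.linalg.svd(Gf[:, :, k], compute_uv=False).sum()
               for k in range(G.shape[2]))
```

This costs $N_3$ SVDs of size $N_1 \times N_2$ instead of one SVD of size $N_1 N_3 \times N_2 N_3$.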

### 2.3 Maximum likelihood estimation

By our assumptions on the low-rank tensor $\mathcal{G}$, we have

$$\|\mathcal{G}\|_{\mathrm{TNN}} = \|\operatorname{bcirc}(\mathcal{G})\|_* = \|\operatorname{blockdiag}(\widetilde{\mathcal{G}})\|_* \le \sqrt{\gamma}\, \|\operatorname{blockdiag}(\widetilde{\mathcal{G}})\|_F \tag{19}$$
$$= \sqrt{\gamma}\, \|\operatorname{blockdiag}(\mathcal{G})\|_F \le b_2 \sqrt{\gamma (2n_1+1)(2n_2+1) p}. \tag{20}$$

Accordingly, the candidate set for the true value is defined as:

$$\mathcal{D} := \Big\{ (\mu, \mathcal{G}) \;\Big|\; a_1 \le \mu_{ij} \le b_1,\ a_2 \le G_{ijk} \le b_2,\ \|\mathcal{G}\|_{\mathrm{TNN}} \le b_2 \sqrt{\gamma (2n_1+1)(2n_2+1) p} \Big\}. \tag{21, 22}$$

We consider a formulation that maximizes the log-likelihood of the optimization variables $\mu$ and $\mathcal{G}$ given our observations $\{Z_{ijk}\}$. The negative log-likelihood function (up to an additive constant) is given by

$$F(\mu, \mathcal{G}) := \sum_{k=1}^{K} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \Big( \Delta \lambda_{ijk}(\mu, \mathcal{G}) - Z_{ijk} \log\big(\Delta \lambda_{ijk}(\mu, \mathcal{G})\big) \Big). \tag{23}$$
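Given the intensity tensor, the objective in Eq. (23) is a one-liner. A minimal sketch, assuming the intensities have already been assembled into an array `lam` of the same shape as `Z`:

```python
import numpy as np

def neg_log_likelihood(lam, Z, delta):
    """Objective F of Eq. (23): the Poisson negative log-likelihood (up to a
    constant) of the bin counts Z given the intensities lam = lambda_{ijk},
    both of shape (n1, n2, K), with bin volume delta = dx * dy * dt."""
    rate = delta * lam
    return float(np.sum(rate - Z * np.log(rate)))
```

Note that the entrywise lower bounds $a_1, a_2 \ge 0$ in the candidate set keep `rate` strictly positive on the feasible region whenever $a_1 > 0$, so the logarithm is well defined.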

Based on the candidate set defined in the previous section, an estimator can be obtained by solving the following convex optimization problem.

$$(\hat{\mu}, \hat{\mathcal{G}}) = \operatorname*{argmin}_{(\mu, \mathcal{G}) \in \mathcal{D}} F(\mu, \mathcal{G}). \tag{24}$$
###### Remark 1.

Note that the convex optimization problem (24), constrained to a candidate set, can also be formulated as a regularized maximum likelihood problem. Indeed, there exists a constant $\tau > 0$ such that problem (24) is equivalent to

$$(\hat{\mu}, \hat{\mathcal{G}}) = \operatorname*{argmin} \big\{ F(\mu, \mathcal{G}) + \tau \|\mathcal{G}\|_{\mathrm{TNN}} \big\}$$

with the natural constraints upon the entries:

$$a_1 \le \mu_{ij} \le b_1, \qquad a_2 \le G_{ijk} \le b_2$$

for all $i, j, k$. This follows from duality theory in optimization (see [4] for more details). In the algorithm part (Section 4), we use the regularized form.

## 3 Theoretical guarantee

We present an upper bound for the sum of squared errors of the two estimators, which is defined by

$$R[(\mu, \mathcal{G}) \,\|\, (\hat{\mu}, \hat{\mathcal{G}})] := \|\mu - \hat{\mu}\|_F^2 + \|\mathcal{G} - \hat{\mathcal{G}}\|_F^2, \tag{25}$$

where $\hat{\mu}$ and $\hat{\mathcal{G}}$ are the optimal solutions to (24).

Let us first define the “condition number” as in [15] to describe our theoretical guarantee.

###### Definition 2.

Given a matrix $A \succeq 0$ and $p \ge 1$, the condition number $\delta_p[A]$ is defined by

$$\delta_p[A] := \max\{\delta \ge 0 : g^\top A g \ge \delta \|g\|_p^2,\ \forall g \in \mathbb{R}^d\},$$

where $A \succeq 0$ ($A \succ 0$) denotes that $A$ is a positive semidefinite matrix (a positive definite matrix, respectively). Note that $\delta_p[A] > 0$ when $A \succ 0$.

Now, we present our main theorem.

###### Theorem 3.

(Estimation error driven by data) Assume that $\hat{\mu}$ and $\hat{\mathcal{G}}$ are the optimal solution to (24). Let

$$\underline{J} = a_1 + a_2 \min_k \big\{\|\mathcal{Z}^{k-1}_{k-p}\|_1\big\}, \qquad \bar{J} = b_1 + \sqrt{\gamma (2n_1 - 1)(2n_2 - 1) p}\, \max_k \big\{\|\mathcal{Z}^{k-1}_{k-p}\|_{\mathrm{spec}}\big\}, \qquad T = \frac{1}{8 \underline{J}} (\bar{J} - \underline{J})^2.$$

$A[\cdot]$ is a mapping defined in the Appendix, where $d$ is the number of unknown parameters. Then, for every $\alpha_1 > 0$, $\alpha_2 > 0$, it holds that

$$\begin{aligned} R[(\mu, \mathcal{G}) \,\|\, (\hat{\mu}, \hat{\mathcal{G}})] \le\; & \frac{8 \bar{J} T n_1 n_2 |\ln(\Delta \bar{J})|}{K (1 - e^{-T}) \Delta\, \delta_2[A[\mathcal{Z}_K]]} \Bigg[ K \ln^2\!\Big(\frac{\alpha_1}{2 n_1 n_2 K}\Big) \ln\frac{2 n_1 n_2}{\alpha_2} + \Delta \bar{J} K \ln\frac{2 n_1 n_2 K}{\alpha_1} \ln\frac{2 n_1 n_2}{\alpha_2} \\ & + K \ln\frac{2 n_1 n_2 K}{\alpha_1} \ln\frac{2 n_1 n_2}{\alpha_2} \sqrt{\ln\frac{\alpha_1}{2 n_1 n_2 K} \Big( \ln\frac{\alpha_1}{2 n_1 n_2 K} - 2 \Delta \bar{J} \Big)} \Bigg]^{1/2}, \end{aligned} \tag{26–28}$$

with probability at least $1 - \alpha_1 - \alpha_2$.

###### Remark 2.

If there exist constants $c_1 > b_1$ and $c_2 > 0$ such that:

$$\max_k \big\{\|\mathcal{Z}^{k-1}_{k-p}\|_{\mathrm{spec}}\big\} \le (c_1 - b_1)\big/\sqrt{\gamma (2n_1 - 1)(2n_2 - 1) p},$$

and $\delta_2[A[\mathcal{Z}_K]] \ge c_2$, we derive the following nonrandom upper bound

$$R[(\mu, \mathcal{G}) \,\|\, (\hat{\mu}, \hat{\mathcal{G}})] \le \frac{c_1 n_1 n_2}{K \Delta c_2} \Big( K^2 \ln\frac{2 n_1 n_2}{\alpha_2} \Big)^{1/2} \cdot \frac{(c_1 - a_1)^2 \cdot \max\{|\ln(\Delta b_1)|, |\ln(\Delta c_1)|\}}{a_1 \big[ 1 - e^{-(c_1 - a_1)^2/(8 a_1)} \big]}. \tag{29}$$
###### Remark 3.

From Remark 2, we observe that, for given data $\mathcal{Z}_K$, the upper bound can be regarded as an increasing function of $c_1$. More precisely, when the data is fixed, $c_1$ grows with the upper bound on the tensor nuclear norm $\|\mathcal{G}\|_{\mathrm{TNN}}$. This implies that the upper bound of the estimation error grows with that of the tensor nuclear norm, a characteristic we can expect from low-rank tensor recovery. We note that this is also demonstrated in the experimental results in Section 5.

###### Remark 4.

We observe that, by fixing $n_1$ and $n_2$, the upper bound (26) tends to $0$ as $K \to \infty$ at the rate of $K^{-1/2}$ up to logarithmic factors.

The proof of Theorem 3 is presented in the Appendix. In the proof, the Kullback-Leibler (KL) divergence and the Hellinger distance are used between two Poisson distributions. For any two Poisson means $p$ and $q$, the KL divergence is defined as

$$D(p \,\|\, q) := p \log(p/q) - (p - q), \tag{31}$$

and the Hellinger distance as

$$H^2(p \,\|\, q) := 2 - 2 \exp\Big( -\frac{1}{2} \big(\sqrt{p} - \sqrt{q}\big)^2 \Big). \tag{32}$$
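Both quantities in Eqs. (31)–(32) are elementary functions of the two Poisson means. A minimal sketch (function names are our own):

```python
import numpy as np

def poisson_kl(p, q):
    # KL divergence between Poisson(p) and Poisson(q), Eq. (31).
    return p * np.log(p / q) - (p - q)

def poisson_hellinger_sq(p, q):
    # Squared Hellinger distance between Poisson(p) and Poisson(q), Eq. (32).
    return 2.0 - 2.0 * np.exp(-0.5 * (np.sqrt(p) - np.sqrt(q)) ** 2)
```

Both vanish when $p = q$, are positive otherwise, and the squared Hellinger distance is bounded by 2, which is what makes it convenient for the lower-bound argument.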

Then, a lower bound is derived for the estimation error using the Hellinger distance and Lemma 8 in [9]. Furthermore, we establish an upper bound on the sum of the KL divergences using the Azuma-Hoeffding inequality. By combining the lower and upper bounds, we obtain the upper bound for the estimation error.

Corollary 1 follows immediately from Theorem 3. In particular, it provides a data-driven upper bound for the sum of the KL divergences between the estimated and the true intensity functions.

###### Corollary 1.

Assume $\hat{\mu}$ and $\hat{\mathcal{G}}$ are the optimal solution to (24). With the notation defined in Theorem 3, for every $\alpha_1 > 0$, $\alpha_2 > 0$, it holds that

$$\begin{aligned} \frac{1}{n_1 n_2 K} \sum_{k=1}^{K} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} & D\big(\lambda_{ijk}(\mu, \mathcal{G}) \,\|\, \lambda_{ijk}(\hat{\mu}, \hat{\mathcal{G}})\big) \\ \le\; & \frac{2 |\log(\bar{J})|}{K \Delta} \Bigg[ K \ln^2\!\Big(\frac{\alpha_1}{2 n_1 n_2 K}\Big) \ln\frac{2 n_1 n_2}{\alpha_2} + \Delta \bar{J} K \ln\frac{2 n_1 n_2 K}{\alpha_1} \ln\frac{2 n_1 n_2}{\alpha_2} \\ & + K \ln\frac{2 n_1 n_2 K}{\alpha_1} \ln\frac{2 n_1 n_2}{\alpha_2} \Big[ \ln\frac{\alpha_1}{2 n_1 n_2 K} \Big( \ln\frac{\alpha_1}{2 n_1 n_2 K} - 2 \Delta \bar{J} \Big) \Big]^{1/2} \Bigg]^{1/2}, \end{aligned} \tag{33–35}$$

with probability at least $1 - \alpha_1 - \alpha_2$.

## 4 Algorithm

For the proposed convex optimization problem (24), we apply the alternating direction method of multipliers (ADMM) and the majorization-minimization (MM) algorithms. Based on the ADM4 algorithm proposed in [25], we design our algorithm for the problem (24). First, define the closed and convex sets:

$$\Gamma_1 := \{\mu \mid a_1 \le \mu_{ij} \le b_1 \ \ \forall (i,j) \in \llbracket n_1 \rrbracket \times \llbracket n_2 \rrbracket\}, \tag{36}$$
$$\Gamma_2 := \{\mathcal{G} \mid a_2 \le G_{ijk} \le b_2 \ \ \forall (i,j,k) \in \llbracket n_1 \rrbracket \times \llbracket n_2 \rrbracket \times \llbracket K \rrbracket\}, \tag{37}$$

where $\llbracket n \rrbracket := \{1, 2, \ldots, n\}$. Then, the problem (24) can be written as

$$\min\ F(\mu, \mathcal{G}) + \tau \|\mathcal{G}\|_{\mathrm{TNN}} \qquad \text{subject to} \quad \mu \in \Gamma_1,\ \mathcal{G} \in \Gamma_2. \tag{38, 39}$$

By introducing the auxiliary variables $\mathcal{R}$, $\mathcal{H}$, and $m$, (38) can be equivalently expressed as

$$\min\ F(\mu, \mathcal{G}) + \tau \|\mathcal{R}\|_{\mathrm{TNN}} \qquad \text{subject to} \quad m \in \Gamma_1,\ \mathcal{H} \in \Gamma_2,\ \mu = m,\ \mathcal{G} = \mathcal{H},\ \mathcal{G} = \mathcal{R}. \tag{40–42}$$

Consider the following augmented Lagrangian of (40) to apply the ADMM:

$$\begin{aligned} L_\rho(\mathcal{G}, \mu, \mathcal{R}, \mathcal{H}, m, \mathcal{Y}_1, \mathcal{Y}_2, Y_3) := \; & F(\mu, \mathcal{G}) + \tau \|\mathcal{R}\|_{\mathrm{TNN}} + \psi_{\Gamma_1}(m) + \psi_{\Gamma_2}(\mathcal{H}) \\ & + \rho \langle \mathcal{Y}_1, \mathcal{G} - \mathcal{R} \rangle + \rho \langle \mathcal{Y}_2, \mathcal{G} - \mathcal{H} \rangle + \rho \langle Y_3, \mu - m \rangle \\ & + \frac{\rho}{2} \|\mathcal{G} - \mathcal{R}\|_F^2 + \frac{\rho}{2} \|\mathcal{G} - \mathcal{H}\|_F^2 + \frac{\rho}{2} \|\mu - m\|_F^2, \end{aligned} \tag{43–45}$$

where $\mathcal{Y}_1$, $\mathcal{Y}_2$, and $Y_3$ are the dual variables and $\rho > 0$ is a penalty parameter. Here, $\psi_{\Gamma_1}$ and $\psi_{\Gamma_2}$ are the indicator functions defined as

$$\psi_{\Gamma_1}(m) := \begin{cases} 0 & \text{if } m \in \Gamma_1, \\ +\infty & \text{otherwise}, \end{cases} \qquad \psi_{\Gamma_2}(\mathcal{H}) := \begin{cases} 0 & \text{if } \mathcal{H} \in \Gamma_2, \\ +\infty & \text{otherwise}. \end{cases} \tag{46}$$

The ADMM for solving the above augmented Lagrangian problem requires the following steps:

$$\begin{aligned} (\mathcal{G}^{t+1}, \mu^{t+1}) &= \operatorname*{argmin}_{\mu, \mathcal{G}}\ L_\rho(\mathcal{G}, \mu, \mathcal{R}^t, \mathcal{H}^t, m^t, \mathcal{Y}_1^t, \mathcal{Y}_2^t, Y_3^t), && (47) \\ \mathcal{R}^{t+1} &= \operatorname*{argmin}_{\mathcal{R}}\ L_\rho(\mathcal{G}^{t+1}, \mu^{t+1}, \mathcal{R}, \mathcal{H}^t, m^t, \mathcal{Y}_1^t, \mathcal{Y}_2^t, Y_3^t), && (48) \\ \mathcal{H}^{t+1} &= \operatorname*{argmin}_{\mathcal{H}}\ L_\rho(\mathcal{G}^{t+1}, \mu^{t+1}, \mathcal{R}^{t+1}, \mathcal{H}, m^t, \mathcal{Y}_1^t, \mathcal{Y}_2^t, Y_3^t), && (49) \\ m^{t+1} &= \operatorname*{argmin}_{m}\ L_\rho(\mathcal{G}^{t+1}, \mu^{t+1}, \mathcal{R}^{t+1}, \mathcal{H}^{t+1}, m, \mathcal{Y}_1^t, \mathcal{Y}_2^t, Y_3^t), && (50) \\ \mathcal{Y}_1^{t+1} &= \mathcal{Y}_1^t + \rho(\mathcal{G}^{t+1} - \mathcal{R}^{t+1}), && (51) \\ \mathcal{Y}_2^{t+1} &= \mathcal{Y}_2^t + \rho(\mathcal{G}^{t+1} - \mathcal{H}^{t+1}), && (52) \\ Y_3^{t+1} &= Y_3^t + \rho(\mu^{t+1} - m^{t+1}). && (53) \end{aligned}$$
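The $\mathcal{R}$-update (48) is the proximal operator of the TNN, which reduces to singular value soft-thresholding on each frontal slice in the Fourier domain. A minimal sketch, assuming the TNN scaling of Definition 1 (with $\|\mathcal{G}\|_{\mathrm{TNN}} = \|\operatorname{bcirc}(\mathcal{G})\|_*$, the per-slice threshold for $\min_\mathcal{R} \tau \|\mathcal{R}\|_{\mathrm{TNN}} + \tfrac{\rho}{2}\|\mathcal{R} - \mathcal{W}\|_F^2$ works out to $\tau N_3 / \rho$; under a TNN definition with a $1/N_3$ factor it would be $\tau / \rho$):

```python
import numpy as np

def prox_tnn(W, thresh):
    """Singular value soft-thresholding of each frontal slice of W in the
    Fourier domain -- the proximal operator used for the R-update (48).
    `thresh` is the per-slice threshold (see the lead-in for its scaling)."""
    Wf = np.fft.fft(W, axis=2)
    Rf = np.empty_like(Wf)
    for k in range(W.shape[2]):
        U, s, Vh = np.linalg.svd(Wf[:, :, k], full_matrices=False)
        Rf[:, :, k] = (U * np.maximum(s - thresh, 0.0)) @ Vh   # shrink columns of U
    return np.real(np.fft.ifft(Rf, axis=2))
```

In the ADMM loop, `W` would be $\mathcal{G}^{t+1} + \mathcal{Y}_1^t$; a threshold of zero leaves the input unchanged and a very large threshold returns the zero tensor.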

We start by deriving the updating step for $\mu$ and $\mathcal{G}$. Note that for step (47), the following optimization problem is considered:

$$\begin{aligned} \min_{\mu, \mathcal{G}}\ g(\mu, \mathcal{G}) := \; & \sum_{k'=1}^{K} \sum_{i'=1}^{n_1} \sum_{j'=1}^{n_2} \Big( \Delta \lambda_{i'j'k'}(\mu, \mathcal{G}) - Z_{i'j'k'} \log\big(\Delta \lambda_{i'j'k'}(\mu, \mathcal{G})\big) \Big) \\ & + \frac{\rho}{2} \|\mathcal{G} - \mathcal{R}^t + \mathcal{Y}_1^t\|_F^2 + \frac{\rho}{2} \|\mathcal{G} - \mathcal{H}^t + \mathcal{Y}_2^t\|_F^2 + \frac{\rho}{2} \|\mu - m^t + Y_3^t\|_F^2. \end{aligned} \tag{54, 55}$$

Since no closed-form solution exists, we apply the MM algorithm as in [25]. For given estimates $(\mu^{(q)}, \mathcal{G}^{(q)})$, let $Q(\mu, \mathcal{G}; \mu^{(q)}, \mathcal{G}^{(q)})$ be a surrogate function such that

$$g(\mu, \mathcal{G}) \le Q(\mu, \mathcal{G}; \mu^{(q)}, \mathcal{G}^{(q)}), \tag{56}$$
$$g(\mu^{(q)}, \mathcal{G}^{(q)}) = Q(\mu^{(q)}, \mathcal{G}^{(q)}; \mu^{(q)}, \mathcal{G}^{(q)}), \tag{57}$$

where $\mu^{(q)}$ and $\mathcal{G}^{(q)}$ are the estimates of $\mu$ and $\mathcal{G}$ at the $q$th inner iteration. Then, we can obtain an optimal solution of the convex problem (54) by the iterative procedure:

$$\mu^{(q+1)}, \mathcal{G}^{(q+1)} = \operatorname*{argmin}_{\mu, \mathcal{G}}\ Q(\mu, \mathcal{G}; \mu^{(q)}, \mathcal{G}^{(q)}), \tag{58}$$

where we use the index sets

$$\Omega = \{k \mid Z_{ijk} \ne 0 \text{ for some } i \text{ and } j\} \quad \text{and} \quad l(k) = \{(i,j) \mid Z_{ijk} \ne 0\}. \tag{59}$$

Define $Q$ satisfying (56) and (57) as follows:

$$\begin{aligned} Q(\mathcal{G}, \mu; \mathcal{G}^{(q)}, \mu^{(q)}) = \; & -\sum_{k' \in \Omega} \sum_{(i',j') \in l(k')} \Bigg[ Z_{i'j'k'} \log \Delta \\ & + Z_{i'j'k'} \Bigg( p_{i'j'k'} \log\frac{\mu_{i'j'}}{p_{i'j'k'}} + \sum_{k=k'-p}^{k'-1} \sum_{(i,j) \in l(k)} p_{ijk,\, i'j'k'} \log\frac{G_{i'-i+n_1,\, j'-j+n_2,\, k'-k}\, Z_{ijk}}{p_{ijk,\, i'j'k'}} \Bigg) \Bigg] \\ & + K \Delta \sum_{i'=1}^{n_1} \sum_{j'=1}^{n_2} \mu_{i'j'} + \sum_{k'=1}^{K} \sum_{i'=1}^{n_1} \sum_{j'=1}^{n_2} \Delta \Bigg( \sum_{k=k'-p}^{k'-1} \sum_{(i,j) \in l(k)} G_{i'-i+n_1,\, j'-j+n_2,\, k'-k}\, Z_{ijk} \Bigg), \end{aligned} \tag{60, 61}$$