Hawkes processes, a type of self- (and mutual) exciting point processes, have gained substantial attention in machine learning and statistics due to its wide applicability in capturing complex interactions of discrete events over space, time, and possibly networks. Such problem arises from many applications such as seismology, criminology , finance [14, 22], genomics  and social network [19, 25]. One advantage of Hawkes process modeling is that interactions between the history and a current event can be represented in the structure easily, as the Hawkes process, in general, has an intensity function consisting of two parts, a baseline intensity, and a triggering effect.
A central problem in Hawkes process modeling is to estimating the triggering effect through the so-called influence functions, which captures how different locations interact with each other. Estimating the triggering effects with Hawkes process models has been conducted in several prior works [17, 18, 13, 1, 25]. Bacry et al. in  and Zhou et al. in  proposed a convex optimization approach with sparse and low-rank assumptions on the interaction adjacency matrix or tensor to estimate an influence in a social network. In particular, they assumed the triggering function as a form of a product between the interaction coefficients and the fixed kernel functions that decay exponentially with continuous time.
Low-dimensional structures are very common in high-dimensional data, such as low-rank matrices and low-rank tensors. A recent motivation for studying of low-rank matrices is due to the matrix completion problem[7, 6, 8, 20, 10]. One of the popular approaches is convex relaxations with a matrix nuclear norm to estimate a low-rank matrix. There has been much effort in modeling with low-rank tensors by extending the results on low-rank matrices, including [11, 12, 2, 3, 5]. However, unlike matrices, the rank of a tensor is not uniquely defined, and it can have multiple ranks such as the CP rank, Tucker rank, tubal rank, and multi-rank. The tubal and multi-rank for the low-rank kernel tensor were proposed by Kilmer et al.  with the algebra for a tensor and the corresponding Fourier-transformed tensor nuclear norm (TNN). In this work, we will consider a low-rank structure for the tensor kernel capturing the interaction, which can be viewed as a low-rank approximation to capture the dominant mode of the influence functions. We will also consider the tubal and multi-rank for the low-rank kernel tensor.
In this paper, we present a new approach to estimate the general influence functions for spatio-temporal Hawkes processes by formulating it as a tensor recovery problem. The location dependent influence function is parameterized as a tensor kernel. We further assume a low-rank structure for the tensor kernel as a low-rank approximation to the true influence functions. We estimate the tensor kernel using maximum likelihood with constrained Fourier transformed nuclear norm (TNN) on the tensor, which leads to a convex optimization problem. Theoretical performance guarantees are presented for squared recovery error. We also present a computationally efficient algorithm based on the alternating direction method of multipliers (ADMM) to solve the optimization problem. We demonstrate the efficiency of our estimation procedure with numerical simulations. Note that our approach is different from the previous works [1, 25] since we consider the influence function as a tensor kernel in the discrete-time and discrete location set-up.
The rest of the paper is organized as follows. Section 2 presents the problem setup. Section 3 contains the main theoretical performance upper bound. Section 4 proposes an ADMM based algorithm to solve the optimization problem. Section 5 contains the numerical study and finally, Section 6 concludes the paper. The proofs are delegated to the Appendix.
2 Spatio-temporal interaction as low-rank tensor
We first introduce some preliminaries. Assume that we have a sequence of discrete event data in the form of , where denote the location and denotes the time of the th event. Consider a spatio-temporal point process whose events occur at time , and the location . Define a counting process , where is the set of nonnegative integers. is the number of events in the region and the time window , where and are the Borel -algebras of and . Define is the -algebras of data before time which is a filtration. In this section, we will first generate an approximation of Hawkes process in discrete time and then formulate our problem.
2.1 Discrete Hawkes processes
The conditional intensity function of a point process is defined as
For Hawkes processes, we can define the conditional intensity function with the following form:
where is the base intensity of location and is the kernel function. To discretize the process in both space and time, we define the “bin counts” over a discrete time horizon :
Let , and be small and . Suppose and are bounded, so there exist , such that and we assume , for some , where is the set of positive integers. Let
For discrete Hawkes process, we have
We will derive the discrete form of the intensity function . In general, has the following form:
where is a non-negative monotone decreasing function, and for large , goes to zero. A standard choice of is a class of exponential kernels, . Thus, according to (1), we have the following approximation for the discretized intensity function with a memory depth :
for , where and are the discretized version of base intensity and the kernel , respectively. Our gaol is to estimate the true parameters and .
Note that this approach is also considered in  to estimate Hawkes processes under discrete-time. However, the difference is that we are dealing with a more complex setting with higher dimensions (two dimensions in location and one in time), and we discretize the location space and time-space simultaneously, leading to the tensor . Our interpretation of the discretized version of the kernel function enables the presence of space-time interactions. Most importantly, we will impose constraints on the rank of the tensor and entrywise bounds upon the estimators. That strategy brings about a better estimation of the Hawkes process when its coefficients are sparse and low-rank, which is a common case. Thus, different from , which employed the projection method on the approximated time series model INAR(), we construct a convex optimization problem based on the maximum likelihood formulation and our corresponding regularization.
To estimate the base intensity matrix and the underlying tensor , we make the following assumptions on the candidate set. First, we assume that each entry of and of has an upper bound, say, and respectively. This assumption is common in the real world, and it is necessary to ensure that our problem is well-posed. Second, we assume that there exist nonnegative constants and such that , and for all . This condition ensures the analysis in the following theoretical part on bounding the estimation errors. Finally, we assume that the tensor has low transformed multi-rank , where , being the th frontal face of , and define . This assumption is based on that high correlations exist within locations and time. A similar approach has been applied to analyze multivariate spatio-temporal data in , where a general low-rank tensor completion model was employed with no additional matrix parameter.
2.2 Low-rank tensor regularization
In this section, we introduce the tensor nuclear norm (TNN), which we use to guarantee the low-rankness of . Before moving on, we first summarize the notations. For matrix , let
be the spectral norm, the largest absolute singular value,be the nuclear norm, the sum of the singular values, be the Frobenius norm, and be the infinity norm. For a 3-way tensor , we denote the th frontal face of by . Let be the tensor obtained by applying the Fast Fourier Transform (FFT) to the mode-3 fibers of . Define the norm of tensor, , . We introduce three operators for a tensor: fold, unfold and bcirc.
For two tensors and , the t-product is defined as
Note that Kilmer et al. 
proposed a singular value decomposition (SVD) method for three way tensors. Based on this tensor SVD, TNN is proposed by Semerci et al.. We review some background material on the tensor SVD mostly referring to . Consider a tensor , the following lemma describes the block diagonalization property of block circulant matrices.
For , . We have
where is the Kronecker product, and are the identity matrices in and , respectively, is the normalized discrete Fourier Transform matrix, denotes the conjugate transpose of F, and s are the frontal faces of the tensor .
The following theorem provides the tensor-SVD for three-way tensors:
[16, Theorem 4.1] Any tensor can be factored as
where and extract the corresponding tensor frontal slice.
where is the SVD. With the above properties, TNN can be defined as follows.
[11, Section 4.2] The tensor nuclear norm for is defined as
Note that the dual norm of the tensor nuclear norm is the tensor spectral norm .
2.3 Maximum likelihood estimation
By our assumptions on the low-rank tensor , we have
Accordingly, the candidate set for the true value is defined as:
We consider a formulation by maximizing the log-likelihood function of the optimization variable and given our observations . The log-likelihood function is given by
Based on the candidate set defined in the previous section, an estimator can be obtained by solving the following convex optimization problem.
Note that the convex optimization problem (24) constrained in a candidate set can also be formulated as a regularized maximum likelihood function problem. Indeed, there exists a constant such that problem (24) equals to
with the natural constraint upon the entries:
for . It follows from the duality theory in optimization (see  for more details). In the algorithm part (Section 4), we use the regularized form.
3 Theoretical guarantee
We present an upper bound for the sum of squared errors of the two estimators, which is defined by
where and are the optimal solution to (24).
Let us first define the “condition number” as in  to describe our theoretical guarantee.
Given and , the condition number is defined by
where () denotes that is a positive semidefinite matrix (a positive definite matrix, respectively). Note that when .
Now, we present our main theorem.
If there exist constants and such that :
and , we derive the following nonrandom upper bound
From Remark 2, we observe that, for given data , the upper bound can be regarded as an increasing function of . More precisely, when is fixed, increases with the upper bound on the tensor nuclear norm . It implies that the upper bound of the estimation error grows with that of the tensor nuclear norm. It is a characteristic that we can expect from the low-rank tensor recovery. We note that it is also demonstrated in the experimental results in Section 5.
We observe that by fixing the upper bound (26) tends to as at the rate of .
The proof for Theorem 3
is presented in the Appendix. In the proof, the Kullback-Leibler (KL) divergence and Hellinger distance are defined between two Poisson distributions. For any two Poisson meanand , the KL -divergence is defined as
and the Hellinger distance as
Then, a lower bound is derived for using the Helliger distance and Lemma 8 in . Furthermore, we establish an upper bound on the sum of the KL divergence using the Azuma Hoeffding inequality. By combining the lower and upper bound, we obtain the upper bound for the estimation error.
Corollary 1 immediately follows from Theorem 3. In particular, it demonstrates the data driven upper bound for the sum of KL divergence between the estimated and the true intensity functions.
Assume and are the optimal solution to (24). With the notations defined in Theorem 3, for every , , it holds that
with probability at least .
For the proposed convex optimization problem (24), we apply the alternating direction method of multipliers (ADMM) and the majorization-minimization (MM) algorithms. Based on the ADM4 algorithm proposed in , we design our algorithm for the problem (24). First, define the closed and convex sets:
where . Then, the problem (24) can be written as
By introducing auxiliary variables and , (38) can be equivalently expressed as
Consider the following augmented Lagrangian of (40) to apply the ADMM:
where , and is are the dual variables, is a penalty parameter. Here, and are defined as
The ADMM for solving the above augmented Lagrangian problem requires the following steps:
We start with deriving the updating step for and as follows. Note that for the step (47), the following optimization problem is considered:
Since no closed form solutions exits, we apply the MM algorithm as in . For any , let be a function such that
where and are estimates of and . Then, we can obtain optimal solutions of convex problem(55) by using the iterative procedure:
where the sets