Poisson intensity estimation with reproducing kernels

by   Seth Flaxman, et al.
University of Oxford

Despite the fundamental nature of the inhomogeneous Poisson process in the theory and application of stochastic processes, and its attractive generalizations (e.g. Cox process), few tractable nonparametric modeling approaches of intensity functions exist, especially when observed points lie in a high-dimensional space. In this paper we develop a new, computationally tractable Reproducing Kernel Hilbert Space (RKHS) formulation for the inhomogeneous Poisson process. We model the square root of the intensity as an RKHS function. Whereas RKHS models used in supervised learning rely on the so-called representer theorem, the form of the inhomogeneous Poisson process likelihood means that the representer theorem does not apply. However, we prove that the representer theorem does hold in an appropriately transformed RKHS, guaranteeing that the optimization of the penalized likelihood can be cast as a tractable finite-dimensional problem. The resulting approach is simple to implement, and readily scales to high dimensions and large-scale datasets.



page 18


Shrinkage priors for nonparametric Bayesian prediction of nonhomogeneous Poisson processes

We consider nonparametric Bayesian estimation and prediction for nonhomo...

Erlang mixture modeling for Poisson process intensities

We develop a prior probability model for temporal Poisson process intens...

The theory and application of penalized methods or Reproducing Kernel Hilbert Spaces made easy

The popular cubic smoothing spline estimate of a regression function ari...

Additive Poisson Process: Learning Intensity of Higher-Order Interaction in Stochastic Processes

We present the Additive Poisson Process (APP), a novel framework that ca...

Structure-preserving numerical methods for stochastic Poisson systems

We propose a class of numerical integration methods for stochastic Poiss...

Sparse Representations of Positive Functions via Projected Pseudo-Mirror Descent

We consider the problem of expected risk minimization when the populatio...

Intensity Estimation on Geometric Networks with Penalized Splines

In the past decades, the growing amount of network data has lead to many...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Poisson processes are ubiquitous in statistical science, with a long history spanning both theory (e.g. [kingman1993poisson]) and applications (e.g. [diggle2013spatial]), especially in the spatial statistics and time series literature. Despite their ubiquity, fundamental questions in their application to real datasets remain open. Namely, scalable nonparametric models for intensity functions of inhomogeneous Poisson processes are not well understood, especially in multiple dimensions since the standard approaches, based on kernel smoothing, are akin to density estimation and hence scale poorly with dimension. In this contribution, we propose a step towards such scalable nonparametric modeling and introduce a new Reproducing Kernel Hilbert Space (RKHS) formulation for inhomogeneous Poisson process modeling, which is based on the Empirical Risk Minimization (ERM) framework. We model the square root of the intensity as an RKHS function and consider a risk functional given by a penalized version of the inhomogeneous Poisson process likelihood. However, standard representer theorem arguments do not apply directly due to the form of the likelihood. Namely, the fundamental difference arises since the observation that no points occur in some region is just as important as the locations of the points that do occur. Thus, the likelihood depends not only on the evaluations of the intensity at the observed points, but also on its integral across the domain of interest. As we will see, this difficulty can be overcome by appropriately adjusting the RKHS under consideration. We prove a version of the representer theorem in this adjusted RKHS, which coincides with the original RKHS as a space of functions but has a different inner product structure. This allows us to cast the estimation problem as an optimization over a finite-dimensional subspace of the adjusted RKHS. The derived method is demonstrated to give better performance than a naïve unadjusted RKHS method which resorts to an optimization over a subspace without representer theorem guarantees. We describe cases where adjusted RKHS can be described with explicit Mercer expansions and propose numerical approximations where Mercer expansions are not available. We observe strong performance of the proposed method on a variety of synthetic, environmental, crime and bioinformatics data.

2 Background and related work

2.1 Poisson process

We briefly state relevant definitions for point processes over domains , following [cressie2011]. For Lebesgue measurable subsets , denotes the number of events in . is a stochastic process characterizing the point process. Our focus is on providing a nonparametric estimator for the first-order intensity of a point process, which is defined as:


The inhomogeneous Poisson process is driven solely by the intensity function :


In the homogeneous Poisson process, is constant, so the number of points in any region simply depends on the volume of , which we denote :


For a given intensity function , the likelihood of a set of points observed over a domain is given by:


2.2 Reproducing Kernel Hilbert Spaces

Given a non-empty domain and a positive definite kernel function , there exists a unique reproducing kernel Hilbert space (RKHS) . An RKHS is a space of functions , in which evaluation is a continuous functional, meaning it can be represented by an inner product for all (this is known as the reproducing property), cf. BerTho04. While is in most interesting cases an infinite-dimensional space of functions, due to the classical representer theorem [KIMELDORF1971], [scholkopf2002learning, Section 4.2], optimization over is typically a tractable finite-dimensional problem. In particular, if we have a set of observations , and consider the problem:


where depends on through its evaluations on the set of observations only, and is a non-decreasing function of the RKHS norm of , there exists a solution to Eq. (2.5) of the form , and the optimization can thus be cast in terms of the so-called dual coefficients . This formulation is widely used in the framework of regularized Empirical Risk Minimization (ERM) for supervised learning, where

is the empirical risk corresponding to a loss function

, e.g. squared loss for regression, logistic or hinge loss for classification.

If domain is compact and kernel is continuous, one can assign to its integral kernel operator , given by

, which is positive, self-adjoint and compact. There thus exists an orthonormal set of eigenfunctions


and the corresponding eigenvalues

, with as . This spectral decomposition of leads to Mercer’s representation of kernel function [scholkopf2002learning, Section 2.2]:


with uniform convergence on . Any function can then be written as where .

Note that above we have focused on Mercer expansion with respect to the Lebesgue measure, but other base measures are also often considered in literature, e.g. [rasmussen2006gaussian, section 4.3.1].

2.3 Related work

The classic approach to nonparametric intensity estimation is based on smoothing kernels [ramlau-hansen1983, diggle1985kernel]

and has a form closely related to the kernel density estimator:


where is a smoothing kernel (related to but distinct from the RKHS kernels described in the previous section), that is, any bounded function integrating to . Early work in this area focused on edge-corrections and methods for choosing the bandwidth [diggle1985kernel, berman1989estimating, brooks1991asymptotic]. Connections with RKHS have been considered by, for example, bartoszynski1981some who use a maximum penalized likelihood approach based on Hilbert spaces to estimate the intensity of a Poisson process. There is long literature on maximum penalized likelihood approaches to density estimation, which also contain interesting connections with RKHS, e.g. [silverman1982].

Much recent work on estimating intensities for point processes has focused on Bayesian approaches to modeling Cox processes. The log Gaussian Cox Process [moller1998log] and related parameterizations of Cox (doubly stochastic) Poisson processes in terms of Gaussian processes have been proposed, along with Monte Carlo [adams2009tractable, diggle2013spatial, teh2011gaussian], Laplace approximation [illian2012toolbox, cunningham2008fast, flaxman2015fast] and variational [lloyd2015variational, yves2015scalable] inference schemes.

Another related body of literature concerns Cox processes with intensities parameterized as the sum of squares of Gaussian processes, called the permanent process [mccullagh2006permanental]. Interestingly, calculating the density of the permanent process relies on a kernel transformation similar to the one we propose below. Unlike these approaches, however, we are not working in a doubly stochastic (Cox process) framework; rather we are taking a penalized maximum likelihood estimation perspective to estimate the intensity of an inhomogeneous Poisson process. As future work, it would be worthwhile to explore deeper connections between our formulation and the permanent process, e.g. by considering an RKHS formulation of Cox processes or by considering an inhomogeneous Poisson process whose intensity is the sum of squares of functions in an RKHS.

3 Proposed method and kernel transformation

Let be a compact domain of observations. Let be a continuous positive definite kernel, and its corresponding RKHS of functions . We model the intensity function of an inhomogeneous Poisson process as:


which is parameterized by and an additional scale parameter . The flexibility of choosing means that we can encode structural assumptions of our domain, e.g. periodicity in time or periodic boundary conditions (see Section 4.1.1). Note that we have squared to ensure that the intensity is non-negative on , a pragmatic choice that has previously appeared in the literature (e.g. [lloyd2015variational]). While we lose identifiability (since and are equivalent), as shown below we end up with a finite dimensional, and thus tractable, optimization problem.

The rationale for including is that it allows us to decouple the overall scale and units of the intensity (e.g. number of points per hour versus number of points per year) from the penalty on the complexity of which arises from the classical regularized Empirical Risk Minimization framework (and which should depend only on how complex, i.e. “wiggly” is).

We use the inhomogeneous Poisson process likelihood from Eq. (2.4) to write the log-likelihood of a Poisson process corresponding to the observations , for , and intensity :


We will consider the problem of minimization of the penalized negative log likelihood, where the regularization term corresponds to the squared Hilbert space norm of in parametrization Eq. (3.1):


This objective is akin to a classical regularized empirical risk minimization framework over RKHS: there is a term that depends on evaluations of at the observed points as well as a term corresponding to the RKHS norm. However, the representer theorem does not apply directly to Eq. (3.3): since there is also a term given by the -norm of , there is no guarantee that there is a solution of Eq. (3.3) that lies in . We will show that Eq. (3.3) fortunately still reduces to a finite-dimensional optimization problem corresponding to a different kernel function which we define below.

Using the Mercer expansion of in Eq. (2.6), we can write the objective Eq. (3.3) as follows:


The last two terms can now be merged together, giving

Now, if we define kernel to be the kernel corresponding to the integral operator , i.e., is given by:

we see that:


Thus, we have merged the two squared norm terms into a squared norm in a new RKHS. We note that a similar idea has previously been used to modify Gaussian process priors in [Csato2001], albeit in a different context, and that a similar transformation appears in the expression for the distribution of a permanent process [mccullagh2006permanental]. We are now ready to state the representer theorem in terms of kernel .

Theorem 1.

There exists a solution of Eq. (3.3) for observations , which takes the form .


Since if and only if , i.e. , we have that the two spaces correspond to exactly the same set of functions. Optimization over is therefore equivalent to optimization over .

The proof now follows by applying the classical representer theorem in to the representation of the objective function in Eq. (3.6). We decompose as the sum of two functions:


where is orthogonal in to the span of . We prove that the first term in the objective given in Eq. (3.6), , is independent of . It depends on only through the evaluations for all . Using the reproducing property we have:


where the last step is by orthogonality. Next we substitute into the regularization term:


Thus, the choice of has no effect on the first term in and a non-zero can only increase the second term , so we conclude that and that is the minimizer. ∎

Remark 1. The notions of the inner product in and are different and thus in general .

Remark 2. Notice that unlike in a standard ERM setting, does not recover the unpenalized risk, because appears in . Notice further that the overall scale parameter also appears in . This is important in practice, because it allows us to decouple the scale of the intensity (which is controlled by ) from its complexity (which is controlled by ).

Illustration. The eigenspectrum of where is a squared exponential kernel is shown in Figure 1 for various settings of and . Reminiscent of spectral filtering studied by Muandet2014_spectral, in the top plot we see that depending on the settings of and , eigenvalues of are shrunk or inflated as compared to which is shown in black. In the bottom plot, the values of are shown for the same set of kernels.

Figure 1: Eigenspectrum of (top) and values of (bottom) for various settings of and .

4 Computation of

In this section, we consider first the case in which an explicit Mercer expansion is known, and then we consider the more commonly encountered situation in which we only have access to the parametric form of the kernel , so we must approximate . We show experimentally that our approximation is very accurate by considering the Sobolev kernel, which can be expressed in both ways.

4.1 Explicit Mercer Expansion

We start by assuming that we have a kernel with an explicit Mercer expansion with respect to a base measure of interest (usually the Lebesgue measure on

), so we have eigenvectors

and eigenvalues :


with an at most countable index set . Given and we can calculate:


up to a desired precision as informed by the spectral decay in . Below we consider kernels for which explicit Mercer expansions are known: a kernel on the Sobolev space

with a periodic boundary condition, the squared exponential kernel, and the Brownian bridge kernel. We also show how our formulation can be extended to multiple dimensions using a tensor product formulation. Although not practical for large datasets, the Mercer expansions given below, summing terms up to

(for which the error is less than ), can be used to evaluate approximations for the cases in which Mercer expansions are not available.

4.1.1 Sobolev space on with a periodic boundary condition

We consider a kernel on the Sobolev space on with a periodic boundary condition, proposed by wahba1990spline and recently used in bach2015equivalence. The kernel is given by:

where denotes the order of the Sobolev space and is the Bernoulli polynomial of degree applied to the fractional part of . The corresponding RKHS is the space of functions on with absolutely continuous and square integrable satisfying a periodic boundary condition , . For more details, see [wahba1990spline, Chapter 2].

Bernoulli polynomials admit a simple form for low degrees. In particular,

Moreover, note that:

Thus, the desired Mercer expansion (with respect to the Lebesgue measure) is given by with eigenfunctions and for , , and corresponding eigenvalues , .

Now, the adjusted kernel from (4.2) is given by

4.1.2 Squared exponential kernel

A Mercer expansion for the squared exponential kernel was proposed in [zhu1997gaussian] and refined in [fasshauer2012stable]. However, this expansion is with respect to a Gaussian measure on , i.e., it consists of eigenfunctions which form an orthonormal set in where . The formalism can therefore be used to estimate Poisson intensity functions with respect to such Gaussian measure. In the classical framework, where the intensity is with respect to a Lebesgue measure, numerical approximations of Mercer expansion, as described in Section 4.2 are needed. Following the exposition in [rasmussen2006gaussian, section 4.3.1] and the relevant errata111http://www.gaussianprocess.org/gpml/errata.html we parameterize the kernel as:


The Mercer expansion with respect to then has the eigenvalues


and eigenfunctions


where is the -th order (physicist’s) Hermite polynomial, , , , , and . Thus we have the following eigenvalues for :


while the eigenfunctions remain the same.

4.1.3 Brownian Bridge kernel

This is the kernel on , given by

with the eigenvalues and eigenfunctions in the Mercer expansion with respect to Lebesgue measure


Thus one can form

The functions in the corresponding RKHS are pinned to zero at both ends of the segment.

4.1.4 Extending the Mercer expansion to multiple dimensions

The extension of any kernel to higher dimensions can be constructed by considering tensor product spaces: (where and

could potentially be different kernels with different hyperparameters). If

has eigenvalues and eigenfunctions and has eigenvalues and eigenfunctions , then the eigenvalues of the product space are then given by the Cartesian product , and similarly the eigenfunctions are given by . Our regularized kernel has the following Mercer expansion:


Notice that is the kernel corresponding to the integral operator which is different than .

Notice that this approach does not lead to a method that scales well in high dimensions, which is further motivation for the approximations developed below.

4.2 Numerical approximation when Mercer expansions are not available

We propose an approximation to given access only to a kernel for which we do not have an explicit Mercer expansion with respect to Lebesgue measure. We only assume that we can form Gram matrices corresponding to and calculate their eigenvectors and eigenvalues. As a side benefit, this representation will also enable scalable computations through Toeplitz / Kronecker algebra [cunningham2008fast, gilboa2013scaling, flaxman2015fast] or primal reduced rank approximations [williams2001using].

Let us first consider the one-dimensional case and construct a uniform grid on . Then the integral kernel operator can be approximated with the (scaled) kernel matrix , where , and thus is approximately . Note that for the general case of multidimensional domains , the kernel matrix would have to be multiplied by . Without loss of generality we assume below.

We are not primarily interested in evaluations of on this grid, but on the observations . Simply adding the observations into the kernel matrix is not an option however, as it changes the base measure with respect to which the integral kernel operator is to be computed (Lebesgue measure on ). Thus, we consider the relationship between the eigendecomposition of and the eigenvalues and eigenfunctions of the integral kernel operator .

Let be the eigenvalue/eigenvector pairs of the matrix , i.e., its eigendecomposition is given by . Then the estimates of the eigenvalues/eigenfunctions of the integral operator are given by the Nyström method (see rasmussen2006gaussian and references therein, especially baker1977numerical):


with , leading to:

For an estimate of the whole matrix we thus have


The above is reminiscent of the Nyström method [williams2001using] proposed for speeding up Gaussian process regression. It has computational cost . A reduced rank representation for Eq. (4.11) is straightforward by considering only the top eigenvalues/eigenvectors of . Furthermore, a primal representation with the features corresponding to kernel is readily available and is given by


which allows linear computational cost in the number of observations.

For dimensions, one can exploit Kronecker and Toeplitz algebra approaches. Assuming that the matrix corresponds to a Cartesian product structure of the one-dimensional grids of size , one can write . Thus, the eigenspectrum can be efficiently calculated by eigendecomposing each of the smaller matrices and then applying standard Kronecker algebra, thereby avoiding ever having to form the prohibitively large matrix . For regular grids and stationary kernels, each small matrix will be Toeplitz structured, yielding further efficiency gains [wilson2015thoughts]. The resulting approach thus scales linearly in dimension .

An even simpler alternative to the above is to sample the points uniformly from the domain using Monte Carlo or Quasi-Monte Carlo (see [oates2016control] for a discussion in the context of RKHS). We found this approach to work well in practice in high-dimensions (), even when was fixed, meaning that the scaling was effectively independent of the dimension .

Using the Sobolev kernel in Sec. 4.1.1, we compared the exact calculation of with , , and to our approximate calculation. For illustration we compared a coarse grid of size 10 on the unit interval (left) to a finer grid of size 100. The RMSE was 2E-3 for the coarse grid and 1.6E-5 for the fine grid, as shown in Fig. 2. In the same figure we compared the exact calculation of with , , and to our Nyström-based approximation, where distribution. The RMSE was 0.98E-3. A low-rank approximation using only the top eigenvalues gives the RMSE of 1.6E-2. As Figure 2, demonstrates, good approximation is possible with a fairly coarse grid as well as with a low-rank approximation.

Figure 2: Using the Sobolev kernel in Sec. 4.1.1, we compared the exact calculation of with , , and to our approximate calculation. For illustration we tried a coarse grid of size 10 on the unit interval (top left) to a finer grid of size 100 (top right). The RMSE was 2E-3 for the coarse grid and 1.6E-5 for the fine grid. We compare the exact calculation of with , , and to our Nyström-based approximation, where distribution (bottom left). The RMSE was 0.98E-3. A low-rank approximation using only the top eigenvalues gives the RMSE of 1.6E-2 (bottom right).

5 Inference

The penalized risk can be readily minimized with gradient descent.222While the objective is not convex, in practice we observed very fast convergence, and stable results given random starting points. Let and be the Gram matrix corresponding to such that . Then and the gradient of the objective function from (3.6) is given by

where denotes element-wise division. Computing requires

time and memory, and each gradient and likelihood computation requires matrix-vector multiplications which are also

. Overall, the running time is for iterations of the gradient descent method, where is usually very small in practice.

5.1 Hyperparameter selection

Analogously to the classical problem of bandwidth selection in kernel intensity estimation (e.g. [diggle1985kernel, berman1989estimating, brooks1991asymptotic]), some criteria must be adopted in order to select hyperparameters of the kernel and also and . We suggest crossvalidating on the negative log-likelihood of the inhomogeneous Poisson process (i.e. before we introduced the penalty term) from Eq. (3.2). The difficulty with this approach is that we must deal with the integral of the intensity over the domain, which, for our model is generally intractable. As an approximation, we suggest either grid or Monte Carlo integration. Recall that in Section 4.2 we approximated using a set of locations . We can reuse these points to approximate the integral:


As , this approximation is given by .

6 Naïve RKHS model

In this section, we compare the proposed approach, which uses the representer theorem in the transformed kernel , to the naïve one, where a solution to Eq. (3.3) of the form is sought even though the representer theorem in need not hold. Despite being theoretically suboptimal, this is a natural model to consider, and it might perform well in practice.

The corresponding optimization problem is:

While the first and the last term are straightforward to calculate for any , needs to be estimated. As in the previous section, we consider a uniform grid or set of sampled points covering the domain and use approximation


The optimization problem thus reads:


As above, the gradient of this objective with respect to can be readily calculated, and optimized with gradient descent.

7 Experiments

We use cross-validation to choose the hyperparameters in our methods: , the fixed intensity, , the roughness penalty, and the length-scale of the kernel , minimizing the negative log-likelihood as described in Section 5.1.

To calculate RMSE, we either make predictions at a grid of locations and calculate RMSE compared to the true intensity at that grid or for the high-dimensional synthetic example we pick a new uniform sample of locations over the domain and calculate the RMSE at these locations. We used limited memory BFGS in all experiments involving optimization, and found that it converged very quickly and was not sensitive to initial values. Code for our experiments is available at https://github.com/BigBayes/kernelpoisson.

7.1 1-d synthetic Example

We generated a synthetic intensity using the Mercer expansion of a SE kernel with lengthscale , producing a random linear combination of 64 basis functions, weighted with iid draws . In Fig. 7.1 we compare ground truth to estimates made with: our RKHS method with SE kernel, the naïve RKHS approach with SE kernel, and classical kernel intensity estimation with bandwidth selected by crossvalidation. The results are typical of what we observed on 1D and 2D examples: given similar kernel choices, each method performed similarly, and numerically there was not a significant difference in terms of the RMSE compared to the true underlying intensity.

A synthetic dataset, comparing our RKHS method, the naïve model, and kernel smoothing to a synthetic intensity “true”. The rug plot at bottom gives the location of points in the realized point pattern. The RMSE for each method was similar.    

(a) KIE with edge correction
(b) KIE without edge correction
(c) Our RKHS method with
(d) Naïve RKHS method
Figure 3: Location of white oak trees in Lansing, Michigan, smoothed with various approaches. Squared exponential kernels are used throughout. Edge correction makes a noticeable difference for classical kernel intensity estimation. Comparing (a) and (c) it is clear that our method is automatically performing edge correction.
Dataset Kernel intensity estimation Naïve approach Our approach with
Lansing: Black oak (n = 135) 234 233 227
Hickory (n = 703) 1763 1746 1757
Maple (n = 514) 1239 1228 1233
Misc (n = 105) 179 177 172
New Zealand (n = 86) 119 119 119
Red oak (n = 346) 726 726 739
Redwoods in California (n = 62) 79 84 77
Spruces in Saxonia (n = 134) 215 212 212
Swedish pines (n = 71) 91 89 90
Waka national park (n = 504) 1142 1141 1144
White oak (n = 448) 992 992 996
Table 1: Tree Point Patterns from R Package spatstat

7.2 Environmental datasets

Next we demonstrate our method on a collection of two-dimensional environmental datasets giving the locations of trees. Intensity estimation is a standard first step in both exploratory analysis and modelling of these types of datasets, which were obtained from the R package spatstat. We calculated the intensity using various approaches: our proposed RKHS method with with a squared exponential kernel, the naïve RKHS method with squared exponential kernel, and classical kernel intensity estimation (KIE) with edge correction. Each method used a squared exponential kernel. We report average held-out cross-validated likelihoods in Table 1. With the exception of our method performing better on the Red oak dataset, each method had comparable performance. It is interesting to note, however, that our method does not require any explicit edge correction333Because no points are observed outside the window , intensity estimates near the edge are biased downwards [jones1993simple]., because we are optimizing a likelihood which explicitly takes into account the window. A plot of the resulting intensity surfaces for each method and the effect of edge correction are shown in Fig. 3 for the Black oak dataset.

7.3 High dimensional synthetic examples

We generated random intensity surfaces in the unit hypercube for dimensions . The intensity was given by a constant multiplied by the square of the sum of 20 multivariate Gaussian pdfs with random means and covariances. The constant was automatically adjusted so that the number of points in the realizations would be held close to constant, in the range 190-210. We expected this to be a relatively simple synthetic example for kernel intensity estimation with a Gaussian kernel in low dimensions, but not in high dimensions. From each random intensity, we generated two random realizations, and trained our model using 2-fold crossvalidation with these two datasets. We predicted the intensity at a randomly chosen set of points and calculated the mean squared error as compared to the true intensity. For each dimension we repeated this process 100 times comparing kernel intensity estimation, the naïve approach, and our approach with .

Using the same procedure, but a sum of 20 multivariate Student-t distributions with 5 degrees of freedom, random means and covariances, and number of points in the realizations ranging from 10 to 1000, we generated 500 random surfaces, with dimension

. We expected this to be a difficult synthetic example for all of the methods due to a potential for model misspecification, as we continue to use squared exponential kernels, but the intensity surface is non-Gaussian.

As shown in Fig. 6(a) once we reach dimension 9 and above, our RKHS method with begins to outperform kernel intensity estimation, where performance is measured as the fraction of times that the MSE is smaller across 100 random datasets for each . Our method also significantly outperforms the naïve RKHS method as shown in Fig. 6(b). For the difference between the two RKHS methods is not significant. This could be due to the fact that the number of points in the point pattern remains fixed, so the problem becomes very hard in high dimensions.444Note that our experiments are sensitive to the overall number of points in the synthetic point patterns; since kernel density estimation is a consistent method [wied2012consistency], we should expect kernel intensity estimation to become more accurate as the number of points grows. However, consistency in the sense of classical statistics is not necessarily useful in point processes, because our observations are not iid; the number of points that we observe is in fact part of the dataset since it reflects the underlying intensity. As shown in the Fig. 6(c), kernel intensity estimation is almost always better than the naïve RKHS approach, although the difference is not significant in high dimensions.

For the Student-t experiment, as shown in Fig. 6(d)-(f), our RKHS method always outperforms kernel intensity estimation and is better than the naïve method in dimensions below . To assess the amount of improvement, rather than just its statistical significance, we compared the percent improvement in terms of MSE gained by our method versus the competitors, just focusing on in Fig. 7

. On this metric (intuitively, “how much do you expect to improve on average”) our method shows reasonably stable results as compared to KIE, while the performance of the naïve method is revealed to be very variable. Indeed, the standard deviation across the random surfaces for

of the MSE was 56 for both our method and KDE but 166 for the naïve method, perhaps due to overfitting.

7.4 Computational complexity

Using the synthetic data experimental setup, we evaluated the time complexity of our method with respect to dimensionality , number of points in the point pattern dataset , and number of points used to estimate (Fig. 5), confirming our theoretical analysis.

The effect of the dimensionality was negligible in practice, because the main calculations rely only on an Gram matrix whose calculation is relatively fast even for high dimensions. Our method’s time complexity scales as as shown in Fig. 4 (but as discussed in Section 4.2, a primal representation is available which would give linear scaling.) where we used sample points to estimate . While a small worked well in practice, we investigated much larger values of . As shown in Fig. 5 the time complexity scaled as where the number of points was fixed to be 150; note that we fixed the rank of the eigendecomposition to be 20.

Figure 4: Run-time of our method versus number of points in the point pattern dataset.
Figure 5: Run-time of our method versus number of sample points used to calculate .
Figure 6: Three methods were compared: our RKHS method, the naïve RKHS method, and kernel intensity estimation, based on 100 random surfaces for each dimension

in two experimental setups. In (a)-(c), the intensity surface was the squared sum of skewed multivariate Gaussians. In (d)-(f) the surface was a mixture of skewed multivariate Student-t distributions, with 5 degrees of freedom. In (a) and (d): comparison of our RKHS method versus KIE. In (b) and (e): our RKHS method versus the naïve RKHS method. In (c) and (f): comparison of KIE and the naïve RKHS approach. We used squared exponential kernels for all methods. In the Gaussian case (a)-(c), our method significantly outperforms kernel intensity estimation as the dimension increases, and outperforms the naïve method throughout. Kernel intensity estimation almost always outperforms the naïve approach. In the Student-t case (d)-(f), our method always outperforms kernel intensity estimation, and outperforms the naïve approach until very high dimensions. Neither kernel intensity estimation nor the naïve approach are consistently better than each other.

Figure 7: To understand the practical (as opposed to statistical) significance of the results in Fig. 6(d)-(f), where we generated random surfaces by squaring the sums of multivariate Student-t distributions, we considered dimension , in which our RKHS method was better than both the naïve method and kernel intensity estimation (KIE) but there was not a significant difference between KIE and the naïve method, and calculated the percent improvement in terms of MSE comparing our RKHS method to the naïve method (left) and our RKHS method to and kernel intensity estimation (KIE) (right). The improvement of our method over KIE is apparent, albeit perhaps only modest in this example. Meanwhile, our method is sometimes quite a bit better than the naïve method, which is often very inaccurate.
Figure 8: Log-likelihood for various frequencies of a periodic spatiotemporal kernel in a dataset of 18,441 geocoded, date-stamped theft events from Chicago, using our RKHS model. The dataset is for 12 weeks starting January 1, 2004, and the maximum log-likelihood is attained when the frequency is 12, meaning that there is a weekly cycle in the data. Results using the naïve model were less sensible, with a maximum at period 1 (indicating no periodicity), with periods 5 and 2 (corresponding to a 16.8 day cycle and a 42 day cycle) also having high likelihoods.

7.5 Spatiotemporal point pattern of crimes

To demonstrate the ability to use domain specific kernels and learn interpretable hyperparameters, we used 12 weeks (84 days) of geocoded, date-stamped reports of theft obtained from Chicago’s data portal (data.cityofchicago.org) starting January 1, 2004, a relatively large spatiotemporal point pattern consisting of 18,441 events. We used the following kernel: which is the product of a separable squared exponential space and decaying periodic time kernel (with frequency in a time domain normalized to range from to ) plus a separable squared exponential space and time kernel. After finding reasonable values for the lengthscales and other hyperparameters of through exploratory data analysis, we used 2-fold cross-validation and calculated average test log-likelihoods for the number of total cycles in the 84 weeks or equivalently a period of length 12 weeks (meaning no cycle), 6 weeks, …, 6 days. These log-likelihoods are shown in Fig. 8; we found that the most likely frequency is 12, or equivalently a period lasting 1 week. This makes sense given known day-of-week effects on crime.

8 Conclusion

We presented a novel approach to inhomogeneous Poisson process intensity estimation using a representer theorem formulation in an appropriately transformed RKHS, providing a scalable approach giving strong performance on synthetic and real-world datasets. Our approach outperformed the classical baseline of kernel intensity estimation and a naïve approach for which the representer theorem guarantees did not hold. In future work, we will consider marked Poisson processes and other more complex point process models, as well as Bayesian extensions akin to Cox process modeling.