Prediction of Spatial Point Processes: Regularized Method with Out-of-Sample Guarantees

07/03/2020 ∙ by Muhammad Osama, et al. ∙ Uppsala universitet 0

A spatial point process can be characterized by an intensity function which predicts the number of events that occur across space. In this paper, we develop a method to infer predictive intensity intervals by learning a spatial model using a regularized criterion. We prove that the proposed method exhibits out-of-sample prediction performance guarantees which, unlike standard estimators, are valid even when the spatial model is misspecified. The method is demonstrated using synthetic as well as real spatial data.



There are no comments yet.


page 8

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Spatial point processes can be found in a range of applications from astronomy and biology to ecology and criminology. These processes can be characterized by a nonnegative intensity function which predicts the number of events that occur across space parameterized by [8, 4].

A standard approach to estimate the intensity function of a process is to use nonparametric kernel density-based methods [6, 7]. These smoothing techniques require, however, careful tuning of kernel bandwidth parameters and are, more importantly, subject to selection biases. That is, in regions where no events have been observed, the intensity is inferred to be zero and no measure is readily available for a user to assess the uncertainty of such predictions. More advanced methods infer the intensity by assuming a parameterized model of the data-generating process, such as inhomogeneous Poisson point process models. One popular model is the log-Gaussian Cox process (lgcp) model [9] where the intensity function is modeled as a Gaussian process [22] via a logarithmic link function to ensure non-negativity. However, the infinite dimensionality of the intensity function makes this model computationally prohibitive and substantial effort has been devoted to develop more tractable approximation methods based on gridding [9, 13], variational inference [15, 12]

, Markov chain Monte Carlo

[2] and Laplace approximations [20] for the log and other link functions. A more fundamental problem remains in that their resulting uncertainty measures are not calibrated to the actual out-of-sample variability of the number of events across space. Poor calibration consequently leads to unreliable inferences of the process.

In this paper, we develop a spatially varying intensity interval with provable out-of-sample performance guarantees. Our contributions can be summarized as follows:

  • the interval reliably covers out-of-sample events with a specified probability by building on the conformal prediction framework


  • it is constructed using a predictive spatial Poisson model with provable out-of-sample accuracy,

  • its size appropriately increases in regions with missing data to reflect inherent uncertainty and mitigate sampling biases,

  • the statistical guarantees remain valid even when the assumed Poisson model is misspecified.

Thus the proposed method yields both reliable and informative predictive intervals under a wider range of conditions than standard methods which depend on the assumed model, e.g. lgcp [9], to match the unknown data-generating process.

Notations: denotes the sample mean of . The element-wise Hadamard product is denoted .

2 Problem formulation

Figure 1: Unknown intensity function (solid) expressed in number of counts per unit of area, across a one-dimensional spatial domain which is discretized into regions. Intensity interval with out-of-sample coverage (3) inferred using samples. Estimated intensity function (dashed). Data is missing in the regions and where the intensity interval increases appropriately.

The intensity function of a spatial process is expressed as the number of events per unit area and varies over a spatial domain of interest, , which we equipartition into disjoint regions: and is a common means of modelling continuous inhomogeneous point processes, see [9, 13]. The function determines the expected number of events that occur in region by


where is the region index and is the maximum number of counts.

We observe independent samples drawn from the process,


where the data-generating distribution is unknown. Let the collection of pairwise datapoints be denoted . Given this dataset, our goal is to infer an intensity interval of the unknown spatial point process, which predicts the number of events per unit area at location . See Figure 1 for an illustration in one-dimensional space. A reliable interval will cover a new out-of-sample observation in a region with a probability of at least . That is, for a specified level the out-of-sample coverage is


where is the area of the th region. Since the trivial noninformative interval also satisfies (3), our goal is to construct that is both reliable and informative.

3 Inference method

We begin by showing that an intensity interval with reliable out-of-sample coverage can be constructed using the conformal prediction framework [19]. Note that obtaining tractable and informative intervals in this approach requires learning an accurate predictor in a computationally efficient manner. We develop such a predictor and prove that it has finite-sample and distribution-free performance guarantees. These guarantees are independent of the manner in which space is discretized.

3.1 Conformal intensity intervals


denote a predictor parameterized by a vector

. For a given region , consider a new data point , where represents number of counts and takes a value between . The principle of conformal prediction is to quantify how well this new point conforms to the observed data . This is done by first fitting parameters to the augmented set and then using the score


where is the indicator function and are residuals for all observed data points . When a new residual is statistically indistinguishable from the rest, corresponds to a p-value [19]. On this basis we construct an intensity interval by including all points that conform to the dataset with significance level , as summarized in Algorithm 1. Using [14, thm. 2.1], we can prove that satisfies the out-of-sample coverage (3).

1:Input: Location , significance level , data
2:for all do
3:Set if
4:Update predictor using augmented data
5:Compute score in (4)
6:end for
Algorithm 1 Conformal intensity interval

While this approach yields reliable out-of-sample coverage guarantees, there are two possible limitations:

  1. The residuals can be decomposed as , where the term in brackets is the model approximation error and is an irreducible zero-mean error. Obtaining informative across space requires learned predictors with small model approximation errors.

  2. Learning methods that are computationally demanding render the computation of intractable across space, since the conformal method requires re-fitting the predictor multiple times for each region.

Next, we focus on addressing both limitations.

3.2 Spatial model

We seek an accurate model of , parameterized by . For a given

, we quantify the out-of-sample accuracy of a model by the Kullback-Leibler divergence

per sample,


In general, the unknown intensity function underlying has a local spatial structure and can be modeled as smooth since we expect counts in neighbouring regions to be similar in real-world applications. On this basis, we consider following the class of models,

where is spatial basis vector whose components are given by the cubic b-spline function [21]

(see supplementary material). The Poisson distribution is the maximum entropy distribution for count data and is here parameterized via a latent field

across regions [4, ch. 4.3]. Using a cubic b-spline basis [21], we model the mean in region via a weighted average of latent parameters from neighbouring regions, where the maximum weight in is less than 1. This parameterization yields locally smooth spatial structures and is similar to using a latent process model for the mean as in the commonly used lgcp model [9, sec. 4.1].

The unknown optimal predictive Poisson model is given by


and has an out-of-sample accuracy .

3.3 Regularized learning criterion

We propose learning a spatial Poisson model in using the following learning criterion


where is the log-likelihood, which is convex [18], and is a given vector of regularization weights. The regularization term in (7) not only mitigates overfitting of the model by penalizing parameters in individually, it also yields the following finite sample and distribution-free result.

Theorem 1

Let , then the out-of-sample accuracy of the learned model is bounded as


with a probability of at least

We provide an outline of the proof in Section 3.3.1, while relegating the details to the Supplementary Material. The above theorem guarantees that the out-of-sample accuracy of the learned model (7) will be close to of the optimal model (6), even if the model class (3.2) does not contain the true data-generating process. As is increased, the bound tightens and the probabilistic guarantee weakens, but for a given data set one can readily search for the value of which yields the most informative interval .

The first term of (7) contains inner products which are formed using a regressor matrix. To balance fitting with the regularizing term in (7), it is common to rescale all columns of the regressor matrix to unit norm. An equivalent way is to choose the following regularization weights , see e.g. [3]. We then obtain a predictor as

and predictive intensity interval via Algorithm 1. Setting in (7) yields a maximum likelihood model with less informative intervals, as we show in the numerical experiments section.

3.3.1 Proof of theorem

The minimizer in (7) satisfies


where is the in-sample divergence, corresponding to (5), and .

Using the functional form of the Poisson distribution, we have

Then the gap between the out-of-sample and in-sample divergences for any given model is given by


where the second line follows from using our Poisson model and is a constant. Note that the divergence gap is linear in , and we can therefore relate the gaps for the optimal model with the learned model as follows:



is the gradient of (10

) and we introduce the random variable

for notational simplicity (see supplementary material).

Inserting (9) into (11) and re-arranging yields


where the RHS is dependent on . Next, we upper bound the RHS by a constant that is independent of .

The weighted norm has an associated dual norm

see the supplementary material. Using the dual norm, we have the following inequalities

and combining them with (12), as in [23], yields


when . The probability of this event is lower bounded by


We derive this bound using Hoeffding’s inequality, for which


and . Moreover,

using DeMorgan’s law and the union bound (see supplementary material). Setting , we obtain (14) Hence equation (13) and (14) prove Theorem 1. It can be seen that for , the probability bound on the right hand side increases with .

3.3.2 Minimization algorithm

To solve the convex minimization problem (7) in a computationally efficient manner, we use a majorization-minimization (MM) algorithm. Specifically, let and then we bound the objective in (7) as


where is a quadratic majorizing function of such that , see [18, ch. 5]. Minimizing the right hand side of (16

) takes the form of a weighted lasso regression and can therefore be solved efficiently using coordinate descent. The pseudo-code is given in Algorithm 

2, see the supplementary material for details. The runtime of Algorithm 2 scales as i.e. linear in number of datapoints . This computational efficiency of Algorithm 2 is leveraged in Algorithm 1 when updating the predictor with an augmented dataset . This renders the computation of tractable across space.

1:Input: Data , parameter and
2:Form weights for
5:Form quadratic approximation at :
6:Solve using coordinate descent
9:Output: and
Algorithm 2 Majorization-minimization method

The code for algorithms 1 and 2 are provided on github.

4 Numerical experiments

We demonstrate the proposed method using both synthetic and real spatial data.

4.1 Synthetic data with missing regions

To illustrate the performance of our learning criterion in (7), we begin by considering a one-dimensional spatial domain , equipartitioned into regions. Throughout we use in (7).

Comparison with log-Gaussian Cox process model

We consider a process described by the intensity function


and sample events using a spatial Poisson process model using inversion sampling [5]. The distribution is then Poisson. Using a realization , we compare our predictive intensity interval with a

-credibility interval

obtained by assuming an lgcp model for the [9] and approximating its posterior belief distribution using integrated nested Laplace approximation (inla) [17, 11]. For the cubic b-splines in , the spatial support of the weights in was set to cover all regions.

We consider interpolation and extrapolation cases where the data is missing across

and , respectively. Figures 1(a) and 1(b) show the intervals both cases. While is tighter than in the missing data regions, it has no out-of-sample guarantees and therefore lacks reliability. This is critically evident in the extrapolation case, where becomes noninformative further away from the observed data regions. By contrast, provides misleading inferences in this case.

Comparison with unregularized maximum likelihood model

Next, we consider a three different spatial processes, described by intensity functions

For the first process, the intensity peaks at , the second process is periodic with a period of spatial units, and for the third process the intensity grows monotonically with space . In all three cases, the number of events in a given region is then drawn as

using a negative binomial distribution, with mean given by (

1) and number of failures set to , yielding a dataset . Note that the Poisson model class is misspecified here.

We set the nominal out-of-sample coverage and compare the interval sizes across space and the overall empirical coverage, when using regularized and unregularized criteria (7), respectively. The averages are formed using Monte Carlo simulations.

Figure 1(c) and Table 1 summarize the results of comparison between the regularized and unregularized approaches for the three spatial processes. While both intervals exhibit the same out-of-sample coverage (table 1), the unregularized method results in intervals that are nearly four times larger than those of the proposed method (figure 1(c)) in the missing region.

(a) Interpolation with data missing in
(b) Extrapolation with data missing in
(c) Average interval size with data missing in
Figure 2: (a) Interpolation and (b) extrapolation with (grey) and (green) with , for a given realization of point data (black crosses). The unknown intensity function (red) gives the expected number of events in a region, see (1). (c) Misspecified case with average intensity interval size , using nonzero (blue) and zero (red) regularization weights in (7). Data in is missing. The different markers correspond to three different spatial processes, with intensity functions , and . The out-of-sample coverage (3) was set to be at least and the empirical coverage is given in 1.
Empirical coverage of [%]
Proposed Unregularized
97.05 97.37
91.05 98.32
81.37 95.37
Table 1: Comparison of empirical coverage of , using the proposed regularized vs. the unregularized maximum likelihood method. We target coverage.

4.2 Real data

In this section we demonstrate the proposed method using two real spatial data sets. In two-dimensional space it is challenging to illustrate a varying interval , so for clarity we show its maximum value, minimium value and size as well as compare it with a point estimate obtained from the predictor, i.e.,


Throughout we use in (7).

Hickory tree data

First, we consider the hickory trees data set [1] which consists of coordinates of hickory trees in a spatial domain , shown in Figure 2(a), that is equipartitioned into a regular lattice of hexagonal regions. The dataset contains the observed number of trees in each region. The dashed boxes indicate regions data inside which is considered to be missing. For the cubic b-splines in , the spatial support was again set to cover all regions.

We observe that the point predictor interpolates and extrapolates smoothly across regions and appears to visually conform to the density of the point data. Figures 2(b) and 2(c) provide important complementary information using , whose upper limit increases in the missing data regions, especially when extrapolating in the bottom-right corner, and lower limit rises in the dense regions.

The size of the interval quantifies the predictive uncertainty and we compare it to the credibility interval using the lgcp model as above, cf. Figures 3(a) and 3(b). We note that the sizes increase in different ways for the missing data regions. For the top missing data region, is virtually unchanged in contrast to . While appears relatively tighter than across the bottom-right missing data regions, the credible interval lacks any out-of-sample guarantees that would make the prediction reliable.

Figure 3: # trees per . Nominal coverage set to . The dashed boxes mark missing data regions.
Figure 4: # trees per . Comparison between proposed intensity interval and credibility intensity interval from approximate posterior of lgcp model.

Crime data

Next we consider crime data in Portland police districts [16, 10] which consists of locations of calls-of-service received by Portland Police between January and March 2017 (see figure 4(a)). The spatial region is equipartitioned into a regular lattice of hexagonal regions. The dataset contains the reported number of crimes in each region. The support of the cubic b-spline is taken to be .

The point prediction is shown in Figure 4(a), while Figures 4(b) and 4(c) plot the upper and lower limits of , respectively. We observe that follows the density of the point pattern well, predicting a high intensity of approximately crimes/ in the center. Moreover, upper and lower limits of are both high where point data is dense. The interval tends to being noninformative for regions far away from those with observed data, as is visible in the top-left corner when comparing Figures 4(b) and 4(c).

Figure 5: # crimes per in Portland, USA. Nominal coverage set to .

5 Conclusion

We have proposed a method for inferring predictive intensity intervals for spatial point processes. The method utilizes a spatial Poisson model with an out-of-sample accuracy guarantee and the resulting interval has an out-of-sample coverage guarantee. Both properties hold even when the model is misspecified. The intensity intervals provide a reliable and informative measure of uncertainty of the point process. Its size is small in regions with observed data and grows along missing regions further away from data. The proposed regularized learning criterion also leads to more informative intervals as compared to an unregularized maximum likelihood approach, while its statistical guarantees renders it reliable in a wider range of conditions than standard methods such as lgcp inference. The method was demonstrated using both real and synthetic data.


The work was supported by the Swedish Research Council (contract numbers and ).


  • [1] P. J. Diggle @ lancaster university.
  • [2] R. P. Adams, I. Murray, and D. J. MacKay.

    Tractable nonparametric bayesian inference in poisson processes with gaussian process intensities.


    Proceedings of the 26th Annual International Conference on Machine Learning

    , pages 9–16. ACM, 2009.
  • [3] A. Belloni, V. Chernozhukov, and L. Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.
  • [4] N. Cressie and C. K. Wikle. Statistics for spatio-temporal data. John Wiley & Sons, 2015.
  • [5] L. Devroye. Sample-based non-uniform random variate generation. In Proceedings of the 18th conference on Winter simulation, pages 260–265. ACM, 1986.
  • [6] P. J. Diggle. A kernel method for smoothing point process data. Journal of the Royal Statistical Society: Series C (Applied Statistics), 34(2):138–147, 1985.
  • [7] P. J. Diggle. Statistics analysis of spatial point patterns. Hodder Education Publishers, 2003.
  • [8] P. J. Diggle. Statistical analysis of spatial and spatio-temporal point patterns. Chapman and Hall/CRC, 2013.
  • [9] P. J. Diggle, P. Moraga, B. Rowlingson, B. M. Taylor, et al. Spatial and spatio-temporal log-gaussian cox processes: extending the geostatistical paradigm. Statistical Science, 28(4):542–563, 2013.
  • [10] S. Flaxman, M. Chirico, P. Pereira, and C. Loeffler. Scalable high-resolution forecasting of sparse spatiotemporal events with kernel methods: a winning solution to the nij" real-time crime forecasting challenge". arXiv preprint arXiv:1801.02858, 2018.
  • [11] J. B. Illian, S. H. Sørbye, and H. Rue. A toolbox for fitting complex spatial point process models using integrated nested laplace approximation (inla). The Annals of Applied Statistics, pages 1499–1530, 2012.
  • [12] S. John and J. Hensman. Large-scale cox process inference using variational fourier features. 2018.
  • [13] O. O. Johnson, P. J. Diggle, and E. Giorgi. A spatially discrete approximation to log-gaussian cox processes for modelling aggregated disease count data. arXiv preprint arXiv:1901.09551, 2019.
  • [14] J. Lei, M. G’Sell, A. Rinaldo, R. J. Tibshirani, and L. Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018.
  • [15] C. Lloyd, T. Gunter, M. Osborne, and S. Roberts. Variational inference for gaussian process modulated poisson processes. In International Conference on Machine Learning, pages 1814–1822, 2015.
  • [16] National Institute of Justice. Real-time crime forecasting challenge posting.
  • [17] H. Rue, S. Martino, and N. Chopin. Approximate bayesian inference for latent gaussian models by using integrated nested laplace approximations. Journal of the royal statistical society: Series b (statistical methodology), 71(2):319–392, 2009.
  • [18] R. Tibshirani, M. Wainwright, and T. Hastie. Statistical learning with sparsity: the lasso and generalizations. Chapman and Hall/CRC, 2015.
  • [19] V. Vovk, A. Gammerman, and G. Shafer. Algorithmic learning in a random world. Springer Science & Business Media, 2005.
  • [20] C. J. Walder and A. N. Bishop. Fast bayesian intensity estimation for the permanental process. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3579–3588. JMLR. org, 2017.
  • [21] L. Wasserman. All of nonparametric statistics. Springer Science & Business Media, 2006.
  • [22] C. K. Williams and C. E. Rasmussen. Gaussian processes for machine learning, volume 2. MIT Press Cambridge, MA, 2006.
  • [23] R. Zhuang and J. Lederer. Maximum regularized likelihood estimators: A general prediction theory and applications. Stat, 7(1):e186, 2018.


Spatial basis

For space divided into regions with each region denoted by its region index , the spatial basis vector evaluated at is an vector given by

Here is the cubic b-spline in space with two parameters: center and support. The component i.e has its center at the center of region and peak value when evaluated at . The support of determines its value at the neighbouring regions and hence allows to control the local structure in intensity in our model. For details on cubic b-spline see [21].

Dual Norm

Let . By definition of dual norm,

The condition implies

where . Moreover,

Combining this with we get

Hoeffding’s inequality for

We show that is bounded in and hence we can make use of Hoeffding’s inequality to get eq. .

The gradient of eq. evaluated at is

where Given that the maximum number of counts is bounded i.e. , we have

for all . Here .

Union bound and DeMorgan’s Law

Given that , from eq. we get


By DeMorgan’s law,

By union bound,

which implies that

Eq. follows from above.

Minimization Algorithm

Here we derive the majorization-minimization (MM) algorithm that is used to solve eq. . Let and . For the Poisson model class considered in the paper,

where . is convex in since


are an basis matrix and a diagonal matrix respectively and

By convexity of , given an initial estimate , the objective in eq. can be upper bounded as

where is a quadratic majorization function (see [18], ch. 5) of given by

Here and . The diagonal elements of represent the average number of counts in different regions. Given that the counts in any region are bounded i.e. , therefore we have


Therefore starting from an initial estimate , one can minimize the right hand side of (SM1) to obtain then update and repeat until convergence to get final solution of eq. (7) . The pseudocode is given in algorithm .

Furthermore, the right hand side of (SM1) can be transformed into a weighted lasso regression problem and hence can be efficiently solved using coordinate descent algorithm [18]. Letting , the right hand side of (SM1) can be rewritten as

where the first two terms form a weighted lasso regression problem in and the last term is independent of and does not affect the minimization problem. This conclude the derivation of the MM algorithm.