Spatial point processes arise in a range of applications, from astronomy and biology to ecology and criminology. These processes can be characterized by a nonnegative intensity function, which determines the expected number of events that occur across space [8, 4].
A standard approach to estimating the intensity function of a process is to use nonparametric kernel density-based methods [6, 7]. These smoothing techniques, however, require careful tuning of kernel bandwidth parameters and, more importantly, are subject to selection biases: in regions where no events have been observed, the intensity is inferred to be zero, and no measure is readily available for a user to assess the uncertainty of such predictions. More advanced methods infer the intensity by assuming a parameterized model of the data-generating process, such as an inhomogeneous Poisson point process model. One popular choice is the log-Gaussian Cox process (lgcp) model, where the intensity function is modeled as a Gaussian process via a logarithmic link function to ensure non-negativity. However, the infinite dimensionality of the intensity function makes this model computationally prohibitive, and substantial effort has been devoted to developing more tractable approximation methods based on gridding [9, 13], variational inference [15, 12], Markov chain Monte Carlo, and Laplace approximations for the log and other link functions. A more fundamental problem remains: the resulting uncertainty measures are not calibrated to the actual out-of-sample variability of the number of events across space. Poor calibration consequently leads to unreliable inferences about the process.
In this paper, we develop a spatially varying intensity interval with provable out-of-sample performance guarantees. Our contributions can be summarized as follows:
it is constructed using a predictive spatial Poisson model with provable out-of-sample accuracy,
its size appropriately increases in regions with missing data to reflect inherent uncertainty and mitigate sampling biases,
the statistical guarantees remain valid even when the assumed Poisson model is misspecified.
Thus the proposed method yields both reliable and informative predictive intervals under a wider range of conditions than standard methods, which depend on the assumed model, e.g. lgcp, to match the unknown data-generating process.
Notation: denotes the sample mean of . The element-wise Hadamard product is denoted .
2 Problem formulation
The intensity function of a spatial process expresses the number of events per unit area and varies over a spatial domain of interest, which we equipartition into disjoint regions; this discretization is a common means of modelling continuous inhomogeneous point processes, see [9, 13]. The intensity function determines the expected number of events that occur in each region by
where is the region index and is the maximum number of counts.
We observe independent samples drawn from the process,
where the data-generating distribution is unknown. Let the collection of pairwise datapoints be denoted . Given this dataset, our goal is to infer an intensity interval of the unknown spatial point process, which predicts the number of events per unit area at location . See Figure 1 for an illustration in one-dimensional space. A reliable interval will cover a new out-of-sample observation in a region with a probability of at least . That is, for a specified level the out-of-sample coverage is
where is the area of the th region. Since the trivial noninformative interval also satisfies (3), our goal is to construct that is both reliable and informative.
3 Inference method
We begin by showing that an intensity interval with reliable out-of-sample coverage can be constructed using the conformal prediction framework. Note that obtaining tractable and informative intervals in this approach requires learning an accurate predictor in a computationally efficient manner. We develop such a predictor and prove that it has finite-sample and distribution-free performance guarantees. These guarantees are independent of the manner in which space is discretized.
3.1 Conformal intensity intervals
Let denote a predictor parameterized by a vector. For a given region , consider a new data point , where represents the number of counts and takes a value between . The principle of conformal prediction is to quantify how well this new point conforms to the observed data . This is done by first fitting the parameters to the augmented set and then using the score
where is the indicator function and are residuals for all observed data points . When a new residual is statistically indistinguishable from the rest, corresponds to a p-value . On this basis we construct an intensity interval by including all points that conform to the dataset with significance level , as summarized in Algorithm 1. Using [14, thm. 2.1], we can prove that satisfies the out-of-sample coverage (3).
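The procedure described above can be sketched in code as follows. This is a minimal illustration, not the paper's exact algorithm: `fit_predict` stands in for the learning step developed later in the paper, and all names are assumptions.

```python
import numpy as np

def conformal_interval(X, y, x_new, y_grid, fit_predict, alpha=0.1):
    """Full conformal interval for the count in a new region x_new.

    For each candidate count in y_grid, refit the predictor on the
    augmented data, compute residuals, and keep the candidate if its
    p-value exceeds alpha. fit_predict(X_aug, y_aug) must return a
    callable mapping regions to predicted counts; it stands in for
    whatever learning method is used."""
    included = []
    for y_cand in y_grid:
        X_aug = np.append(X, x_new)
        y_aug = np.append(y, y_cand)
        predict = fit_predict(X_aug, y_aug)
        resid = np.abs(y_aug - predict(X_aug))
        # p-value: fraction of residuals at least as large as the new one.
        if np.mean(resid >= resid[-1]) > alpha:
            included.append(y_cand)
    return (min(included), max(included)) if included else None
```

Any interval formed this way inherits the out-of-sample coverage (3) regardless of how accurate the predictor is; the predictor only affects how tight the interval ends up being.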
While this approach yields reliable out-of-sample coverage guarantees, there are two possible limitations:
The residuals can be decomposed as , where the term in brackets is the model approximation error and is an irreducible zero-mean error. Obtaining informative across space requires learned predictors with small model approximation errors.
Learning methods that are computationally demanding render the computation of intractable across space, since the conformal method requires re-fitting the predictor multiple times for each region.
Next, we focus on addressing both limitations.
3.2 Spatial model
We seek an accurate model of , parameterized by . For a given , we quantify the out-of-sample accuracy of a model by the Kullback-Leibler divergence per sample,
In general, the unknown intensity function underlying has a local spatial structure and can be modeled as smooth, since we expect counts in neighbouring regions to be similar in real-world applications. On this basis, we consider the following class of models,
where is a spatial basis vector whose components are given by the cubic b-spline function
(see supplementary material). The Poisson distribution is the maximum entropy distribution for count data and is here parameterized via a latent field across regions [4, ch. 4.3]. Using a cubic b-spline basis , we model the mean in region via a weighted average of latent parameters from neighbouring regions, where the maximum weight in is less than 1. This parameterization yields locally smooth spatial structures and is similar to using a latent process model for the mean, as in the commonly used lgcp model [9, sec. 4.1].
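To make the construction concrete, here is a minimal sketch of such a basis using the standard cubic b-spline kernel. The one-dimensional region indexing and the `support` parameter are illustrative choices, not the paper's exact construction.

```python
import numpy as np

def bspline_weight(t):
    """Standard cubic B-spline kernel, supported on [-2, 2]."""
    t = abs(t)
    if t < 1:
        return (4 - 6 * t**2 + 3 * t**3) / 6
    if t < 2:
        return (2 - t)**3 / 6
    return 0.0

def basis_matrix(n_regions, support=1.0):
    """Row i holds the spatial basis vector for region i: weights that
    average the latent parameters of neighbouring regions, with the
    support parameter controlling the degree of smoothing."""
    return np.array([[bspline_weight((i - j) / support)
                      for j in range(n_regions)] for i in range(n_regions)])
```

Note that the peak weight of this kernel is 2/3 &lt; 1, so each region's mean pools information from its neighbours rather than depending on a single latent parameter, which is what produces the locally smooth structure.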
The unknown optimal predictive Poisson model is given by
and has an out-of-sample accuracy .
3.3 Regularized learning criterion
We propose learning a spatial Poisson model in using the following learning criterion
where is the log-likelihood, which is convex, and is a given vector of regularization weights. The regularization term in (7) not only mitigates overfitting by penalizing the parameters in individually; it also yields the following finite-sample and distribution-free result.
Let , then the out-of-sample accuracy of the learned model is bounded as
with a probability of at least
We provide an outline of the proof in Section 3.3.1 and relegate the details to the Supplementary Material. The above theorem guarantees that the out-of-sample accuracy of the learned model (7) will be close to that of the optimal model (6), even if the model class (3.2) does not contain the true data-generating process. As is increased, the bound tightens and the probabilistic guarantee weakens, but for a given data set one can readily search for the value of that yields the most informative interval .
The first term of (7) contains inner products formed using a regressor matrix. To balance the fit against the regularizing term in (7), it is common to rescale all columns of the regressor matrix to unit norm. An equivalent way is to choose the following regularization weights , see e.g. . We then obtain a predictor as
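A sketch of the resulting criterion, with the regularization weights chosen as the column norms of the regressor matrix. The linear-mean Poisson likelihood below follows the model class of Section 3.2 up to additive constants; the function names are illustrative.

```python
import numpy as np

def regularization_weights(Phi):
    """Per-coefficient weights equivalent to rescaling each column of
    the regressor matrix Phi to unit norm (a common weighted-lasso
    convention)."""
    return np.linalg.norm(Phi, axis=0)

def penalized_nll(theta, Phi, counts, lam, w):
    """Regularized criterion: Poisson negative log-likelihood (up to
    constants) plus a weighted l1 penalty on theta -- a sketch of the
    structure of eq. (7). Assumes the linear mean mu = Phi @ theta of
    the spatial model; mu is clamped away from zero for the log."""
    mu = Phi @ theta
    nll = np.sum(mu - counts * np.log(np.maximum(mu, 1e-12)))
    return nll + lam * np.sum(w * np.abs(theta))
```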
3.3.1 Proof of theorem
Using the functional form of the Poisson distribution, we have
Then the gap between the out-of-sample and in-sample divergences for any given model is given by
where the second line follows from using our Poisson model and is a constant. Note that the divergence gap is linear in , and we can therefore relate the gaps for the optimal model with the learned model as follows:
where is the gradient of (10) and we introduce the random variable for notational simplicity (see supplementary material).
where the RHS is dependent on . Next, we upper bound the RHS by a constant that is independent of .
The weighted norm has an associated dual norm
see the supplementary material. Using the dual norm, we have the following inequalities
when . The probability of this event is lower bounded by
We derive this bound using Hoeffding’s inequality, for which
and . Moreover,
using DeMorgan's law and the union bound (see supplementary material). Setting , we obtain (14). Hence equations (13) and (14) prove Theorem 1. It can be seen that for , the probability bound on the right-hand side increases with .
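For reference, the standard form of Hoeffding's inequality invoked in this step, stated with generic bounded variables since the paper's specific random variable is defined in the supplementary material:

```latex
\Pr\left( \left| \frac{1}{n}\sum_{i=1}^{n} Z_i - \mathbb{E}[Z_1] \right| \ge t \right)
\le 2\exp\left( -\frac{2 n t^2}{(b-a)^2} \right),
\qquad Z_1,\dots,Z_n \overset{\text{i.i.d.}}{\sim} P,\quad Z_i \in [a,b].
```

The boundedness of the counts is what supplies the interval $[a,b]$ here, which is why the probabilistic guarantee is distribution-free.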
3.3.2 Minimization algorithm
The learning criterion (7) takes the form of a weighted lasso regression and can therefore be solved efficiently using coordinate descent. The pseudo-code is given in Algorithm 2; see the supplementary material for details. The runtime of Algorithm 2 scales as , i.e., linearly in the number of datapoints . This computational efficiency of Algorithm 2 is leveraged in Algorithm 1 when updating the predictor with an augmented dataset , rendering the computation of tractable across space.
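A minimal coordinate-descent solver for a weighted lasso subproblem is sketched below. This is an illustration of the inner computation rather than the paper's exact Algorithm 2, and it assumes the columns of `A` are nonzero.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator, the l1 proximal map."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def weighted_lasso_cd(A, b, lam, w, n_iter=200):
    """Coordinate descent for 0.5*||b - A theta||^2 + lam*sum(w*|theta|).
    One sweep over the d coordinates costs O(n*d) operations, hence the
    linear scaling in the number of datapoints."""
    n, d = A.shape
    theta = np.zeros(d)
    r = b.astype(float).copy()        # residual b - A @ theta
    col_sq = (A ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(d):
            r += A[:, j] * theta[j]   # remove coordinate j from residual
            z = A[:, j] @ r
            theta[j] = soft_threshold(z, lam * w[j]) / col_sq[j]
            r -= A[:, j] * theta[j]   # add updated coordinate back
    return theta
```

Maintaining the residual vector incrementally, as above, is what keeps each coordinate update cheap.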
4 Numerical experiments
We demonstrate the proposed method using both synthetic and real spatial data.
4.1 Synthetic data with missing regions
Comparison with log-Gaussian Cox process model
We consider a process described by the intensity function
and sample events from a spatial Poisson process model using inversion sampling . The distribution is then Poisson. Using a realization , we compare our predictive intensity interval with an interval obtained by assuming an lgcp model for the intensity and approximating its posterior distribution using the integrated nested Laplace approximation (inla) [17, 11]. For the cubic b-splines in , the spatial support of the weights in was set to cover all regions.
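The sampling step can be sketched as follows, in one dimension. The intensity function, domain, and grid size are illustrative, and the discretized inverse-CDF construction is one standard way to implement inversion sampling rather than the paper's exact procedure.

```python
import numpy as np

def sample_poisson_process_1d(intensity, domain=(0.0, 100.0), grid=10_000,
                              rng=None):
    """Sample an inhomogeneous Poisson process on an interval by inversion:
    draw the total count from Poisson(integrated intensity), then draw each
    event location by inverting the normalized cumulative intensity."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.linspace(*domain, grid)
    lam = intensity(x)
    cum = np.cumsum(lam)
    # Riemann approximation of the integrated intensity over the domain.
    total = cum[-1] * (domain[1] - domain[0]) / grid
    n = rng.poisson(total)
    u = rng.uniform(size=n)
    # Invert the normalized cumulative intensity at each uniform draw.
    return np.interp(u, cum / cum[-1], x)
```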
We consider interpolation and extrapolation cases, where the data is missing across and , respectively. Figures 1(a) and 1(b) show the intervals in both cases. While is tighter than in the missing-data regions, it has no out-of-sample guarantees and therefore lacks reliability. This is critically evident in the extrapolation case, where becomes noninformative further away from the observed data regions; by contrast, provides misleading inferences in this case.
Comparison with unregularized maximum likelihood model
Next, we consider three different spatial processes, described by the intensity functions
For the first process, the intensity peaks at ; the second process is periodic with a period of spatial units; and for the third process, the intensity grows monotonically with space. In all three cases, the number of events in a given region is drawn as
using a negative binomial distribution with mean given by (1) and the number of failures set to , yielding a dataset . Note that the Poisson model class is misspecified here.
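The misspecified data generator can be sketched as follows. Note that numpy parameterizes the negative binomial by the number of failures `r` and success probability `p` with mean `r*(1-p)/p`, so matching a target mean `mu` requires `p = r/(r + mu)`; the function name is illustrative.

```python
import numpy as np

def sample_counts_negbin(mu, r, rng=None):
    """Draw region counts from a negative binomial with mean mu and r
    failures -- an overdispersed, deliberately misspecified generator
    for the Poisson model class."""
    rng = np.random.default_rng() if rng is None else rng
    mu = np.asarray(mu, dtype=float)
    p = r / (r + mu)
    return rng.negative_binomial(r, p)
```

Since the variance is `mu + mu**2 / r`, strictly larger than the Poisson variance `mu`, this generator exhibits overdispersion that no model in the Poisson class can match exactly.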
We set the nominal out-of-sample coverage and compare the interval sizes across space as well as the overall empirical coverage when using the regularized and unregularized criteria (7), respectively. The averages are formed using Monte Carlo simulations.
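The empirical coverage reported here is simply the fraction of held-out counts falling inside the per-region intervals, i.e., a Monte Carlo estimate of the out-of-sample coverage in (3). A sketch with illustrative names:

```python
import numpy as np

def empirical_coverage(lower, upper, y_new):
    """Fraction of held-out counts y_new that fall inside the per-region
    intervals [lower, upper] (elementwise, one entry per region)."""
    lower, upper, y_new = map(np.asarray, (lower, upper, y_new))
    return np.mean((lower <= y_new) & (y_new <= upper))
```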
Figure 1(c) and Table 1 summarize the comparison between the regularized and unregularized approaches for the three spatial processes. While both intervals exhibit the same out-of-sample coverage (Table 1), the unregularized method results in intervals that are nearly four times larger than those of the proposed method (Figure 1(c)) in the missing region.
Table 1: Empirical coverage of [%].
4.2 Real data
In this section we demonstrate the proposed method using two real spatial data sets. In two-dimensional space it is challenging to illustrate a varying interval , so for clarity we show its maximum value, minimum value, and size, and compare it with a point estimate obtained from the predictor, i.e.,
Throughout we use in (7).
Hickory tree data
First, we consider the hickory tree data set , which consists of the coordinates of hickory trees in a spatial domain , shown in Figure 2(a), that is equipartitioned into a regular lattice of hexagonal regions. The dataset contains the observed number of trees in each region. The dashed boxes indicate regions inside which the data is considered to be missing. For the cubic b-splines in , the spatial support was again set to cover all regions.
We observe that the point predictor interpolates and extrapolates smoothly across regions and appears to visually conform to the density of the point data. Figures 2(b) and 2(c) provide important complementary information using , whose upper limit increases in the missing data regions, especially when extrapolating in the bottom-right corner, and lower limit rises in the dense regions.
The size of the interval quantifies the predictive uncertainty, and we compare it to the credible interval obtained using the lgcp model as above, cf. Figures 3(a) and 3(b). We note that the sizes increase in different ways in the missing-data regions. For the top missing-data region, is virtually unchanged, in contrast to . While appears relatively tighter than across the bottom-right missing-data regions, the credible interval lacks any out-of-sample guarantees that would make the prediction reliable.
Next, we consider crime data from the Portland police districts [16, 10], which consists of the locations of calls-of-service received by the Portland Police between January and March 2017 (see Figure 4(a)). The spatial region is equipartitioned into a regular lattice of hexagonal regions. The dataset contains the reported number of crimes in each region. The support of the cubic b-spline is taken to be .
The point prediction is shown in Figure 4(a), while Figures 4(b) and 4(c) plot the upper and lower limits of , respectively. We observe that follows the density of the point pattern well, predicting a high intensity of approximately crimes/ in the center. Moreover, the upper and lower limits of are both high where the point data is dense. The interval tends to be noninformative in regions far away from those with observed data, as is visible in the top-left corner when comparing Figures 4(b) and 4(c).
We have proposed a method for inferring predictive intensity intervals for spatial point processes. The method utilizes a spatial Poisson model with an out-of-sample accuracy guarantee, and the resulting interval has an out-of-sample coverage guarantee. Both properties hold even when the model is misspecified. The intensity intervals provide a reliable and informative measure of the uncertainty of the point process. Their size is small in regions with observed data and grows in missing regions further away from the data. The proposed regularized learning criterion also leads to more informative intervals compared to an unregularized maximum likelihood approach, while its statistical guarantees render it reliable in a wider range of conditions than standard methods such as lgcp inference. The method was demonstrated on both real and synthetic data.
The work was supported by the Swedish Research Council (contract numbers and ).
-  P. J. Diggle. Point pattern datasets, Lancaster University. https://www.lancaster.ac.uk/staff/diggle/pointpatternbook/datasets/.
-  R. P. Adams, I. Murray, and D. J. MacKay. Tractable nonparametric Bayesian inference in Poisson processes with Gaussian process intensities. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 9–16. ACM, 2009.
-  A. Belloni, V. Chernozhukov, and L. Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.
-  N. Cressie and C. K. Wikle. Statistics for spatio-temporal data. John Wiley & Sons, 2015.
-  L. Devroye. Sample-based non-uniform random variate generation. In Proceedings of the 18th conference on Winter simulation, pages 260–265. ACM, 1986.
-  P. J. Diggle. A kernel method for smoothing point process data. Journal of the Royal Statistical Society: Series C (Applied Statistics), 34(2):138–147, 1985.
-  P. J. Diggle. Statistical analysis of spatial point patterns. Hodder Education Publishers, 2003.
-  P. J. Diggle. Statistical analysis of spatial and spatio-temporal point patterns. Chapman and Hall/CRC, 2013.
-  P. J. Diggle, P. Moraga, B. Rowlingson, B. M. Taylor, et al. Spatial and spatio-temporal log-Gaussian Cox processes: extending the geostatistical paradigm. Statistical Science, 28(4):542–563, 2013.
-  S. Flaxman, M. Chirico, P. Pereira, and C. Loeffler. Scalable high-resolution forecasting of sparse spatiotemporal events with kernel methods: a winning solution to the NIJ "Real-Time Crime Forecasting Challenge". arXiv preprint arXiv:1801.02858, 2018.
-  J. B. Illian, S. H. Sørbye, and H. Rue. A toolbox for fitting complex spatial point process models using integrated nested Laplace approximation (INLA). The Annals of Applied Statistics, pages 1499–1530, 2012.
-  S. John and J. Hensman. Large-scale Cox process inference using variational Fourier features. 2018.
-  O. O. Johnson, P. J. Diggle, and E. Giorgi. A spatially discrete approximation to log-Gaussian Cox processes for modelling aggregated disease count data. arXiv preprint arXiv:1901.09551, 2019.
-  J. Lei, M. G’Sell, A. Rinaldo, R. J. Tibshirani, and L. Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018.
-  C. Lloyd, T. Gunter, M. Osborne, and S. Roberts. Variational inference for gaussian process modulated poisson processes. In International Conference on Machine Learning, pages 1814–1822, 2015.
-  National Institute of Justice. Real-time crime forecasting challenge posting. https://nij.gov/funding/Pages/fy16-crime-forecasting-challenge-document.aspx#data.
-  H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319–392, 2009.
-  R. Tibshirani, M. Wainwright, and T. Hastie. Statistical learning with sparsity: the lasso and generalizations. Chapman and Hall/CRC, 2015.
-  V. Vovk, A. Gammerman, and G. Shafer. Algorithmic learning in a random world. Springer Science & Business Media, 2005.
-  C. J. Walder and A. N. Bishop. Fast Bayesian intensity estimation for the permanental process. In Proceedings of the 34th International Conference on Machine Learning, pages 3579–3588. JMLR.org, 2017.
-  L. Wasserman. All of nonparametric statistics. Springer Science & Business Media, 2006.
-  C. K. Williams and C. E. Rasmussen. Gaussian processes for machine learning, volume 2. MIT Press Cambridge, MA, 2006.
-  R. Zhuang and J. Lederer. Maximum regularized likelihood estimators: A general prediction theory and applications. Stat, 7(1):e186, 2018.
For space divided into regions with each region denoted by its region index , the spatial basis vector evaluated at is an vector given by
Here is the cubic b-spline in space with two parameters: its center and its support. The component has its center at the center of region and attains its peak value when evaluated at . The support of determines its value at the neighbouring regions and hence allows us to control the local structure of the intensity in our model. For details on cubic b-splines, see .
Let . By definition of dual norm,
The condition implies
where . Moreover,
Combining this with we get
Hoeffding’s inequality for
We show that is bounded in and hence we can make use of Hoeffding’s inequality to get eq. .
The gradient of eq. evaluated at is
where . Given that the maximum number of counts is bounded, i.e. , we have
for all . Here .
Union bound and DeMorgan’s Law
Given that , from eq. we get
By DeMorgan’s law,
By union bound,
which implies that
Eq. follows from above.
Here we derive the majorization-minimization (MM) algorithm that is used to solve eq. . Let and . For the Poisson model class considered in the paper,
where . is convex in since
are an basis matrix and a diagonal matrix respectively and
By convexity of , given an initial estimate , the objective in eq. can be upper bounded as
where is a quadratic majorization function (see , ch. 5) of given by
Here and . The diagonal elements of represent the average number of counts in different regions. Given that the counts in any region are bounded, i.e. , we have
Therefore, starting from an initial estimate , one can minimize the right-hand side of (SM1) to obtain , then update and repeat until convergence to obtain the final solution of eq. (7). The pseudocode is given in Algorithm 2.
Furthermore, the right-hand side of (SM1) can be transformed into a weighted lasso regression problem and hence solved efficiently using a coordinate descent algorithm. Letting , the right-hand side of (SM1) can be rewritten as
where the first two terms form a weighted lasso regression problem in and the last term is independent of and does not affect the minimization. This concludes the derivation of the MM algorithm.