1 Introduction
Spatial point processes arise in a range of applications, from astronomy and biology to ecology and criminology. These processes are characterized by a nonnegative intensity function that determines the expected number of events across space [8, 4].
A standard approach to estimating the intensity function of a process is to use nonparametric kernel density-based methods [6, 7]. These smoothing techniques, however, require careful tuning of kernel bandwidth parameters and, more importantly, are subject to selection biases. That is, in regions where no events have been observed, the intensity is inferred to be zero, and no measure is readily available for a user to assess the uncertainty of such predictions. More advanced methods infer the intensity by assuming a parameterized model of the data-generating process, such as inhomogeneous Poisson point process models. One popular model is the log-Gaussian Cox process (lgcp) model [9], where the intensity function is modeled as a Gaussian process [22] via a logarithmic link function to ensure nonnegativity. However, the infinite dimensionality of the intensity function makes this model computationally prohibitive, and substantial effort has been devoted to developing more tractable approximation methods based on gridding [9, 13], variational inference [15, 12], Markov chain Monte Carlo [2], and Laplace approximations [20] for the log and other link functions. A more fundamental problem remains: the resulting uncertainty measures are not calibrated to the actual out-of-sample variability of the number of events across space. Poor calibration consequently leads to unreliable inferences about the process.

In this paper, we develop a spatially varying intensity interval with provable out-of-sample performance guarantees. Our contributions can be summarized as follows:

- the interval reliably covers out-of-sample events with a specified probability by building on the conformal prediction framework [19],
- it is constructed using a predictive spatial Poisson model with provable out-of-sample accuracy,
- its size appropriately increases in regions with missing data to reflect inherent uncertainty and mitigate sampling biases,
- the statistical guarantees remain valid even when the assumed Poisson model is misspecified.
Thus the proposed method yields both reliable and informative predictive intervals under a wider range of conditions than standard methods, which depend on the assumed model, e.g. lgcp [9], matching the unknown data-generating process.
Notation: denotes the sample mean of . The elementwise Hadamard product is denoted .
2 Problem formulation
The intensity function of a spatial process is expressed as the number of events per unit area and varies over a spatial domain of interest, which we equipartition into disjoint regions; this is a common means of modelling continuous inhomogeneous point processes, see [9, 13]. The intensity function determines the expected number of events that occur in each region by
(1) 
where is the region index and is the maximum number of counts.
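To make the counting model of equation (1) concrete, the following sketch simulates region counts from a hypothetical one-dimensional intensity function. The domain, intensity function, and number of regions are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-dimensional domain [0, 100) split into R equal regions.
R = 50
edges = np.linspace(0.0, 100.0, R + 1)
area = edges[1] - edges[0]          # the common region size |A|

def intensity(s):
    """A hypothetical smooth intensity (events per unit length)."""
    return 2.0 + 1.5 * np.sin(s / 10.0)

# Expected count in region i is (approximately) the intensity at the
# region centre times the region area, as in equation (1).
centres = 0.5 * (edges[:-1] + edges[1:])
expected = intensity(centres) * area

# Counts in each region are Poisson with those means.
counts = rng.poisson(expected)
```

Each entry of `counts` plays the role of an observed count paired with its region index, which is exactly the kind of data assumed in (2) below.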
We observe independent samples drawn from the process,
(2) 
where the data-generating distribution is unknown. Let the collection of pairwise datapoints be denoted . Given this dataset, our goal is to infer an intensity interval of the unknown spatial point process, which predicts the number of events per unit area at each location. See Figure 1 for an illustration in one-dimensional space. A reliable interval will cover a new out-of-sample observation in a region with a probability of at least . That is, for a specified level the out-of-sample coverage is
(3) 
where is the area of the th region. Since the trivial non-informative interval also satisfies (3), our goal is to construct an interval that is both reliable and informative.
3 Inference method
We begin by showing that an intensity interval with reliable out-of-sample coverage can be constructed using the conformal prediction framework [19]. Note that obtaining tractable and informative intervals in this approach requires learning an accurate predictor in a computationally efficient manner. We develop such a predictor and prove that it has finite-sample and distribution-free performance guarantees. These guarantees are independent of the manner in which space is discretized.
3.1 Conformal intensity intervals
Let denote a predictor parameterized by a vector . For a given region, consider a new data point, where the count takes a value between zero and the maximum number of counts. The principle of conformal prediction is to quantify how well this new point conforms to the observed data. This is done by first fitting parameters to the augmented set and then using the score
(4)
where is the indicator function and are residuals for all observed data points. When a new residual is statistically indistinguishable from the rest, the score corresponds to a p-value [19]. On this basis, we construct an intensity interval by including all points that conform to the dataset with significance level , as summarized in Algorithm 1. Using [14, thm. 2.1], we can prove that the interval satisfies the out-of-sample coverage (3).
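The conformal construction above can be sketched as follows: compute the conformal p-value of each candidate count and keep those that exceed the significance level. For simplicity of the illustration, the predictor is held fixed rather than refit on each augmented set as Algorithm 1 prescribes, and all data values below are made up.

```python
import numpy as np

def conformal_interval(y_obs, y_pred_obs, y_pred_new, alpha=0.1, y_max=30):
    """Sketch of a conformal interval for the count in a new region.

    y_obs:      observed counts in the n training regions
    y_pred_obs: the model's predicted means for those regions
    y_pred_new: the model's predicted mean for the new region
    (The paper refits the predictor for each augmented set; here the
    predictor is held fixed purely for illustration.)
    """
    resid = np.abs(y_obs - y_pred_obs)
    included = []
    for y in range(y_max + 1):            # candidate counts 0..y_max
        r_new = abs(y - y_pred_new)
        # Conformal p-value: fraction of residuals at least as large.
        p = (np.sum(resid >= r_new) + 1) / (len(resid) + 1)
        if p > alpha:                     # candidate conforms at level alpha
            included.append(y)
    return min(included), max(included)

lo, hi = conformal_interval(
    y_obs=np.array([3, 5, 4, 6, 2, 5, 4, 3, 5, 4]),
    y_pred_obs=np.full(10, 4.1),
    y_pred_new=4.1,
)
```

For these made-up residuals, candidate counts far from the predicted mean receive p-values below the significance level and are excluded, yielding a finite interval of plausible counts.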
While this approach yields reliable outofsample coverage guarantees, there are two possible limitations:

The residuals can be decomposed as , where the term in brackets is the model approximation error and is an irreducible zero-mean error. Obtaining informative intervals across space requires learned predictors with small model approximation errors.

Learning methods that are computationally demanding render the computation of the interval intractable across space, since the conformal method requires refitting the predictor multiple times for each region.
Next, we focus on addressing both limitations.
3.2 Spatial model
We seek an accurate model of the process, parameterized by . For a given parameter value, we quantify the out-of-sample accuracy of a model by the Kullback-Leibler divergence per sample,
(5)
In general, the unknown intensity function has a local spatial structure and can be modeled as smooth, since we expect counts in neighbouring regions to be similar in real-world applications. On this basis, we consider the following class of models,
where the spatial basis vector has components given by the cubic B-spline function [21] (see supplementary material). The Poisson distribution is the maximum entropy distribution for count data and is here parameterized via a latent field across regions [4, ch. 4.3]. Using a cubic B-spline basis [21], we model the mean in each region via a weighted average of latent parameters from neighbouring regions, where the maximum weight is less than 1. This parameterization yields locally smooth spatial structures and is similar to using a latent process model for the mean, as in the commonly used lgcp model [9, sec. 4.1].

The unknown optimal predictive Poisson model is given by
(6) 
and has an out-of-sample accuracy .
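A minimal sketch of this parameterization of the Poisson mean, assuming the standard cubic B-spline kernel and illustrative values for the number of regions, the support, and the latent field (none taken from the paper). Note that the kernel's peak value is 2/3 < 1, consistent with the requirement that the maximum weight be less than one.

```python
import numpy as np

def cubic_bspline(x):
    """Standard cubic B-spline kernel: support |x| < 2, peak value 2/3 at 0."""
    x = np.abs(x)
    out = np.zeros_like(x, dtype=float)
    m1 = x < 1
    m2 = (x >= 1) & (x < 2)
    out[m1] = (4 - 6 * x[m1] ** 2 + 3 * x[m1] ** 3) / 6
    out[m2] = (2 - x[m2]) ** 3 / 6
    return out

R = 20
centres = np.arange(R, dtype=float)   # one basis function centred per region
support = 2.0                         # hypothetical support scale

def basis_vector(s):
    """phi(s): B-spline weights of neighbouring regions at location s."""
    return cubic_bspline((s - centres) / support)

theta = np.full(R, 1.0)               # hypothetical latent field
# Poisson mean in each region: a weighted average of neighbouring latents.
mu = np.array([basis_vector(s) @ theta for s in centres])
```

Widening `support` spreads each basis function over more neighbouring regions and hence yields smoother intensity surfaces, which is the local-structure control the model relies on.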
3.3 Regularized learning criterion
We propose learning a spatial Poisson model in the class above using the following learning criterion
(7) 
where is the log-likelihood, which is convex [18], and is a given vector of regularization weights. The regularization term in (7) not only mitigates overfitting of the model by penalizing the parameters individually; it also yields the following finite-sample and distribution-free result.
Theorem 1
Let , then the out-of-sample accuracy of the learned model is bounded as
(8) 
with a probability of at least
We provide an outline of the proof in Section 3.3.1, relegating the details to the supplementary material. The above theorem guarantees that the out-of-sample accuracy of the learned model (7) will be close to that of the optimal model (6), even if the model class (3.2) does not contain the true data-generating process. As is increased, the bound tightens and the probabilistic guarantee weakens, but for a given data set one can readily search for the value which yields the most informative interval.
The first term of (7) contains inner products formed using a regressor matrix. To balance the fit with the regularizing term in (7), it is common to rescale all columns of the regressor matrix to unit norm. An equivalent approach is to choose the regularization weights accordingly, see e.g. [3]. We then obtain a predictor and a predictive intensity interval via Algorithm 1. Setting the regularization weights to zero in (7) yields a maximum likelihood model with less informative intervals, as we show in the numerical experiments section.
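The criterion (7) can be minimized numerically in several ways; below is a hedged sketch using projected proximal-gradient (ISTA-style) steps on the penalized negative log-likelihood, with an identity basis, step size, and iteration count chosen purely for illustration. The paper's actual solver is the majorization-minimization scheme of Section 3.3.2 (Algorithm 2).

```python
import numpy as np

def fit_l1_poisson(Phi, y, lam, step=0.05, iters=2000):
    """Sketch: minimize (1/n) * sum_i (mu_i - y_i * log mu_i) + sum_j lam[j]*|theta[j]|
    with mu_i = Phi[i] @ theta and theta >= 0, via projected ISTA steps.
    (An illustrative stand-in for the paper's Algorithm 2, not the paper's method.)"""
    n, R = Phi.shape
    theta = np.full(R, y.mean() + 1.0)          # positive initial estimate
    for _ in range(iters):
        mu = np.clip(Phi @ theta, 1e-8, None)   # keep Poisson means positive
        grad = Phi.T @ (1.0 - y / mu) / n       # gradient of the likelihood term
        # Proximal step for the weighted L1 penalty, with nonnegativity.
        theta = np.maximum(theta - step * grad - step * lam, 0.0)
    return theta

# With an identity basis and no regularization, the fit recovers the counts,
# i.e. the unpenalized Poisson maximum likelihood solution.
Phi = np.eye(4)
y = np.array([2.0, 3.0, 1.0, 4.0])
theta_hat = fit_l1_poisson(Phi, y, lam=np.zeros(4))
```

Increasing the entries of `lam` shrinks the latent parameters toward zero individually, which is the overfitting control that the regularization term in (7) provides.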
3.3.1 Proof of Theorem 1
Using the functional form of the Poisson distribution, we have
Then the gap between the out-of-sample and in-sample divergences for any given model is given by
(10) 
where the second line follows from using our Poisson model and is a constant. Note that the divergence gap is linear in , and we can therefore relate the gap of the optimal model to that of the learned model as follows:
(11) 
where is the gradient of (10) and we introduce the random variable for notational simplicity (see supplementary material). Inserting (9) into (11) and rearranging yields
(12) 
where the RHS is dependent on . Next, we upper bound the RHS by a constant that is independent of .
The weighted norm has an associated dual norm, see the supplementary material. Using the dual norm, we have the following inequalities
and combining them with (12), as in [23], yields
(13) 
when . The probability of this event is lower bounded by
(14) 
We derive this bound using Hoeffding’s inequality, for which
(15) 
and . Moreover,
using De Morgan's law and the union bound (see supplementary material). Setting , we obtain (14). Hence, equations (13) and (14) prove Theorem 1. It can be seen that for , the probability bound on the right-hand side increases with .
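For reference, the concentration step invokes Hoeffding's inequality in its standard form for bounded i.i.d. variables; this is the textbook statement, while the paper's exact constants are in the supplementary material.

```latex
% Hoeffding's inequality: if $Z_1,\dots,Z_n$ are i.i.d.\ with $a \le Z_i \le b$,
% then for any $t > 0$,
\Pr\!\left( \Bigl| \tfrac{1}{n}\textstyle\sum_{i=1}^{n} Z_i - \mathbb{E}[Z_1] \Bigr| \ge t \right)
\le 2 \exp\!\left( -\frac{2 n t^{2}}{(b-a)^{2}} \right).
```

The boundedness required here is exactly what the bounded maximum number of counts supplies in (15).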
3.3.2 Minimization algorithm
To solve the convex minimization problem (7) in a computationally efficient manner, we use a majorization-minimization (MM) algorithm. Specifically, we bound the objective in (7) as
(16) 
where is a quadratic majorizing function of the objective that is tight at the current estimate, see [18, ch. 5]. Minimizing the right-hand side of (16) takes the form of a weighted lasso regression and can therefore be solved efficiently using coordinate descent. The pseudocode is given in Algorithm 2; see the supplementary material for details. The runtime of Algorithm 2 scales linearly in the number of datapoints . This computational efficiency of Algorithm 2 is leveraged in Algorithm 1 when updating the predictor with an augmented dataset, which renders the computation of the interval tractable across space.

4 Numerical experiments
We demonstrate the proposed method using both synthetic and real spatial data.
4.1 Synthetic data with missing regions
To illustrate the performance of our learning criterion (7), we begin by considering a one-dimensional spatial domain, equipartitioned into regions. Throughout, we use in (7).
Comparison with log-Gaussian Cox process model
We consider a process described by the intensity function
(17) 
and sample events from a spatial Poisson process model using inversion sampling [5]. The distribution of counts is then Poisson. Using a realization, we compare our predictive intensity interval with a credible interval obtained by assuming an lgcp model [9] and approximating its posterior belief distribution using the integrated nested Laplace approximation (inla) [17, 11]. For the cubic B-splines, the spatial support of the weights was set to cover all regions.

We consider interpolation and extrapolation cases, where the data is missing across two different sets of regions, respectively. Figures 1(a) and 1(b) show the intervals in both cases. While the credible interval is tighter than ours in the missing data regions, it has no out-of-sample guarantees and therefore lacks reliability. This is critically evident in the extrapolation case, where our interval becomes non-informative further away from the observed data regions. By contrast, the credible interval provides misleading inferences in this case.

Comparison with unregularized maximum likelihood model
Next, we consider three different spatial processes, described by different intensity functions. For the first process, the intensity has a single peak; the second process is periodic; and for the third process the intensity grows monotonically across space. In all three cases, the number of events in a given region is drawn using a negative binomial distribution, with mean given by (1) and a fixed number of failures, yielding a dataset. Note that the Poisson model class is misspecified here.

We set the nominal out-of-sample coverage and compare the interval sizes across space as well as the overall empirical coverage, when using the regularized and unregularized criteria (7), respectively. The averages are formed using Monte Carlo simulations.
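The misspecified sampling step can be sketched as follows; the per-region means, dispersion, and sample size are illustrative assumptions. NumPy parameterizes the negative binomial by (n, p), so matching a target mean mu with a fixed number of failures r requires p = r / (r + mu), which gives variance mu + mu^2 / r, i.e. over-dispersion relative to the Poisson model class.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-region means (playing the role of equation (1)) and a
# fixed number of failures r, so counts are over-dispersed w.r.t. Poisson.
mu = np.array([2.0, 5.0, 3.0, 8.0, 1.0])
r = 4.0                                   # hypothetical dispersion parameter

# Match the target means: p = r / (r + mu) gives E[count] = mu.
p = r / (r + mu)
counts = rng.negative_binomial(r, p, size=(10_000, mu.size))
empirical_mean = counts.mean(axis=0)
```

Because the variance exceeds the mean in every region, a Poisson model fit to such counts is misspecified by construction, which is the regime the coverage guarantee is being stress-tested in.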
Figure 1(c) and Table 1 summarize the results of the comparison between the regularized and unregularized approaches for the three spatial processes. While both intervals exhibit the same out-of-sample coverage (Table 1), the unregularized method results in intervals that are nearly four times larger than those of the proposed method (Figure 1(c)) in the missing region.
Table 1: Empirical coverage of the intervals [%]
Proposed | Unregularized
97.05 | 97.37
91.05 | 98.32
81.37 | 95.37
4.2 Real data
In this section we demonstrate the proposed method on two real spatial data sets. In two-dimensional space it is challenging to illustrate a spatially varying interval, so for clarity we show its maximum value, minimum value, and size, as well as compare it with a point estimate obtained from the predictor, i.e.,
(18) 
Throughout we use in (7).
Hickory tree data
First, we consider the hickory tree data set [1], which consists of coordinates of hickory trees in a spatial domain, shown in Figure 2(a), that is equipartitioned into a regular lattice of hexagonal regions. The dataset contains the observed number of trees in each region. The dashed boxes indicate regions inside which data is considered to be missing. For the cubic B-splines, the spatial support was again set to cover all regions.
We observe that the point predictor interpolates and extrapolates smoothly across regions and appears to visually conform to the density of the point data. Figures 2(b) and 2(c) provide important complementary information via the interval, whose upper limit increases in the missing data regions, especially when extrapolating in the bottom-right corner, and whose lower limit rises in the dense regions.
The size of the interval quantifies the predictive uncertainty, and we compare it to the credibility interval obtained using the lgcp model as above, cf. Figures 3(a) and 3(b). We note that the sizes increase in different ways in the missing data regions. For the top missing data region, one interval is virtually unchanged in contrast to the other. While the credible interval appears relatively tighter across the bottom-right missing data regions, it lacks any out-of-sample guarantees that would make the prediction reliable.
Crime data
Next, we consider crime data in the Portland police districts [16, 10], which consists of locations of calls-of-service received by the Portland Police between January and March 2017 (see Figure 4(a)). The spatial region is equipartitioned into a regular lattice of hexagonal regions. The dataset contains the reported number of crimes in each region. The support of the cubic B-spline is taken to be .
The point prediction is shown in Figure 4(a), while Figures 4(b) and 4(c) plot the upper and lower limits of the interval, respectively. We observe that the prediction follows the density of the point pattern well, predicting a high intensity of approximately crimes/ in the center. Moreover, the upper and lower limits of the interval are both high where the point data is dense. The interval tends to be non-informative for regions far away from those with observed data, as is visible in the top-left corner when comparing Figures 4(b) and 4(c).
5 Conclusion
We have proposed a method for inferring predictive intensity intervals for spatial point processes. The method utilizes a spatial Poisson model with an out-of-sample accuracy guarantee, and the resulting interval has an out-of-sample coverage guarantee. Both properties hold even when the model is misspecified. The intensity intervals provide a reliable and informative measure of the uncertainty of the point process. Their size is small in regions with observed data and grows in missing regions further away from data. The proposed regularized learning criterion also leads to more informative intervals compared to an unregularized maximum likelihood approach, while its statistical guarantees render it reliable in a wider range of conditions than standard methods such as lgcp inference. The method was demonstrated using both real and synthetic data.
Acknowledgments
The work was supported by the Swedish Research Council (contract numbers and ).
References
[1] P. J. Diggle @ Lancaster University. https://www.lancaster.ac.uk/staff/diggle/pointpatternbook/datasets/.
[2] R. P. Adams, I. Murray, and D. J. MacKay. Tractable nonparametric Bayesian inference in Poisson processes with Gaussian process intensities. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 9–16. ACM, 2009.
[3] A. Belloni, V. Chernozhukov, and L. Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.
[4] N. Cressie and C. K. Wikle. Statistics for spatio-temporal data. John Wiley & Sons, 2015.
[5] L. Devroye. Sample-based non-uniform random variate generation. In Proceedings of the 18th Conference on Winter Simulation, pages 260–265. ACM, 1986.
[6] P. J. Diggle. A kernel method for smoothing point process data. Journal of the Royal Statistical Society: Series C (Applied Statistics), 34(2):138–147, 1985.
[7] P. J. Diggle. Statistical analysis of spatial point patterns. Hodder Education Publishers, 2003.
[8] P. J. Diggle. Statistical analysis of spatial and spatio-temporal point patterns. Chapman and Hall/CRC, 2013.
[9] P. J. Diggle, P. Moraga, B. Rowlingson, B. M. Taylor, et al. Spatial and spatio-temporal log-Gaussian Cox processes: extending the geostatistical paradigm. Statistical Science, 28(4):542–563, 2013.
[10] S. Flaxman, M. Chirico, P. Pereira, and C. Loeffler. Scalable high-resolution forecasting of sparse spatiotemporal events with kernel methods: a winning solution to the NIJ "Real-Time Crime Forecasting Challenge". arXiv preprint arXiv:1801.02858, 2018.
[11] J. B. Illian, S. H. Sørbye, and H. Rue. A toolbox for fitting complex spatial point process models using integrated nested Laplace approximation (INLA). The Annals of Applied Statistics, pages 1499–1530, 2012.
[12] S. John and J. Hensman. Large-scale Cox process inference using variational Fourier features. 2018.
[13] O. O. Johnson, P. J. Diggle, and E. Giorgi. A spatially discrete approximation to log-Gaussian Cox processes for modelling aggregated disease count data. arXiv preprint arXiv:1901.09551, 2019.
[14] J. Lei, M. G'Sell, A. Rinaldo, R. J. Tibshirani, and L. Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018.
[15] C. Lloyd, T. Gunter, M. Osborne, and S. Roberts. Variational inference for Gaussian process modulated Poisson processes. In International Conference on Machine Learning, pages 1814–1822, 2015.
[16] National Institute of Justice. Real-time crime forecasting challenge posting. https://nij.gov/funding/Pages/fy16crimeforecastingchallengedocument.aspx#data.
[17] H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319–392, 2009.
[18] R. Tibshirani, M. Wainwright, and T. Hastie. Statistical learning with sparsity: the lasso and generalizations. Chapman and Hall/CRC, 2015.
[19] V. Vovk, A. Gammerman, and G. Shafer. Algorithmic learning in a random world. Springer Science & Business Media, 2005.
[20] C. J. Walder and A. N. Bishop. Fast Bayesian intensity estimation for the permanental process. In Proceedings of the 34th International Conference on Machine Learning, pages 3579–3588. JMLR.org, 2017.
[21] L. Wasserman. All of nonparametric statistics. Springer Science & Business Media, 2006.
[22] C. K. Williams and C. E. Rasmussen. Gaussian processes for machine learning. MIT Press, Cambridge, MA, 2006.
[23] R. Zhuang and J. Lederer. Maximum regularized likelihood estimators: A general prediction theory and applications. Stat, 7(1):e186, 2018.
Appendix
Spatial basis
For space divided into regions with each region denoted by its region index , the spatial basis vector evaluated at is an vector given by
Here the cubic B-spline in space has two parameters: center and support. Each component has its center at the center of its region and attains its peak value when evaluated there. The support determines the B-spline's value at the neighbouring regions and hence allows control of the local structure of the intensity in our model. For details on the cubic B-spline, see [21].
Dual Norm
Let . By definition of dual norm,
The condition implies
where . Moreover,
Combining this with we get
Hoeffding’s inequality for
We show that is bounded in and hence we can make use of Hoeffding’s inequality to get eq. .
The gradient of eq. evaluated at is
where . Given that the maximum number of counts is bounded, i.e. , we have
for all . Here .
Union bound and De Morgan's law
Given that , from eq. we get
Moreover,
By De Morgan's law,
By union bound,
which implies that
Eq. follows from above.
Minimization Algorithm
Here we derive the majorization-minimization (MM) algorithm that is used to solve eq. . Let and .
where . The objective is convex in since
Here,
are a basis matrix and a diagonal matrix, respectively, and
By convexity of , given an initial estimate , the objective in eq. can be upper bounded as
where is a quadratic majorization function (see [18], ch. 5) of given by
Here and . The diagonal elements of represent the average number of counts in different regions. Given that the counts in any region are bounded, i.e. , we have
(SM1) 
Therefore, starting from an initial estimate, one can minimize the right-hand side of (SM1), update the estimate, and repeat until convergence to obtain the final solution of eq. (7). The pseudocode is given in Algorithm 2.
Furthermore, the right-hand side of (SM1) can be transformed into a weighted lasso regression problem and hence solved efficiently using a coordinate descent algorithm [18]. Letting , the right-hand side of (SM1) can be rewritten as
where the first two terms form a weighted lasso regression problem and the last term does not affect the minimization. This concludes the derivation of the MM algorithm.
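The weighted-lasso inner step above can be sketched with a plain cyclic coordinate-descent routine; the matrix, response, and penalty weights below are made-up illustrations, and the MM outer loop would simply call this solver repeatedly with updated majorization points.

```python
import numpy as np

def soft_threshold(z, t):
    """Scalar soft-thresholding operator."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def weighted_lasso_cd(A, b, lam, iters=200):
    """Cyclic coordinate descent for 0.5*||b - A w||^2 + sum_j lam[j]*|w[j]|."""
    n, d = A.shape
    w = np.zeros(d)
    col_sq = (A ** 2).sum(axis=0)       # per-column squared norms
    r = b - A @ w                       # running residual
    for _ in range(iters):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue
            r += A[:, j] * w[j]         # put coordinate j back into the residual
            rho = A[:, j] @ r
            w[j] = soft_threshold(rho, lam[j]) / col_sq[j]
            r -= A[:, j] * w[j]
    return w

A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = A @ np.array([1.0, 2.0])
w_ls = weighted_lasso_cd(A, b, lam=np.zeros(2))            # no penalty: least squares
w_sparse = weighted_lasso_cd(A, b, lam=np.full(2, 100.0))  # heavy penalty: all zeros
```

With zero weights the routine reduces to ordinary least squares, while large weights drive every coefficient to zero, mirroring how the regularization weights in (7) trade data fit against shrinkage of the latent field.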