Quick inference for log Gaussian Cox processes with non-stationary underlying random fields

03/28/2019 ∙ by Jiří Dvořák, et al. ∙ Inra Charles University in Prague Aalborg University 0

For point patterns observed in natura, spatial heterogeneity is more the rule than the exception. In numerous applications, this can be mathematically handled by the flexible class of log Gaussian Cox processes (LGCPs); in brief, a LGCP is a Cox process driven by an underlying log Gaussian random field (log GRF). This allows the representation of point aggregation, point vacuum and intermediate situations, with more or less rapid transitions between these different states depending on the properties of GRF. Very often, the covariance function of the GRF is assumed to be stationary. In this article, we give two examples where the sizes (that is, the number of points) and the spatial extents of point clusters are allowed to vary in space. To tackle such features, we propose parametric and semiparametric models of non-stationary LGCPs where the non-stationarity is included in both the mean function and the covariance function of the GRF. Thus, in contrast to most other work on inhomogeneous LGCPs, second-order intensity-reweighted stationarity is not satisfied and the usual two step procedure for parameter estimation based on e.g. composite likelihood does not easily apply. Instead we propose a fast three step procedure based on composite likelihood. We apply our modelling and estimation framework to analyse datasets dealing with fish aggregation in a reservoir and with dispersal of biological particles.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 3

page 12

page 15

page 21

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

1.1 Motivation and objective

Inhomogeneity and aggregation (also called clumping or clustering) in spatial point patterns are frequently observed in nature because of spatially-structured covariates and random effects. Being able to infer these characteristics generally contributes to unravel influential underlying processes.

For example, Figure 1 shows the locations of 485 brown rust lesions on a wheat leaf. The lesions were obtained by the dispersal of spores emitted from one mother lesion which we refer to as the point source (the point (0,0) in the figure; further details are given in Section 5). Our aim is to describe how the intensity and the random clumping of lesions depends on the distance to the point source. Another example is given in Figure 2, which shows a part of a dataset consisting of the positions of 706 fish recorded along a boat trace. Here, our aim is to describe how the covariate shown in the figure (the spatially varying depth of water) influences the random organisation of the fish shoals (further details are given in Section 4).

Figure 1: Locations (open circles and crosses) of fungal lesions on a plant leaf (grey shape). The fungal lesions were formed after the dispersal of spores emitted from one lesion located at the intersection of the two dashed lines. The estimation algorithm was applied to the point pattern within the rectangle (open circles; 476 points), whereas locations of lesions outside the rectangle (crosses; 9 points) were not used.
Figure 2: Top panel: positions of fish recorded along the boat trace; only a 200 meter long part of the data is shown, the width of the observation window is 10 meters. Bottom panel: water depth; values range from 5 (dark grey) to 29 meters (light grey).

Inhomogeneity and aggregation in spatial point patterns are usually described in terms of the intensity function and the pair correlation function defined as follows. We consider an observed planar point pattern (such as in Figures 12) as a realization of the restriction of a planar point process to a bounded observation window . Excluding the case of multiple points, we can view as a locally finite random subset of . The points in are called events the points in , and we denote by the number of events within a region (strictly speaking needs to be a Borel set, but we suppress measure theoretical details; Section 2 provides more details on planar point processes). Now, it is assumed that whenever is bounded, the expected number of events in is finite and given by

(1)

and whenever are bounded and disjoint, the covariance between and exists and is given by

(2)

Whilst the intensity function describes the inhomogeneity, the pair correlation function describes the positive () or negative () dependence between pairs of counts; generally speaking, corresponds to aggregation of the events.

Thus, with a real dataset such as those shown in Figures 1-2, inhomogeneity and aggregation can be described by estimating and

. This estimation can be tackled by constructing flexible parametric models for

and and evaluating the most likely values for the model parameters. In this article, and are modelled in the framework of parametric and semiparametric models for spatial Cox processes, where we include first- and second-order non-stationarity into and , respectively. We develop a fast method for parameter estimation based on only and , which enables processing large point pattern data sets.

1.2 Spatial Cox processes

Recall that is a planar Poisson process with intensity function if for any bounded region ,

is Poisson distributed with mean

, and conditioned on , the events in are independent and identically distributed with density proportional to restricted to . Poisson processes provide models for first order inhomogeneity as specified by , whilst they are not flexible enough for modelling second and higher order spatial dependence properties as reflected, for instance, by the pair correlation function; see e.g. the discussion in Møller and Waagepetersen (2007) and the references therein.

In contrast, Cox processes (Cox, 1955) provide a large and flexible class of models for aggregation, where expressions for and are available. Recall that is a planar Cox process driven by a random field if conditioned on is a Poisson process with intensity function . For distinct points , we have

provided these functions are locally integrable, that is, the integrals in (1)–(2) are well-defined. Usually, for specific parametric Cox process models (including the models in this paper), we have positive dependence (), although it is possible to construct Cox processes where for some locations . Møller and Waagepetersen (2007) suggested separating first order inhomogeneity specified by from random effects specified by a residual random field such that for all ,

(3)

This allows us to model independently the intensity and the kind of aggregation which is not explained by the intensity. Typically, when covariate functions are available, a log linear intensity function is assumed:

(4)

where corresponds to an intercept and are real regression parameters. Note that the Cox process driven by has also pair correlation function . In Møller and Waagepetersen (2007) and subsequent work (e.g. Diggle (2013) and the references therein), is assumed to be stationary, that is, its distribution is invariant under translations in ; this implies that depends only on . However, we shall not make this assumption as we will need more flexibility when modelling for datasets as in Figures 12.

1.3 Spatial Cox processes with non-stationary underlying random fields

In this paper, we do not assume that is stationary, because we want to allow cluster sizes and extents to depend on spatial location. This point of view was also taken in Mrkvička (2014) and Mrkvička and Soubeyrand (2017) who considered the detection of different types of inhomogeneities by using shot noise Cox process models, as discussed further in Section 2.2. Here, in contrast, we will adopt the Cox process framework, but will extend it to particularly flexible cases.

Consider a log Gaussian Cox process (LGCP), that is, a Cox process as above when the log residual random field is a Gaussian random field (GRF) such that . Thus, denoting the covariance function of the GRF by

and the variance function by

, the GRF has mean function . Often (including the cases of this paper) or equivalently (as discussed further in Section 2.2). We will consider models where describes the overall ‘average’ inhomogeneity, and controls the spatial varying weights (in the sense of the number of events) of the clumps/clusters. Additionally, inspired by the space transformation approach in Sampson and Guttorp (1992) and subsequent work (e.g. Perrin and Senoussi, 2000; Fouedjio et al., 2015), we also allow the covariance function to be of the form

(5)

Here, is a positive semi-definite function with , denotes the Euclidean distance in , and is a bijective mapping. The specification of and allows us to model the spatial varying shape and scale of clusters. For instance, we can model elliptical shapes by linear functions (Sampson and Guttorp, 1992; Møller and Toftager, 2014), and model a varying spatial scale with or , where is a regular matrix and ; the latter is illustrated later in Section 5 when

is the identity matrix. Moreover, the geostatistical literature

(Cressie, 1993; Chilès and Delfiner, 1999; Stein, 1999) suggests numerous parametric models for the function which corresponds to the correlation function for a stationary and isotropic random field.

In summary, we let be a zero-mean and unit-variance GRF with an isotropic correlation function for , and consider

(6)

where is given by (4). To the best of our knowledge, such LGCPs have only been studied in the literature when is constant and is the identity mapping.

1.4 Outline and software

The remainder of this paper is organized as follows. Section 2 provides the needed background material on Cox processes and parametric inference procedures, and summarizes the attractive properties of LGCPs. Section 3 presents our proposal for performing rapid estimation of parameters of a LGCPs with non-stationary underlying random field using a three step procedure based on composite likelihoods. Our approach is illustrated in Section 4 in a case study concerning fish aggregation (Figure 2), and in Section 5 in a case study dealing with the dispersal of particles from a point source (Figure 1). Section 6 summarizes our findings in Sections 35. Finally, Appendix A–C provide details related to the data in Figure 1 and the space transformation function .

Computations were carried out with the R statistical software, including the packages spatstat, RandomFields and GET. Codes for running the estimation procedure are available upon request.

2 Background

Henceforth, we assume that is a Cox process driven by a random intensity function as in (3

) and a parametric statistical model has been specified in terms of an unknown parameter vector

, where is used to parametrize , to parametrize the distribution of , and and are variation independent. Thus, depends only on .

2.1 Composite likelihood estimation

Suppose is a bounded observation window and a realization has been observed. Due to the computational obstacles (especially if is large) related to the calculation of the likelihood function, computationally easier approaches for statistical inference based on and have been suggested, see the review in Møller and Waagepetersen (2017). This includes in particular first and second order composite likelihoods (Guan, 2006; Waagepetersen, 2007; Tanaka et al., 2008; Waagepetersen and Guan, 2009; Guan et al., 2015; Zhuang, 2015). The first-order log composite likelihood is given by

(7)

This is also known as the Poisson log likelihood and it can be interpreted as the composite log likelihood for binary indicators of the presence of events in cells of an infinitesimal partition of (Schoenberg, 2005; Waagepetersen, 2007; Guan and Loh, 2007; Møller and Waagepetersen, 2007). Further, for a tuning parameter , the second-order log composite likelihood is given by

(8)

and it can be interpreted as the log composite likelihood for the binary indicators of simultaneous occurrence of events in -close pairs of distinct cells and (Waagepetersen, 2007; Møller and Waagepetersen, 2007).

The following two step procedure is widely used for finding estimates of (Waagepetersen and Guan, 2009; Prokešová et al., 2017; Mrkvička et al., 2014). First, is obtained by maximizing the first-order log composite likelihood (7); the log-linear form for (see Equation (4)) allows a fast estimation of by using spatstat. Second, is obtained by maximizing the second-order log composite likelihood .

2.2 Pros of using log Gaussian Cox processes

The two main classes of Cox processes are log Gaussian Cox processes (LGCPs; see Section 1.3) and shot noise Cox processes (SNCPs; see Møller (2003), Møller and Waagepetersen (2004), and the references therein). For a SNCP driven by (3), the residual random field is of the form:

where is a planar Poisson process with intensity function , and where is a non-negative function such that or equivalently

(9)

Assuming is integrable, then conditioned on is distributed as , where is a finite Poisson process with intensity function , and where these Poisson processes for all are independent. Therefore is called the centre process and the cluster with centre . Moreover,

(10)

provided the integral exits, and then . In general, unless is stationary or equivalently is constant and for all , the integrals in (9)–(10) are not expressible on closed form; and even more complicated integrals appear for the third and higher order joint intensities. For example, Mrkvička and Soubeyrand (2017) considered non-stationary SNCPs, for which they needed to evaluate the integrals by numerical methods, and they proposed a Bayesian approach based on a MCMC algorithm. On one hand, their approach provides a detailed analysis of uncertainty and, on the other hand, it requires very time-consuming calculations.

In the aim of proposing fast composite likelihood based estimation procedure for second-order non-stationary Cox processes, we find LGCPs as given by (3) and (6) attractive for the following reasons. First, the pair correlation function is in a simple one-to-one correspondence to the covariance function of the Gaussian process :

Thus, the distribution of is uniquely determined by and . This is in line with the first and second order composite likelihood approach. Second, edge effects are not playing a role as the distribution of restricted to a bounded observation window is determined by restricted to together with the distribution of the Gaussian process restricted to . Finally, a LGCP possesses the following attractive properties, although these will not be exploited in this article: the reduced Palm distributions of a LGCP are also LGCPs with unchanged pair correlation function (Coeurjolly et al., 2017), and third and higher order intensities are easily expressed in terms of and (Møller et al., 1998).

3 Parameter estimation and model check for Cox processes driven by non-stationary random fields

Throughout this section, for specificity and simplicity, we assume that is a LGCP governed by (3) and (6). However, the description of the estimation procedure in Section 3.1 can easily be modified to other cases of parametric models for Cox processes driven by non-stationary random fields, including the case of SNCPs.

3.1 A three step composite likelihood based estimation approach

For parametric models of LGCPs with the non-stationary covariance structure introduced in Section 1.3, particularly in connection to the analysis of the data presented in Sections 4.1 and 5.1, we noticed after various experiments (not included in this article) that the two step procedure presented above yields strongly biased estimates, in particular for the parameters governing the second-order non-stationarity. In this section, we therefore suggest a three step procedure based on composite likelihoods that circumvents the non-stationarity issue as follows.

We split the dataset with respect to a spatial covariate function, say , which will be used to delineate disjoint areas of over which the point process is approximately second-order stationary. The three step procedure has to be adapted to the model of interest and, here, we only provide its skeleton. Let be a sequence of real numbers, and let for . We have in mind that are chosen such that (i) is approximately second-order stationary over , (ii) contain approximately the same numbers of points and (iii) .

Now, define the approximate second-order log composite likelihood restricted to and based on a second-order stationary assumption by

(11)

Here, each is a tuning parameter that may depend on , is the pair correlation function obtained by assuming that the covariance structure is stationary (that is, is constant and ), and is the corresponding set of parameters (typically the constant value of and the range parameter arising in the correlation function ). In addition, suppose the parameter related to the covariance structure is a vector , where contains the components of that will be estimated via the approximate second-order composite likelihood applied to the restricted point processes , and contains the components of that will be estimated via the second-order composite likelihood applied to the whole point process .

Then the estimation procedure consists of the three following steps.

  1. If the first-order intensity is modelled by a function parameterized by (like the log-linear function in Equation (4)), is obtained by maximizing the first-order log composite likelihood (7):

    Otherwise, a kernel-smoothing approach is used to estimate .

  2. The part of is estimated by firstly estimating for each via the maximization of the approximate second-order log composite likelihood (11):

    and by secondly fitting a regression with

    as the response variable and the average value of

    in , namely , as the explanatory variable to estimate each component of . In Sections 4 and 5, we show how these regressions are implemented in practice.

  3. After the transformation of the point pattern and the observation window by , the remaining part of is estimated by maximizing the second-order log composite likelihood (8) with respect to and by ignoring the estimate obtained for :

Remark 1: If no space deformation is assumed in the model, steps 2 and 3 can be reversed and, in this case, the estimate of might be plugged in for estimating .

Remark 2: A simulation study assessing the estimation efficiency is generally required to select the components of appearing in and , and to select the tuning parameters , and .

Remark 3: Using the second-order composite likelihood as if the point pattern is second-order stationary and isotropic provides a computational advantage: the double integral in the composite likelihood criterion (11) is then of the form

and can be computed as the Lebesgue-Stieltjes integral

(12)

where . Note that does not depend on the parameter . Thus, is calculated only once and only the integral is computed in each step of the iterative optimization of the composite likelihood criterion.

3.2 Test based on global envelopes of the inhomogeneous -function

Our fitted models were checked with plots of global envelopes and accompanying tests (Myllymäki et al., 2017; Mrkvička et al., 2017). The global envelopes can be based on various functional summaries of point processes, e.g. topological characteristics, morphological characteristics, and third-order characteristics. These statistics can be viewed as appropriate for model checking because they are not used in the estimation procedure considered above. However, they do not directly reflect the dependence of the point pattern structure on spatial covariates, although such a dependence should be taken into account in model checking of point process models with a spatially changing structure.

To reveal the dependence on a given spatial covariate in connection to a LGCP, we have used the inhomogeneous -function (Van Lieshout, 2011) computed on several subwindows, which are determined according to the spatial covariate. For example, in the group dispersal model, the subwindows are generated with respect to the distance from the source point emitting the particles. The first subwindow is the inner disk centred around the source point, the second subwindow is the subsequent circular ring, and so on. Note that only the point lying in each subwindow were used for the computation of each inhomogeneous -function.

4 Fish aggregation

4.1 Data and model

The point pattern studied in this section is part of a larger dataset studied in Mrkvička et al. (2014) and Muška et al. (2018). It consists of the locations of 706 fish observed during a summer day along the trace made by a boat in the middle part of the inland water reservoir Rimov in Czech Republic. The trace is 916 meter long and the boat crossed the river twice. The fish locations were recorded within a distance of 10 to 20 meters from the boat and in a depth of 1 to 1.75 meters. The data we analysed are given by the -coordinates of the 3D-locations of fish, where the observation window is of size meters (note that is larger than the region shown in Figure 2). In fact, as the reservoir is thermally stratified, the majority of fish are in the upper 5 meters of the water column during the summer, and the horizontal distance is considered to be more important than the vertical distance between fish (Jarolím et al., 2010). The aim is to study the effect of a spatial covariate, namely the water depth denoted by , , on the distribution of fish shoals and, more exactly, on the dispersion of this distribution.

We used a semiparametric LGCP, with an intensity function estimated in a non-parametric way (as detailed below) and where the ingredients of the covariance function in (5

) is specified as follows. The standard-deviation depends linearly on the spatial covariate:

(13)

where and are real parameters such that for all . Further, there is no space transformation, that is, is the identity function. Finally, the auto-correlation function is exponential with parameter :

This LGCP can generate point patterns corresponding to non-stationary aggregation of fish, with plenty of small shoals or a few large shoals or something in between depending on the covariate values and the parameter values.

4.2 Implementation of the three step procedure

We adapted the three step estimation procedure of Section 3.1, with and , and with steps as follows.

Step 1: For the estimation of the intensity function , we used a kernel smoothing method with a large bandwidth (Gaussian kernel with standard deviation equal to 50 meters) in order to allow the estimated function to vary along the boat trace.

Step 2: For the estimation of and , we started by letting , and defined disjoint subwindows as areas corresponding to different ranges of water depth as determined by the values of . We chose these values such that the subwindows contain approximately the same number of points, e.g.  contains approximately 141 points lying in the most shallow water. Then, was estimated by maximizing the approximate second-order log composite likelihoods based on . Let and denote the estimates of and , respectively, and let denote the average value of over . By assuming that the function is approximately constant (and approximately equal to ) over , the standard-deviation function can be approximated by

Consequently, we used as a natural estimate of

. Finally, the linear regression model

was fitted to data and yielded the estimates of and .

When applying the same value of tuning parameters in all subwindows, we observed that the estimates of and were rather robust with respect to the choice of in the range from 6 to 11 meters. In order to include as many pairs of points as possible into the composite likelihood criterion (recall that each subwindow contains only approximately 141 points), we finally decided to use the value meters for all subwindows.

Step 3: For the estimation of , we considered as fixed and maximized the second-order composite likelihood (8) with respect to only, thereby obtaining an estimate . Note that since is non-constant, we could not take advantage of the Lebesgue-Stieltjes integral (12) but needed in each iteration step to calculate the double integral in (8). For the choice of the tuning parameter , preliminary experiments showed that the estimates of were rather stable for between 5 and 15 meters. Since the width of the observation window is only 10 meters, it does not seem reasonable to use a very large value of . We decided to use the same value as in the second step, that is, meters.

4.3 Results

Having estimated the intensity function in the first step of the estimation procedure, it is straightforward to test the null hypothesis of no interaction between the individual fish by performing the rank envelope test and to construct the envelope based on simulations of the fitted inhomogeneous Poisson process. Figure 

3

shows the simultaneous 95% envelope constructed from 9999 simulated curves, when the test statistic

is given by concatenating together the five inhomogeneous -functions estimated in the subwindows . The data curve (solid black line) leaves the envelope for four of the five subwindows, and the -value is below 2%, so we reject the null hypothesis of no interaction between the fish. Figure 3 also shows that the level of clustering is bigger for shallower subwindows.

Figure 3: Rank envelope test for the null hypothesis of no interaction for the fish aggregation dataset.

In the second step of the estimation procedure, a linear regression of against average water depths, see Figure 4, yields the estimates and of the parameters governing the second-order non-stationarity of the point process. By ignoring uncertainties in the estimates , the parameter is significantly different from zero (-value=0.015; value obtained with the R function glm).

Figure 4: Estimated values of plotted against the mean value of the covariate in the subwindow , together with the fitted regression line.

Table 1 provides the estimates of the model parameters, including obtained in the third step. The table also specifies central 90%-percentile intervals of the estimates obtained with a parametric bootstrap approach (Efron and Tibshirani, 1993) where we re-estimated the model parameters from each of 1000 simulated point pattern datasets obtained under the fitted LGCP. Using this parametric bootstrap approach here is more demanding than for the particle-dispersal application considered later, since the double integral in (8) needs to be evaluated at each iteration of the maximization procedure required by the 3rd step. A typical computation time took about a minute on a regular desktop computer for datasets with about 700 points.

In addition to Table 1, Figure 5 shows the empirical marginal distributions of the estimates obtained from the 1000 simulated datasets. The regression parameters and governing the variance structure of the GRF have empirical distributions with a high concentration around their estimated values and a small amount of extremes. The empirical distribution of the scaling parameter includes a high proportion of values near zero, corresponding to a long-range dependence in the GRF and hence also in the LGCP, cf. (6).

Goodness-of-fit of the fitted LGCP was assessed with the global envelope test described in Section 3.2. For this we used 9999 simulations to obtain the envelopes of the concatenated inhomogeneous -functions computed over the five subwindows . The fitted LGCP model is not rejected by the test at the 5% risk level (-interval=[0.083,0.097]); see Figure 6. The central (dotted) curve of the global envelope test reveals a decrease in the level of clustering with respect to increasing water depth.

Parameter Estimate 90%-CI
1.35 [-0.17 , 1.81]
-0.033 [-0.057 , 0.076]
1.02 [0.10 , 1.22]
Table 1: Estimated parameter values of the LGCP fitted to the fish aggregation dataset, together with their central 90%-percentile intervals obtained by a parametric bootstrap approach as discussed in Section 4.1.
Figure 5: Parameter estimates of the LGCP fitted to the fish aggregation dataset (grey vertical lines) and their bootstrap-based marginal distributions (histograms).
Figure 6: Rank envelope test to assess the goodness-of-fit of the LGCP fitted to the fish aggregation dataset. Envelopes and the -interval were computed with 9999 simulations.

We investigated the efficiency of our estimation procedure for 8 series of 1000 simulations corresponding to the 8 combinations of parameter values obtained when , , and each takes two values, see Table 2. Note that the function is constant when (Series 1, 2, 5, and 6) and varying when (Series 3, 4, 7, and 8), and our choice of parameter values was in part constrained by the range of the covariate values (5 to 29 meters) and made to avoid negative values for or values close to zero for . Further, in the simulations, we used the same observation window and the same covariate describing the water depth as for the fish aggregation dataset, and also we applied the same estimation procedure. In particular, we used for all . Figure 7 provides estimation results for each series of simulations. The regression parameters and governing the spatially structured variance function of the random field are estimated with negligible bias whatever the value of , and with rather small variability when the spatial correlation of the random field is small (; Series 5–8). On the other hand the estimation of is more difficult. This is not surprising because is estimated in the third estimation step and relies on the estimated values of and .

Series of simulations
1 2 3 4 5 6 7 8
0.25 0.25 0.25 0.25 0.50 0.50 0.50 0.50
1.25 1.50 1.25 1.50 1.25 1.50 1.25 1.50
0 0 -0.03 -0.03 0 0 -0.03 -0.03
Table 2: Parameter values used in the simulation study related to the estimation procedure for the fish aggregation dataset.

Table 3 shows the results of the test of the hypothesis based on the linear regression of against the covariate used to model (namely, the average water depths in the application). This test, which is a way to assess the dependence of the second order structure of the LGCP model on the specified covariate, was applied to the 8 series of 1000 simulations with different levels of significance (1%, 5% and 10%). Based on the results for Series 1, 2, 5, and 6, the actual significance level of the test is approximately equal to or lower than the nominal significance level, that is, the test is slightly conservative. The power of the test, assessed with Series 3, 4, 7, and 8, is increased when the spatial auto-correlation is larger at shorter distances (smaller ), but could certainly be improved with a test better accounting for the propagation of uncertainties in the first two steps of the estimation procedure.

Figure 7:

Summaries of the empirical distributions of parameter estimates for the LGCP with the variance of the random field depending on a covariate and fitted to 8000 simulated datasets which are split into 8 series of simulations corresponding to different parameter values (see main text). Stars: true parameter values. Circles: trimmed means of parameters obtained by removing the lower and upper 5% of estimates (the trimming was done to moderate the effect of extreme estimates). Triangles: endpoints of 90%-confidence intervals.

Significance Series of simulations
level 1 2 3 4 5 6 7 8
0.01 6 10 56 57 4 5 42 40
0.05 40 50 236 254 33 23 196 156
0.10 77 101 359 410 62 64 305 284
Table 3: Numbers of rejections (over 1000) in the test of applied to the 8 series of simulations with different significance levels.

5 Particle dispersal

5.1 Data and model

In this section, we are interested in point patterns generated by the deposit locations of biological windborne particles (such as seeds, pollen grains and pathogenic fungal spores) dispersed from plants. In general, for an isolated plant releasing particles, the point pattern formed by the deposit locations of the particles is denser around the source plant than far from it. In other words, the intensity of points generally decreases with the distance from the source along radial directions (Austerlitz et al., 2004; Rieux et al., 2014; Soubeyrand et al., 2007b; Tufto et al., 1997). This large-scale inhomogeneity may be augmented by small-scale variations due, for instance, to group dispersal (Soubeyrand et al., 2011): groups of particles are transported independently, but the particles of a given group are deposited at dependent locations and form a cluster of points whose spatial extent is smaller at shorter distances (dependencies between the particles of a group vanishes at longer distance because of the accumulation of air turbulence; see Appendix A for an extended explanation). In such a case, deposit locations of particles form an inhomogeneous point pattern with spatially varying local aggregation, which may be modelled by a LGCP with a non-stationary auto-correlation function for the GRF. We used this modelling framework and the accompanying estimation procedure to the point pattern in Figure 1 formed by the deposit locations of fungal spores dispersed from a point source located at the origin (details about this data set can be found in Lannou et al., 2008). Here, we assume a log linear intensity as in (4) with , , and . Such a specification corresponds to the so-called exponential dispersal kernel , and it means that the intensity of deposit locations of particles decreases exponentially with distance from the point source when . Let be a constant function, and use the space transformation

(14)

where is a non-negative parameter for modelling an eventual increase in the spatial extent of clusters with the distance from the source point. Finally, assume an exponential correlation function of the GRF:

(15)

where is a positive parameter.

5.2 Implementation of the three step procedure

We adapted the three step estimation procedure of Section 3.1 to the LGCP described above incorporating the parametric space transformation (14). Here we let and .

Step 1: For the estimation of and , we maximized the first-order composite likelihood to obtain estimates and .

Step 2: For the estimation of , we started by defining subwindows based on the distance from the point source by setting so that is the inner disk and are the subsequent rings; more details are given below. Then, for each , we estimated by maximizing the approximate second-order log composite likelihoods based on . Let denote the estimate of . Assuming that the function is approximately equal to a constant on each subwindow , the covariance function of can be approximated for by

Hence, we first estimated by and second, to obtain an estimate of , we fitted the non-linear regression model to the data by minimizing with respect to and . For our dataset, we chose ; since only 5 points then are entering the regression, we have used absolute difference in order to increase robustness of the regression estimate. We applied the following rules of thumb for selecting and : were selected such that contain approximately the same numbers of points and ; for , we selected such that about 50% of the observed pairs of points in were used in the computation of the partial approximate second-order log composite likelihood (11).

Step 3: For the estimation of and we used the transformed process where . The motivation was that, assuming the true value of is known, is a second-order intensity-reweighted stationary LGCP (Baddeley et al., 2000) with known form of the intensity function and the pair-correlation function. See Appendix B for a detailed derivation and Figure 8 for the original pattern and the pattern transformed by , where is given in Table 4. We approximated the intensity function of in the observation window by a log-linear function of and estimated it by the first-order composite likelihood. We used this estimated intensity function to construct the second-order composite likelihood criterion for estimation of and . Again we chose the value of so that about 50 % of the observed pairs of points were used in the criterion.

Figure 8: Top: Point pattern (open circles) analysed in the particle dispersal application and location of the source of dispersed particles (cross). Bottom: transformed point pattern .

5.3 Results

Table 4 provides the estimates of (transformed) model parameters obtained with the three step procedure. It also gives their 90%-percentile intervals obtained with the same parametric bootstrap procedure as in Section 4.3; this was fast: it took typically a few seconds on a regular desktop computer for datasets with about 500 points. Figure 9 shows the marginal distributions of estimates obtained from the 1000 simulated datasets, and it shows the estimated space transformation function and its pointwise 90%- and 50%-confidence envelopes. We notice a rather large estimation variability for second-order parameters . The goodness-of-fit of the model was assessed with the global envelope test described in Section 3.2, using 9999 simulations to obtain the envelopes of the concatenated inhomogeneous -functions computed over the five subwindows . The fitted LGCP model is not rejected by the test at the 5% risk level (-interval=[0.095,0.102]); see Figure 10. The observed curve indicates repulsion in the first two subwindows. This is probably a consequence of the non-negligible size of the particles, an assumption that is not accounted for in our model.

Name of transformed parameter Parameter expression Estimate 90%-CI
Infection strength 1760 [1140 , 2820]
Dispersal range parameter 0.62 [0.44 , 0.98]
Random field standard-deviation 0.73 [0.55 , 3.50]
Random field scale 0.15 [0.02 , 34.3]
Space transformation parameter 0.93 [0.42 , 5.54]
Table 4: Estimation of parameters of the LGCP with space transformation fitted to the particle dispersal dataset. Original parameters are transformed to obtain interpretable quantities. The infection strength gives the expected total number of spores released by the source. The dispersal range parameter is half the mean dispersal distance of spores. The random field scale parameter is the distance at which the auto-correlation is when there is no space transformation.
Figure 9: Estimates of transformed parameters of the LGCP with space transformation fitted to the particle dispersal dataset (vertical lines) and their bootstrap-based marginal distributions (histograms); expressions of transformed parameters are provided in Table 4. The bottom right panel provides the estimated space transformation function (solid curve) and its pointwise 90%- and 50%-confidence envelopes (dotted and dashed lines, respectively).
Figure 10: Rank envelope test to assess the goodness-of-fit of the LGCP with space deformation fitted to the particle dispersal dataset. Envelopes and the -interval were computed with 9999 simulations.

We investigated the efficiency of our estimation procedure for 8 series of 1000 simulations performed with 8 different sets of parameter values. In all series, infection strength and dispersal range were equal to , whilst the values of are given in Table 5. Figure 11 provides estimation results and the distribution of the number of points for each simulation series. The estimation of the infection strength and the dispersal range is quite accurate except when the GRF is strongly variable () and its spatial correlation is large (). The latter situation of the GRF can generate large clusters of points in the LGCP away from the source of particles and negatively impact the estimation accuracy of the spatial trend in the intensity of points, which is supposed to be decreasing with the distance from the source of particles. Moreover, if the GRF’s standard-deviation is appropriately estimated, we observe rather large estimation variability and possible biases for and , the parameters arising in the non-stationary auto-correlation function. Overall, the inference is relatively accurate when the GRF’s variance is large () and the spatial deformation is significant (); i.e., series 6 and 8. In contrast, for a rather flat GRF or when the deformation is not so significant, the estimation of parameters governing the second-order structure is clearly uncertain due to a complex model and the limited amount of information in the data.

Series of simulations
1 2 3 4 5 6 7 8
0.5 2 0.5 2 0.5 2 0.5 2
0.05 0.05 0.2 0.2 0.05 0.05 0.2 0.2
1 1 1 1 10 10 10 10
Table 5: Parameter values used in the simulation study related to the estimation procedure for the particle dispersal dataset.
Figure 11: Estimation of transformed parameters of the LGCP with space transformation fitted to 8000 simulated datasets split into 8 series of simulations corresponding to different parameter values (see main text); expressions of transformed parameters are provided in Table 4. Stars: true parameter values. Circles: trimmed means of parameters obtained by removing the lower and upper 5% of estimates (the trimming was done to moderate the effect of extreme estimates). Triangles: endpoints of 90%-confidence intervals. The bottom right panel provides the distribution of the number of points in for each simulation series. This number of points was constrained to be between 200 and 1000.

6 Concluding remarks

In this article, we modelled inhomogeneity in spatial point patterns by a Cox process driven by a second-order non-stationary log GRF. We presented a first example where the non-stationarity was obtained by modelling the variance of the GRF as a linear function of an explanatory variable. In a second example, the non-stationarity was obtained via a spatial parametric deformation, which enables the generation of point clusters with spatially varying spatial extents. In this framework, estimating model parameters, including those that govern the second-order non-stationarity, is challenging, especially with a moderate number of points (a few hundreds in our examples). We developed and tested a three step estimation approach based on first- and second-order composite likelihoods. The second-order composite likelihood has to be specifically adapted to the non-stationarity incorporated into the model. The advantages of this estimation procedure are the reduced computation time and the reduced implementation complexity. Reduced computation time allowed us to implement parametric bootstrap for assessing estimation uncertainty, and computationally intensive model check approaches, namely global envelope tests, for assessing goodness-of-fit.

The performance of the estimation approach was evaluated with two simulation studies performed in settings inspired by the two case studies. These simulation studies highlight situations where a satisfactory estimation accuracy can be obtained and, conversely, situations where estimation performance is poor. Poor estimation mostly translates into large estimation variance and, from a general point of view, arises when variations in the underlying random field are weak or the spatial deformation is not so significant (due to a complex model and the limited information contained in the data). A part of the estimation poorness can also be attributed to the choice of a three steps procedure, which neglects uncertainty propagation. However, this procedure definitely allows to reduce computation time. Furthermore, it has to be noted that tuning parameters used in the calculation of the second-order composite likelihood have to be appropriately specified to reach satisfactory estimation performance (see Sections 4.2 and 5.2). These tuning parameters can be chosen via a preliminary simulation study.

In order to graphically explore the dependence of the second order structure on a given covariate we utilized the global envelope test of independence of points locations, which was performed on first order inhomogeneous -function computed across several subwindows which respect to the value of the given covariate (e.g. Figure 3).

We also introduced an approximate test of dependence of the second order structure on a given covariate based on regression of second order parameters estimated across the subwindows which respect to the value of the given covariate. Our simulation study shows that this test is slightly conservative, which is acceptable in practise.

As discussed in the first paragraph of this section, the two studied examples give two different approaches for modelling complex point process structures. But the way of modelling the second-order non-stationarity is not the only difference. Indeed, the first order intensity is described by a nonparametric model in the fish application, and a parametric model in the particle dispersal application. This choice is related to a generic difficulty in modelling, which has to be carefully investigated for every particular dataset: how to divide data variability between the first order intensity and the second order structure. In the case of a nonparametric first order intensity, this choice is done in the selection of the bandwidth. In the case of a parametric first order intensity, this choice is constrained by the flexibility of the parametric model. For instance, using a small bandwidth in the former case can cause overfitting of the first order intensity with no detection of second order structure; and using a too simple parametric model of the intensity in the latter case can induce a bias, which then impacts the estimation of the second order structure. Such issues generally rely on the usual modelling strategy of the analyst but should also rely on the classical process of statistical analysis consisting of defining a class of models, selecting a model based on the data (including parameter estimation), checking model fit, and possibly re-defining the class of models for starting again the process and finally drawing conclusions (McCullagh and Nelder, 1989, Chapter 12).

Finally, from a prospective point of view, with increasing amount of the gathered data, there is increasing demand for the more complex modelling. The next level for complexity of the presented models is certainly the shift into spatio-temporal modelling framework of e.g. generations of the infection spread or several spatial snapshots of the point pattern gathered in different times.

Acknowledgements

Supported by The Danish Council for Independent Research — Natural Sciences, grant DFF – 701400074 ‘Statistics for point processes in space and beyond’, by the ‘Centre for Stochastic Geometry and Advanced Bioimaging’, funded by grant 8721 from the Villum Foundation, by Grant Agency of Czech Republic (Project Nos. 19-04412S).

Appendix A Small-scale variations in particle dispersal

Section 5 deals with point patterns generated by the deposit locations of biological windborne particles dispersed from plants. As indicated in the main text, the intensity of points generally decreases with the distance from the source along radial directions. Nevertheless, in particular cases, the intensity of points along radial directions may be non-monotonic in the vicinity of the source (Stoyan and Wagner, 2001). This leads to a large-scale inhomogeneity in the point pattern formed by the deposit locations of the particles. In addition, the inhomogeneity of the point process may be augmented by small-scale variations due, for instance, to dependencies in the dispersal of particles (Soubeyrand et al., 2011) or to a spatial heterogeneity of the environment (Soubeyrand et al., 2007a).

Consider a situation where aggregation is due to dependencies in the dispersal of particles, cf. Section 5. Soubeyrand et al. (2011) investigated such a situation, in which (i) the particles are released as groups of particles, (ii) the particles of each group are transported together by the wind but the accumulation of air turbulences progressively separate the particles, and (iii) the particles of each group are deposited at nearby locations but the proximity of the deposit locations depends on the accumulation of air turbulences. Typically, the proximity of the deposit locations decreases with the distance separating the source and the final group center location (the larger this distance is, the larger the accumulation of air turbulences and larger the separation of the particles is). A more complete bio-physical justification for such a process is provided in Soubeyrand et al. (2011). In this situation, one expects that the deposit locations of the particles of a group form a cluster whose spatial extent increases with the distance from the source. Consequently, aggregates of points at longer distances from the source should have wider spatial extent. To model this, we can consider a non-stationary auto-correlation function for the zero-mean and unit-variance GRF in (6).

In contrast, consider a situation where aggregation is due to a spatial heterogeneity of the environment. If we assume in addition that the spatial heterogeneity is stationary, then the probabilistic properties of aggregates should be the same across space and the auto-correlation function for may be stationary.

In the application, the bootstrap analysis is only moderately in favour of the presence of deformation (see bottom right panel of Figure 9, where the 90%-confidence envelope is quite wide). Thus, we cannot conclude that ‘group dispersal occurs’ or that ‘the spatial extent of groups varies with distance at the scale of the study domain if group dispersal occurs’. Replicated data could be helpful for more deeply investigating this issue.

Appendix B Properties of

Let the situation be as in Section 5. This appendix shows that is a second-order intensity-reweighted stationary LGCP.

Let denote the random intensity function of the process given by , and let be the corresponding random intensity measure, that is, for any Borel set . In the third step of the estimation procedure for the particle dispersal dataset, we consider the transformed process . To find the distribution of , we compute void probabilities: For any Borel set ,

where for given by (14). Applying the substitution theorem, we get

where is the Jacobian. Hence

Since the distribution of a locally finite simple point process is uniquely determined by the void probabilities, it follows that is a Cox process with random intensity function

The argument of the exponential above is the sum of the stationary and isotropic Gaussian random field and the deterministic trend given by the other terms, i.e., it is a Gaussian random field with non-constant mean but stationary and isotropic covariance structure. Hence is a second-order intensity-reweighted stationary LGCP (Baddeley et al., 2000) with a tractable form of the intensity function and the pair-correlation function.

References

  • Austerlitz et al. (2004) Austerlitz, F., Dick, C. W., Dutech, C., Klein, E. K., Oddou-Muratorio, S., Smouse, P. E. & Sork, V. L. (2004). Using genetic markers to estimate the pollen dispersal curve. Molecular Ecology 13, 937–954.
  • Baddeley et al. (2000) Baddeley, A. J., Møller, J. & Waagepetersen, R. (2000). Non- and semi-parametric estimation of interaction in inhomogeneous point patterns. Statistica Neerlandica 54, 329–350.
  • Chilès and Delfiner (1999) Chilès, J.-P. & Delfiner, P. (1999). Geostatistics. Modeling Spatial Uncertainty. Wiley, New York.
  • Coeurjolly et al. (2017) Coeurjolly, J.-F., Møller, J. & Waagepetersen, R. (2017). Palm distributions for log Gaussian Cox processes. Scandinavian Journal of Statistics 44, 192–203.
  • Cox (1955) Cox, D. R. (1955). Some statistical models related with series of events. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 17, 129–164.
  • Cressie (1993) Cressie, N. A. C. (1993). Statistics for Spatial Data. Wiley Series in Probability and Mathematical Statistics, Wiley, revised edition.
  • Diggle (2013) Diggle, P. J. (2013). Statistical Analysis of Spatial and Spatio-temporal Point Patterns. Chapman and Hall/CRC, Boca Raton, 3rd edition.
  • Efron and Tibshirani (1993) Efron, B. & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall, New York.
  • Fouedjio et al. (2015) Fouedjio, F., Desassis, N. & Romary, T. (2015). Estimation of space deformation model for non-stationary random functions. Spatial Statistics 13, 45–61.
  • Guan (2006) Guan, Y. (2006). A composite likelihood approach in fitting spatial point process models. Journal of the American Statistical Association 101, 1502–1512.
  • Guan and Loh (2007) Guan, Y. & Loh, J. M. (2007). A thinned block bootstrap procedure for modeling inhomogeneous spatial point patterns. Journal of the American Statistical Association 102, 1377–1386.
  • Guan et al. (2015) Guan, Y., Waagepetersen, R. & Jalilian, A. (2015). Quasi-likelihood for spatial point processes. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 77, 677–697.
  • Jarolím et al. (2010) Jarolím, O., Kubečka, J., Čech, M., Vašek, M., Peteřka, J. & Matěna, J. (2010). Sinusoidal swimming in fishes: the role of season, density of large zooplankton, fish length, time of the day, weather condition and solar radiation. Hydrobiologia 654, 253–265.
  • Lannou et al. (2008) Lannou, C., Soubeyrand, S., Frezal, L. & Chadœuf, J. (2008). Autoinfection in wheat leaf rust epidemics. New Phytologist 177, 1001–1011.
  • Van Lieshout (2011) Van Lieshout, M. N. M. (2011). A -function for inhomogeneous point processes. Statistica Neerlandica 65, 183–201.
  • McCullagh and Nelder (1989) McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models, 2nd Edition. Chapman & Hall/CRC, London.
  • Møller (2003) Møller, J. (2003). Shot noise Cox processes. Advances in Applied Probability 35, 614–640.
  • Møller and Toftager (2014) Møller, J. & Toftager, H. (2014). Geometric anisotropic spatial point pattern analysis and Cox processes. Scandinavian Journal of Statistics 41, 414–435.
  • Møller and Waagepetersen (2017) Møller, J. & Waagepetersen, R. (2017). Some recent developments in statistics for spatial point patterns. Annual Review of Statistics and Its Applications 4, 317–342.
  • Møller and Waagepetersen (2004) Møller, J. & Waagepetersen, R. P. (2004). Statistical Inference and Simulation for Spatial Point Processes. Chapman and Hall/CRC, Boca Raton.
  • Møller and Waagepetersen (2007) Møller, J. & Waagepetersen, R. P. (2007). Modern statistics for spatial point processes. Scandinavian Journal of Statistics 34, 643–684.
  • Møller et al. (1998) Møller, J., Syversveen, A. R. & Waagepetersen, R. P. (1998). Log Gaussian Cox processes. Scandinavian Journal of Statistics 25, 451–482.
  • Mrkvička (2014) Mrkvička, T. (2014). Distinguishing different types of inhomogeneity in Neyman-Scott point processes. Methodology and Computing in Applied Probability 16, 385–395.
  • Mrkvička and Soubeyrand (2017) Mrkvička, T. & Soubeyrand, S. (2017). On parameter estimation for doubly inhomogeneous cluster point processes. Spatial Statistics 20, 191 – 205.
  • Mrkvička et al. (2014) Mrkvička, T., Muška, M. & Kubečka, J. (2014). Two step estimation for Neyman-Scott point process with inhomogeneous cluster centers. Statistics and Computing 24, 91–100.
  • Mrkvička et al. (2017) Mrkvička, T., Myllymäki, M. & Hahn, U. (2017). Multiple monte carlo testing, with applications in spatial point processes. Statistics and Computing 27(5), 1239 – 1255.
  • Muška et al. (2018) Muška, M., Tušer, M., Frouzová, J., Mrkvička, T., Richard, D., Seda, J., Morelli, F. & Kubečka, J. (2018). Real-time distribution of pelagic fish: combining hydroacoustics, gis and spatial modelling at a fine spatial scale. Scientific Reports 8, 5381.
  • Myllymäki et al. (2017) Myllymäki, M., Mrkvička, T., Grabarnik, P., Seijo, H. & Hahn, U. (2017). Global envelope tests for spatial processes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79, 381–404.
  • Perrin and Senoussi (2000) Perrin, O. & Senoussi, R. (2000). Reducing non-stationary random fields to stationarity and isotropy using a space deformation. Statistics & probability letters 48, 23–32.
  • Prokešová et al. (2017) Prokešová, M., Dvořák, J. & Jensen, E. (2017). Two-step estimation procedures for inhomogeneous shot-noise Cox processes. Annals of the Institute of Statistical Mathematics 69, 513–542.
  • Rieux et al. (2014) Rieux, A., Soubeyrand, S., Bonnot, F., Klein, E., Ngando, J., Mehl, A., Ravigne, V., Carlier, J. & de Lapeyre de Bellaire, L. (2014). Long-distance wind-dispersal of spores in a fungal plant pathogen: estimation of anisotropic dispersal kernels from an extensive field experiment. PLOS ONE 9, e103225.
  • Sampson and Guttorp (1992) Sampson, P. & Guttorp, P. (1992). Nonparametric estimation of nonstationary spatial covariance structure. Journal of American Statistical Association 87, 108–119.
  • Schoenberg (2005) Schoenberg, F. P. (2005). Consistent parametric estimation of the intensity of a spatial-temporal point process. Journal of Statistical Planning and Inference 128, 79–93.
  • Soubeyrand et al. (2007a) Soubeyrand, S., Chadœuf, J., Sache, I. & Lannou, C. (2007a). A frailty model to assess plant disease spread from individual count data.

    Journal of Data Science

    5, 63–86.
  • Soubeyrand et al. (2007b) Soubeyrand, S., Enjalbert, J., Sanchez, A. & Sache, I. (2007b). Anisotropy, in density and in distance, of the dispersal of yellow rust of wheat: Experiments in large field plots and estimation. Phytopathology 97, 1315–1324.
  • Soubeyrand et al. (2011) Soubeyrand, S., Roques, L., Coville, J. & Fayard, J. (2011). Patchy patterns due to group dispersal. Journal of Theoretical Biology 271, 87–99.
  • Stein (1999) Stein, M. L. (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer-Verlag, New York.
  • Stoyan and Wagner (2001) Stoyan, D. & Wagner, S. (2001). Estimating the fruit dispersion of anemochorous forest trees. Ecological Modelling 145, 35–47.
  • Tanaka et al. (2008) Tanaka, U., Ogata, Y. & Stoyan, D. (2008). Parameter estimation and model selection for Neyman-Scott point processes. Biometrical Journal 50(1), 43–57.
  • Tufto et al. (1997) Tufto, J., Engen, S. & Hindar, K. (1997). Stochastic dispersal processes in plant populations. Theoretical Population Biology 52, 16–26.
  • Waagepetersen (2007) Waagepetersen, R. (2007). An estimating function approach to inference for inhomogeneous Neyman-Scott processes. Biometrics 63, 252–258.
  • Waagepetersen and Guan (2009) Waagepetersen, R. & Guan, Y. (2009). Two-step estimation for inhomogeneous spatial point processes. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 71, 685–702.
  • Zhuang (2015) Zhuang, J. (2015). Weighted likelihood estimators for point processes. Spatial Statistics 14, 166–178.