A Two-Stage Cox Process Model with Spatial and Nonspatial Covariates

08/10/2021
by   Claire Kelling, et al.
Penn State University
0

There are rich new marked point process data that allow researchers to study disparate problems such as the factors affecting the location and type of police use of force, and the characteristics that impact the size and location of forest fires. We develop a novel modeling approach for marked point processes that allows for both spatial and nonspatial covariates; both types of covariates are present in the examples we consider. Via simulated and real data examples, we find that our two-stage log Gaussian Cox process model is flexible and easy to interpret, and potentially useful in many areas of research.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 11

03/23/2020

Approximate Bayesian inference for a spatial point process model exhibiting regularity and random aggregation

In this paper, we propose a doubly stochastic spatial point process mode...
10/11/2019

Burglary in London: Insights from Statistical Heterogeneous Spatial Point Processes

To obtain operational insights regarding the crime of burglary in London...
01/16/2020

Testing proportional hazards for specified covariates

Tests for proportional hazards assumption concerning specified covariate...
05/31/2020

Point Process Regression

Point processes in time have a wide range of applications that include t...
02/10/2021

Currents and K-functions for Fiber Point Processes

Analysis of images of sets of fibers such as myelin sheaths or skeletal ...
03/28/2020

Privacy for Spatial Point Process Data

In this work we develop methods for privatizing spatial location data, s...
03/25/2019

Real-Time Robotic Search using Hierarchical Spatial Point Processes

Aerial robots hold great potential for aiding Search and Rescue (SAR) ef...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Rich marked point process data are increasingly available in a number of domains. Such data enable researchers to answer interesting new questions. However, these data often pose new statistical and computational challenges. We consider two motivating examples: police use of force where the mark is the level of force used by police, and forest fires, where the mark is the amount of burned area caused by each fire. In addition to understanding the nature and driving factors of individual use of force incidents or forest fires, it is also critical for scientists, policymakers, and the public to develop a data-driven, systematic understanding of the important factors at play in driving outcomes generally.

The first motivating dataset we describe is police use of force events in the city of Dallas, Texas. Fully understanding police use of force requires a model for a point process that has multiple types of predictors, including both spatial and nonspatial variables. Examples of spatial variables that may impact police use of force are neighborhood-level characteristics such as the unemployment rate or neighborhood diversity. Nonspatial variables include both individual-level characteristics, such as officer or citizen race and officer training, and event-level characteristics, such as presence of citizen resistance or if the officer or citizen was injured. Our modeling framework determines the potential impact of these variables on the locations of events and on the level of force used by police (the mark). Examples of simulated marks and nonspatial variables following the structure of use of force data from Dallas, Texas in plotted in Figure 1, with spatial variables from the American Communities Survey plotted at the bottom.

Figure 1: Simulated Police Use of Force Data

In the second motivating example of forest fires, potential spatial variables of interest include elevation and terrain; potential nonspatial variables of interest include the time of year that the fire occurred and the cause of the fire (intentional or not). We are interested in determining the potential effect of these variables on the location of fires and the amount of burned area at those locations (the mark). Data from this application for the Castilla-La Mancha region of Spain is shown in Figure 2. We note that our model can incorporate multiple types of marks which is illustrated by our applications (categorical for use of force, continuous for forest fires) and multiple types of spatial variables (areal for use of force, geostatistical for forest fire data).

Figure 2: Forest Fire Data in Castilla-La Mancha, Spain

While there is considerable attention and statistical methodology for point process models involving spatial variables, such as the log Gaussian Cox process (cf. moller2003statistical), there has been less extensive work on incorporating spatial and nonspatial variables into point process models. Furthermore, it is important to have a computationally tractable inferential procedure for this model, and we would like the resulting inferences to be easy to interpret.

Here we develop a novel two-stage log Gaussian Cox process (LGCP) model for a marked point process. Our model utilizes both event/individual-level (nonspatial) and neighborhood-level (spatial) characteristics. We use this model to determine the impact of spatial variables on the locations of events and the relationship between both kinds of variables and the mark of the point process. We also allow for dependence between the location and mark determination processes. We find that inference for our model is reliable and that our approach provides an intuitive interpretation of the impact of spatial and event/individual-level on marked point processes.

Previous approaches to incorporating nonspatial variables into a point process framework have typically been in the context of public health data. liang2008analysis addresses spatial misalignment between spatial and nonspatial variables with an application to disease cases. Specifically, the authors characterize spatial variation of disease occurrence to identify areas with elevated risk. Their bivariate Gaussian process approach breaks new ground by specifying an approach for handling nonspatial covariates in point process models as well as allowing for dependence between marks. Our simulated examples suggest that their approach may pose some parameter identifiability and interpretability issues. best2000spatial studies individual-level outcomes of disease cases. The authors utilize individual-level covariate information from patients, such as sex and home conditions like dampness or parental smoking, as well as a spatial random field to account for unattributed spatially varying risk. They also incorporate the level of nitrogen dioxide, which is considered to be spatially varying. The model incorporates both individual-level and spatial variables in order to compare baseline and relative risk, under given sets of conditions. As in liang2008analysis, the model measures both location and attribute-specific risk intensity simultaneously, which makes it difficult to describe the spatial distribution of the outcome without choosing particular individual-level attributes.

quick2015bayesian uses both spatial and nonspatial information to study mortality rates in an LGCP framework, similar to the model proposed by liang2008analysis

. However, instead of treating nonspatial point-level information as covariates, they treat all nonspatial variables (race, sex, education, and cause of death) as marks through a cross-tabulation of all possible combinations of categorical nonspatial variables to create 24 categorical marks, each of which is called a group. The intensity is then estimated for each of these 24 combinations/groups. The method is applied to North Carolina mortality data, where the authors analyze the relationship between variables such as education and cause of death, where both are included in this cross-tabulation of marks. This is a rich model but parameter interpretation can pose some challenges, as it depends on groups that are defined by the specific combination of categorical marks, and it is difficult to incorporate continuous explanatory variables. As the number of nonspatial variables in the model increases, the number of groups would continue to grow, leading to interpretability and computational challenges.

pinto2015point studies cerebrovascular diseases (CVD) and the nonspatial/ individual-level variables that may play a role in the intensity of cases, namely age at death, education level, gender, and marital status. These nonspatial covariates are incorporated into the intensity function in a similar way to the model proposed by liang2008analysis but pinto2015point allows for additional spatial variation in the effect of the nonspatial covariates. The model includes three Gaussian processes: one for nonspatial variables, another for the interaction between spatial and nonspatial variables, and a third for an unmeasured spatial random effect. This is a very flexible model formulation. However, we are more interested in the effect of individual-level/nonspatial variables on the mark of the point process than on how the effect of these variables change over space. The model presented by pinto2015point is not extended to marked point processes. Furthermore, their flexible model results in large computational challenges, which the authors address by discretizing the analysis into sub-regions.

There is considerable research on augmenting areal spatial data with individual-level data from a sample of the population that is at risk. diggle2010estimating provides a review of many of these methods and discusses an alternative framework for incorporating the individual-level data of cases for a population of at risk individuals, where . diggle2010estimating presents a model for the cases based on an inhomogeneous spatial Poisson Process. The authors present a simulation study and an application to childhood meningococcal disease where they use individual-level data of the observed cases to estimate characteristics of the full at-risk population using aggregated covariate information over sub-regions of the space. This method divides the region into subregions that partition the region and then aggregates covariate information for each region in order to determine the number of individuals at risk in each subregion. As in liang2008analysis, individual and spatial data are incorporated in the same part of the model, and therefore it is assumed that individual information is available at any location .

We summarize the main contributions of this paper below.

  • [nosep]

  • We introduce a novel two-stage log Gaussian Cox process model that assigns locations based on spatial predictors in the first stage and marks based on nonspatial and/or spatial variables in the second stage. We test the inclusion of dependence structures both within and between stages of the model.

  • Current models (cf. liang2008analysis; diggle2010estimating; pinto2015point) that incorporate nonspatial variables by and large assume that they are available everywhere in the spatial domain by including them in the spatial intensity function. Such an assumption imposes some challenges conceptually as well as in practice. In particular, it becomes challenging to think of a straightforward approach to simulating from such models. We describe how to simulate point process data where the intensity is a function of both spatial and nonspatial variables; this is likely to be broadly useful for practitioners and perhaps also useful in specific applications such as synthetic data creation for preserving privacy (cf. quick2015bayesian).

  • We use simulation to understand the interpretability of inference based on our method, and compare it to a bivariate modeling approach (liang2008analysis; pinto2015point). Furthermore, we study how the availability of replicates may impact parameter inference.

2 Marked Point Process Models with Spatial and Nonspatial Predictors

We describe the general setup for log Gaussian Cox Process (LGCP) models in order to motivate two models: our new two-stage marked point process model and the bivariate Gaussian process developed by liang2008analysis. Point processes are defined by a set of n locations in a spatial region/window of interest, . In our case, these locations are police use of force incidents in Dallas, Texas or forest fires in the Castilla-La Mancha region of Spain. We assume the events come from a spatial point process with intensity , , where

consists of unknown parameters such as regression coefficients and covariance function parameters. The intensity may be determined by a vector of spatial covariates at location

, which we notate as . The spatial variables for our study of police use of force in Dallas are collected at the census tract level and points are assigned the census estimate of the census tract where they are located. Spatial variables for the forest fire data are geostatistical, and are provided on a fine grid over the region. Locations of the point process are assigned values for spatial variables of the nearest grid location. The intensity may also be impacted by a vector of individual/event-level or nonspatial covariates at that event, which we call (cf. liang2008analysis).

For a log Gaussian Cox process (cf. moller2003statistical), the intensity function may be defined as follows: , where consists of and , which are vectors of unknown spatial and nonspatial regression coefficients respectively. The vector also consists of parameters of , which is a zero-mean Gaussian process. We define the Gaussian process via a positive definite covariance function. Later we provide details about the form of the covariance function we use for each model. For estimation of the Gaussian processes in this paper, we rely on the predictive process approach, as outlined in the Supplemental Materials, Section 1.3.

The resulting likelihood function is given below, where the parameter vector typically consists of regression coefficients, such as and as described above, and parameters of the covariance function. Details of the estimation of the integral are given in the Supplemental Materials, Section 1.2.

(1)

We denote the mark at each point process location as , which could be continuous or categorical. When marks are categorical, such as type of force used by police, we call these point processes multitype point processes. We also consider a continuous mark in the case of forest fire data, where we model the amount of burned area in each fire. Existing approaches are often specified specifically for categorical (multitype) point processes (cf. waagepetersen2016analysis; moller1998log; brix2001space; liang2008analysis). We develop a framework that is flexible and can easily accommodate both continuous and categorical (multitype) marks.

2.1 Two-Stage Marked Point Process Model

The two-stage model we develop here specifies a first stage model that determines the locations of the point process events and a second stage which determines the marks at those locations. Two-stage models, sometimes referred to as hurdle models, have been developed in the context of areal data (cf. ver2007space; michaud2014estimating) as well as point-referenced data (cf. recta2012two).

In our model, the nonspatial covariates are not used to determine the locations of the events, but rather serve as part of the mark-determination stage. Nonspatial covariates may not be available across the spatial domain but are typically only observed where a point process event occurs, which motivates this model structure. Nonspatial variables may be associated with an event (such as arrest/hospitalization at an incident) or with an individual involved in an event (officer/citizen race/gender). The first stage of this model consists of an LGCP based on spatial variables and a Gaussian process, where the intensity takes the following form: . For the Gaussian process, , we assume a univariate exponential covariance function so that for any the covariance is defined such that , with parameters .

The second stage of the model, in the case of a categorical mark, consists of logistic regression to determine the mark, given a location. The logistic regression could include any combination of nonspatial and spatial variables. The logistic regression takes the following form, where

is the mark of location (either 0 or 1):

Depending on the particulars of the application, spatial predictors can be included or excluded from the second stage of the model. This two-stage model may be more appropriate than alternatives when the relationship between covariates (especially nonspatial covariates) and marks are of interest, rather than the dependence between types of points. Dependence between types of points may be better captured through a bivariate Gaussian process approach as described in liang2008analysis.

In the most general and flexible formulation of our model, we allow for spatial dependence in the two stages of our model. We assume a Gaussian process for the error in both stages of the model, and then incorporate a dependence structure between the two Gaussian processes. Following liang2008analysis, the Gaussian processes and are defined such that where the cross-covariance matrix is of dimension , where is the number of points. The cross-covariance matrix consists of both a univariate exponential correlation function, as defined above where , and the matrix , as defined below. The full cross-covariance function is (cf. liang2008analysis). If is close to 1, this would indicate strong dependence, and close to 0 would indicate weak dependence.

This formulation leads to the following two-stage model, described in Equation 2, with the first stage determining the location and the second stage determining the mark of the point process. This dependence structure allows for dependence in the spatial structure that determines both where use of force events occur and what level of force occurs at that location.


Model Number First Stage Second Stage
Model 1 NHPP
Model 2 Univariate Gaussian Process
Model 3 Univariate Gaussian Process Univariate Gaussian Process
Model 4 Bivariate Gaussian Process
Table 1:

Two-Stage Models Considered. In the case of linear regression (continuous mark), we consider standard normal error in the second stage of the model, as shown in the table. In the case of logistic regression (categorical mark), we consider no error term in the second stage of the model.

(2)

Our general framework can lead to the special cases that are presented in Table 1. We consider these cases in the simulated results as well as fitted to the forest fire data to compare model fit. In the case of a continuous mark, the second stage of the model is linear regression rather than logistic regression. For categorical marks/logistic regression, we do not include an error term in the second stage of the model when there is no Gaussian process. For continuous marks/linear regression, we consider iid standard normal error in the second stage of the model, with parameter .

The likelihood for the two-stage model, including the logistic regression portion, is included below, where and .

(3)

We assume Normal(0,100) priors for all of the regression coefficients (). We use an Inverse-Gamma(, ) prior for , as in liang2008analysis. As in other studies (cf. liang2008analysis), we fix due to identifiability issues. We want the covariance function not involving , , to be small when points are far apart and large when points are close together. We find the value of so that the 95 percentile of distances would have a a correlation of 0.05, and the value of so that the 5 percentile of distances would have a a correlation of 0.95. We fix at the average of these two values.

1:Define locations, . Simulate a Gaussian process, with parameters .
2:Evaluate the intensity at and compute the maximum intensity, .
3:Simulate a homogeneous Poisson Process with parameter on , where is the area of .
4:Affiliate each simulated point with one of . This is to assign a value of to each simulated point of the homogeneous Poisson process.
5:

Calculate the probability of keeping the point at location s,

. Thin the points by keeping each point with this probability, otherwise remove.
6:Simulate nonspatial variables according to their distribution and assign values to each point from the thinned homogeneous Poisson process.
7:Assign the location s to mark 1 with probability, , where . Otherwise, assign to mark 0. : In our approach, we found it helpful to generate based on a grid of equally spaced points in the observation window, . : When computing the maximum data using spatial variables available on the areal level, at least one point per areal unit is needed. We use a number of points per unit proportional to the area of the areal unit. : Options for affiliating each simulated point with include affiliation with the closest point or predictive process transformations.
Algorithm 1 Two-Stage Marked Point Process Simulation

We fit our model to data simulated from the two-stage model via Algorithm 1. In order to simulate from this model, we assume a spatial window of interest with area (in the simulated example, it is the Dallas city boundary). For our simulated example, we also assume that has been broken into areal units and we have spatial variables affiliated with those areal units. We use the Census data available to us from census tracts, but these variables may also be simulated. Spatial data could also be geostatistical, as with the elevation data in the forest fire dataset. We now describe in Algorithm 1 how to simulate from a two-stage marked point process with parameters , the regression coefficients for the nonspatial variables, , the coefficients for the spatial variables that are used in the first stage, , the coefficients for the spatial variables that are used in the second stage, and covariance parameter

. We simulate nonspatial variables from Beta and Bernoulli distributions, but we find that the two-stage model is not sensitive to choice of distribution when other distributions are tested.

First, we calculate the maximum intensity across the space () in order to simulate a homogeneous Poisson process based on this maximum intensity. In order to calculate the maximum intensity, there needs to be at least one point per areal unit, or census tract in our case, in order to ensure that we accurately calculate the maximum intensity. Next, we thin the homogeneous points based on the intensity , which in our case only depends on spatial variables, to determine the final locations present in our dataset, which is shown in Figure 3. There are no restrictions on the distributional assumptions of the nonspatial variables. For instance, unlike the bivariate model (Section 2.2), they need not be uniform or bounded. After the thinning procedure, both the marks and nonspatial variables are assigned to the remaining locations, as shown in Figure 3.


(a) Calculate :

(b) Simulate Homogeneous Poisson Process:
(c) Thinning Based on Spatial Variables:
(d) Assign Mark:
Figure 3: Simulation Process for Two-Stage LGCP Model

2.2 Bivariate Model

A bivariate marked point process model was developed by liang2008analysis to study two types of cancer and their relationship to each other and with community-level and individual characteristics. The model includes mark-dependent coefficients for both spatial and nonspatial variables. Dependence between categorical marks is introduced through a dependent Gaussian process structure, where and follow the dependence structure shown in Section 2.1. Instead of the Gaussian processes being affiliated with each stage as in the two-stage model, the Gaussian processes are affiliated with each level of the mark, or level of force in our simulated example.

In general, the mark-dependent intensity function for marks , notated , takes the following form, with mark dependent coefficients and corresponding to spatial covariates at location , , and nonspatial covariates, , respectively. Finally, , are the Gaussian processes affiliated with each level of the mark.

The likelihood for the bivariate model including both spatial and nonspatial variables and for levels of the mark is given below, where :

(4)

The intensity for each mark level , , could be specified with or without the Gaussian process . The Gaussian process specification allows for the additional flexibility of modeling dependence between the mark levels.

As in the two-stage model, we use Normal(0,100) priors for all of the regression coefficients ( for marks ). We use an Inverse-Gamma(, ) prior for and and a Uniform (-0.999,0.999) prior for , as in liang2008analysis.

The full simulation algorithm for the bivariate model is included in Algorithm 1 in the Supplemental Materials, Section 2. The steps are depicted in Figure 4. Of particular concern, we note in the first step the need to calculate for each of mark levels, where consists of both spatial and nonspatial variables, shown in Figure 4. This is intuitive for spatial variables but the procedure is not so clear when nonspatial variables are incorporated into the intensity function. In order to calculate for each mark level, we require bounded distributions for the nonspatial variables; this is also required for model inference as shown in the Supplemental Materials, Section 1.1. After calculating for each mark level , we again simulate a homogeneous Poisson Process, but this time with a separate point process for each mark level. Next, we simulate nonspatial variables for each point. Finally, we thin the point process by calculating the intensity at each point and thinning points based on their probability in relation to the maximum intensity to obtain the final point process based on both spatial and nonspatial variables, shown in Figure 4.


(a) Calculate for marks:

(b) Simulate Homogeneous Poisson Processes for marks:

(c) Simulate Nonspatial Variables:

Based on uniform distributions


(d) Mark-Dependent Thinning:
Figure 4: Simulation Process for Bivariate Model

3 Applications to Simulated Data

We studied our modeling and inferential approach via simulated examples. Our simulated data closely mirrors the police use of force data from Dallas, Texas, where we create a marked point process with both spatial, using Dallas census tracts, and nonspatial covariates of multiple types. Given several open questions about the use of force dataset, in Dallas, Texas and otherwise, in this manuscript we restrict our attention to the simulated point process data, resembling the use of force data, and the forest fire application. The areal data is not simulated, and is collected from the 2013-2017 American Communities Survey on the census tract level. For spatial variables, we use one variable that is continuous, such as median age, which we denote with a corresponding coefficient . We also use another spatial variable that is bounded between 0 and 1, such as the unemployment rate or the Herfindahl index, which we denote with a corresponding coefficient of . For nonspatial variables, we consider one variable that is continuous, to represent a variable such as an officer’s tenure on the police force, which is denoted by with a corresponding coefficient of . We also use a nonspatial variable that is binary, to represent an indicator variable such as male officer, which is denoted with a corresponding coefficient of . For both the two-stage and bivariate models, we consider models with and without a Gaussian process in our simulation framework. For both the two-stage and bivariate models we analyze simulated data generated through thinning procedures, as outlined above, to generate point processes with the appropriate intensity function. We fit both the two-stage and the bivariate models using the NIMBLE R package (nimble).

3.1 Two-Stage Model

In our first simulated example, we consider the model that includes spatial variables in the LGCP stage of the model and only nonspatial variables in the mark determination stage of the model, and we do not include a Gaussian process in either stage of the model. We find that we are able to recover the logistic regression estimates for the nonspatial variables. The coefficients for the spatial variables are recovered well when we do not include a Gaussian process.

Figure 5: We simulate data from the two-stage model, with a Gaussian process in the first stage of the model. The posterior densities of all the parameters for three replicates of this data are shown in red, green, and blue. When the replicates are fit simultaneously, by multiplying the likelihoods, the posterior densities for the parameters are shown in orange.

When we add a Gaussian process to the first stage of the model, we find that we are able to recover regression parameters associated with spatial and nonspatial variables again reasonably well. However, the estimate of from the covariance of the Gaussian process is not as accurate. It is often difficult in LGCP-type models to tease apart information about these parameters from a single realization of the point process. To study this further, and because there are situations where replications are available, we consider the situation where there are multiple realizations, or replicates, of the point process, using the same parameters. In Figure 5, the red, blue, and green lines represent the inferred values using just a single realization, that is, when we consider each of the three replicates separately. We see that they range rather widely, especially for the intercept, and the covariance parameter, .

However, when we infer parameters based on all three replicates at once, we find that we better estimate all the parameters, including and (orange curves in Figure 5). The likelihood function here is simply the product of the three likelihoods in Equation 3 across the individual replicates, assuming independence between replicates.

Two-stage model with cross-correlation

We also simulate from and fit the more general version of the two-stage model shown in Equation 2 to study the impact of the flexible assumption of dependence across the two stages of the model. We test all possible combinations of simulated data and models fitted to that data from Table 1, including fitting simple models to more complex data, and complex models to simple data. This allows us to test whether our modeling procedure shows an improved fit and can recover parameters from the more complicated model where there is dependence between the two stages.


Simulated Model (Truth)
Fitted Model Model 1 Model 2 Model 3 Model 4
Model 1 -17,954.09 -17,583.53 -8,917.32 -22,075.13
Model 2 -17,955.55 -18,069.73 -9,817.12 -23,807.32
Model 3 -17,989.75 -18,231.51 -9,860.67 -23,930.88
Model 4 -17,975.52 -18,213.73 -9,892.37 -23,968.50
Table 2: Model Comparison Two-Stage Model, Simulated Data vs Model Fit (WAIC). Rows represent different models that were fit to the data and columns represent the model that the data was simulated from. Bold values indicate best model fit.

We present the results in Table 2. We find that the bivariate and univariate (univariate GP in both stages) models have the lowest WAIC and therefore best model fit, regardless of the form of the simulated data. The WAIC values tend to be similar between the two models. When the second stage of the model does not include a GP (Models 1 and 2), the univariate GP model provides the best fit. When the second stage of the model includes a GP (Models 3 and 4), regardless of dependence in the GP, the bivariate GP model provides the best fit. In Table 2, we include the simulated case of the bivariate GP model where is 0.9.

In the Supplemental Materials (Section 3, Table 1), we also illustrate our findings for different values of in the case where data is simulated from the bivariate GP model. We find that the WAIC values indicate a similar model fit for both the bivariate GP model and the model with a univariate GP in both stages of the model, regardless of the value of that is used. We also find that is consistently overestimated, even when is small (0.5).

As shown in the Supplemental Materials, Section 3, the fit is similar between the univariate and bivariate GP models. Due to the increased computational burden of fitting a bivariate GP model and the similar fit between the univariate GP and bivariate GP models, we recommend fitting the univariate GP model as the default as it seems to work well across different cases.

Figure 6: We simulate data from the bivariate model, without a Gaussian process, and compare the results for the when the nonspatial variables are simulated from uniform or nonuniform distributions. The posterior densities for all parameters are shown in black when we include nonspatial variables simulated from uniform distributions and blue for nonspatial variables simulated from nonuniform distributions. The true parameters from the simulation are shown in red.

3.2 Bivariate Model

We test many different cases of the bivariate model with both spatial and nonspatial variables. From our results, we suspect that there are identifiability issues between the nonspatial variables and the intercept, as the intercept is not recovered well when we include nonspatial variables, even when we do not include a Gaussian process. In Step 5 of the bivariate simulation algorithm as described in the Supplemental Materials, Algorithm 1, we simulate nonspatial variables from pre-specified distributions. We find that we are only able to recover nonspatial variables when they are simulated from uniform/equal probability distributions. As shown in the Supplemental Materials, Equation 1, bounds on continuous distributions are required in order to calculate the integral of the intensity function. However, we find that even when we simulate from a bounded distribution that is not uniform, we cannot successfully recover the nonspatial variable coefficients. When we draw the realizations of the nonspatial variables from a uniform distribution, we are able to effectively recover the nonspatial variable coefficients, but still are unable to recover the intercept. When the nonspatial variables are binary, we also find that the only way to recover nonspatial coefficients is to simulate with equal probability from 0 and 1.

We illustrate our findings in Figure 6, where the models are estimated without a Gaussian process. The black line shows the posterior distributions of the nonspatial parameters where the associated variables are simulated from a uniform distribution for the continuous variable (

) and a Bernoulli distribution with equal probability on 0 and 1 for the binary variable (

). We see that, with the exception of the intercept, the spatial and nonspatial variables are recovered reasonably well in most cases. We note that even though the variables are simulated from these uniform/equal probability distributions, our simulation process relies on a thinning procedure, as described in Algorithm 1 in the Supplemental Materials, and the distributions of the nonspatial variables after thinning no longer resemble uniform/equal probability distributions. The blue line in Figure 6 shows the posterior distributions when the nonspatial variables are simulated from non-uniform distributions and Bernoulli distributions with unequal probability. The nonspatial variables are not estimated well, and the spatial variables are recovered similarly well to the case above.

We illustrate our results with a Gaussian process in Figure 7. When we add the bivariate Gaussian process to this model, with cross-covariance , we find once again that we are unable to recover the intercept parameters, for either mark. We are able to recover both spatial () and nonspatial () variable coefficients relatively well, when the are once again simulated from uniform/equal probability distributions. The main inferential challenge in this method is estimating the parameters of the bivariate Gaussian process. We find that we are able to recover a difference between and , namely that is smaller than , but we do not recover the true value well. We do not recover the parameter well, as our point estimate for is around 0.7, while the true value is 0.5.

Figure 7: We simulate from a bivariate model with a bivariate Gaussian process between the two simulated levels of force. The posterior densities for all parameters are illustrated in black, and the true parameter from the simulation is shown in red.

4 Application to Castilla-La Mancha Fire Data

Next, we apply the two-stage model to data on forest fires from the Castilla-La Mancha region of Spain, shown in Figure 2. The fire data used to evaluate the model is the clmfires forest fire data from the spatstat package in R (spatstat). Nonspatial variables with the fire data indicate the cause of the fire and whether the fire occurred in the summer season. Spatial variables include elevation and other characteristics of the land, such as land cover and slope, which are available on a fine grid. We are also interested in characterizing which factors impact the amount of land burned in each forest fire. We consider 3,657 fires from 2004 to 2007. Data collected before 2004 were recorded with less precision than after 2004, so we exclude these cases. We also exclude fires that had “other” listed as the cause of the fire, as we are interested in comparing intentional fires to those caused by lightning or an accident. In the first stage of the model, we test the following spatial variables and their impact on determining the locations of forest fires in the Castilla-La Mancha region of Spain: indicator variable for if the location is in a forest, elevation, and the slope at that location. In the second stage of the model, we use the following nonspatial variables: an indicator variable for if the fire was set intentionally (compared to being set by lightning or an accident) and an indicator variable for if the fire occurred in the summer season. We combine the spatial and nonspatial variables listed above to determine their potential impact on the mark of the point process, or the amount of land burned, in hectares, by a given fire. Therefore, this model is slightly modified from the two-stage model shown in Equation 2, where the second stage is now linear regression, where previously it was logistic regression in the case of simulated police use of force incidents. This is an advantage of this model, as it allows the user to consider many different kinds of marks, both continuous and categorical, whereas the multivariate Gaussian process model proposed by liang2008analysis allows only for categorical marks.

alba2018homogeneity analyze the clmfires data by comparing two different fire seasons per year. The authors find it is important to consider whether fires occur in the summer season during some years, motivating the use of this nonspatial covariate in our model. myllymaki2020testing use the clmfires data to model local covariate effects. Specifically, they incorporate elevation and an indicator variable for whether the fire occurred in a forest in the spatial model, motivating our use of these variables as well. dvovrak2020nonparametric

develop nonparametric methods to study the relationship between points, marks, and covariates. Through the use of test statistics, they conclude that there is dependence between the points, marks, and covariates that they consider (the mark is the area burned and the covariate is elevation). The authors consider only fires in 2007, and when studying the relationship between points and (spatial) covariates alone, they find no relationship between elevation and locations of fires, which will be tested in the first stage of our model.

We fit four models to the fire data. The first model considers a non-homogeneous Poisson process (NHPP) in the first (location-determination) stage and standard normal error in the second (mark-determination) stage. In the second model, the first stage has a Gaussian process and the second stage has standard normal error. In the third model, there is a univariate Gaussian process in both stages of the model, with no dependence between them. In the final model, we consider dependent Gaussian processes in the first and second stages, as shown in Equation 2.

The results for model comparison are presented in Table 3. We find that the lowest WAIC is for the model which includes a univariate GP in both stages of the model, with no dependence between the stages. This is followed by the bivariate Gaussian process model, with dependence between the stages.


Model Type First Stage Second Stage WAIC
Model 1 NHPP -3,866.11
Model 2 Univ. Gaussian Process -2,328.69
Model 3 Univ. Gaussian Process Univ. Gaussian Process -7,472.51
Model 4 Biv. Gaussian Process -5,657.39
Table 3: Model Comparison Two-Stage Model, Castilla-La Mancha Spain Forest Fire Data

Next, we analyze the effect of the variables included in both stages of the model. We show the results for the model with a univariate GP in both stages in Table 4

, and the results for all models in Table 2 in the Supplemental Materials. Values in bold in the table are coefficients whose credible intervals do not include 0. In the first stage of the model, or the location-determination stage, we find that in all models that include at least one Gaussian process (and therefore excluding the NHPP model), elevation has a negative impact on the intensity of forest fires, and slope has a positive impact. This would indicate that at locations with higher elevation and regions with a smaller slope, we are less likely to see forest fires. In the NHPP model, we find elevation has a positive impact and the credible for slope includes 0.

In the second stage of the model, where we determine which variables have a potential impact on the amount of land burned in the forest fire, we find that the results are dependent on whether a Gaussian process is included in the second stage of the model. When we do not include a GP in the second stage, the indicator variable for whether the fire was set intentionally and the elevation both have an impact on the amount of area burned by the fire. If the fire was burned intentionally and at lower elevations, we would expect more land to be burned. The negative impact of elevation on area burned is consistent with exploratory analysis conducted by dvovrak2020nonparametric. After including a Gaussian process in the second stage of the model, we find that the indicator variable for summer and elevation play a role in the amount of area burned. Specifically, elevation has a negative impact on amount of area burned as before, and the summer indicator variable also has a negative impact, where fires that are burned in the summer are likely to have a smaller area burned. With the dependent Gaussian processes, slope also has a negative impact on the amount of area burned, where a higher slope would lead to less area burned.

Finally, we consider the dependence parameter, , in the fourth model which includes a bivariate Gaussian process. The credible interval for includes 0, which indicates no dependence between the two stages of the model. This is supported by the WAIC indicating that the third model, which includes univariate Gaussian processes in both stages of the model, is a better fit to the data.


Stage 1 Stage 2
Variable Mean (95%CI) Mean (95%CI)

Spat Var

(intercept) 4.1115 (3.9397, 4.2862) 0.0571 (-0.3231, 0.4236)
Forest (ind.) -0.0494 (-0.1531, 0.0537) 0.0479 (-0.1735, 0.2697)
Elevation -7e-04 (-9e-04, -5e-04) -8e-04 (-0.0013, -4e-04)
Slope 0.0089 (1e-04, 0.0171) -0.0121 (-0.0276, 0.0028)
or 5.3444 (4.8574, 5.8351) 6.694 (5.6753, 7.7473)
(iid) 2.1898 (2.1389, 2.2415)

Nonsp

Intentional (ind.) 0.0318 (-0.1314, 0.2)
Summer (ind.) -0.2318 (-0.3787, -0.0807)
Table 4: We present the posterior mean as well as the 95% Credible Intervals for all covariates in our analysis of the Spain Fire data in the Castilla-La Mancha region of Spain. Values in bold indicate coefficients whose credible intervals do not contain 0. We present the estimates for Model 3 (univariate GP in both stages) below and the full results in the Supplemental Materials (Table 2).

5 Discussion

Through this study, we have demonstrated the utility of the two-stage model in estimating the relationship between spatial and nonspatial variables and marked point process outcomes. Through the two applications discussed in this paper, we have also shown how the model can be applied to multiple types of marks, such as continuous and categorical. Specifically, with the police use of force application, we are interested in nonspatial variables such as officer and citizen race/gender, officer tenure, citizen resistance, and the number of citizens and officers involved. Spatial variables include community factors, such as median income and neighborhood diversity. These spatial and nonspatial variables could be related to the mark, or the level of force that is used by the police officer. Through a simulated example created to closely resemble police use of force data, we have shown that this two-stage model provides interpretable and accurate estimates of relationships between covariates and both the intensity and mark of point processes.

For the forest fire data, we are interested in determining which factors play a role in both determining the location of forest fires and the amount of burned area at the fire locations. We show that this model provides a clear interpretation of nonspatial and spatial variables and the relationship with these marked point process outcomes by splitting the model into two stages: location determination and mark determination. We show that the factors impacting the location of fires and the factors impacting the amount of burned area at those locations are not necessarily the same, and accounting for spatial dependence when determining the impact of variables on the mark is important.

We anticipate that this new two-stage model could be widely applicable to other types of datasets. For example, liang2008analysis studies the impact of nonspatial variables on cancer. The authors parameterize their results of the bivariate model to show fixed effects on one mark, and differential effects for the other mark, where they hope to show if the variable has a different impact on the two cancers. In the first stage of the model, we would determine if the spatial variables contribute to an increased intensity of the outcome of interest, in liang2008analysis’s case this would be cancer. In the second stage of the model, we could incorporate spatial and/or nonspatial variables to determine the perhaps differential impact of the variables on the different marks, or types of cancer. The dependence structure in the bivariate Gaussian process could be used to test for dependence between the location-determination and the marking processes.

In the future, we would like to apply this model to use of force data provided through the Police Data Initiative and Open Data city websites to conduct an in-depth analysis of the impact of both spatial and nonspatial variables on police use of force, and how these trends compare across cities and jurisdictions.

6 Acknowledgements

The authors would like to thank Aleksandra Slavković, Ephraim Hanks, and Corina Graif for helpful conversations. This project was supported by Award No. 2020-R2-CX-0033, awarded by the National Institute of Justice, Office of Justice Programs, U.S. Department of Justice. The opinions, findings, and conclusions or recommendations expressed in this publication/program/exhibition are those of the author(s) and do not necessarily reflect those of the Department of Justice.

7 Appendix

7.1 Computational Details

7.1.1 Integration and Nonspatial Variables

The integral presented in Equation 4 cannot be evaluated explicitly and must therefore be approximated. If we omit the mark, the integral to be approximated is as follows:

When we consider an intensity function of the form , where there is no interaction between the spatial and nonspatial variables, then the integral can be separated by spatial and nonspatial variables, shown below.

As in liang2008analysis, if a nonspatial variable in is continuous, we must specify an upper and lower bound for the covariate, say and , respectively. However, if is categorical or discrete, it is less clear how to incorporate these nonspatial variables into the integral. liang2008analysis and banerjee2014hierarchical both indicate that integrals over discrete variables should be replaced by sums but also that when is categorical, than the integration is only over the spatial domain,

. In our application, we are interested in nonspatial categorical (binary) variables such as officer gender, citizen gender, and if the officer and the citizen are the same race or not. We find that when we do not incorporate these categorical variables into the integral, the estimates for

are not accurate.

Therefore, we outline how to evaluate this integral while incorporating nonspatial variables. We provide an example where is continuous over a bounded distribution from to and is a binary indicator variable. The integral over is the sum of the function evaluated at 0 and 1, as suggested by banerjee2014hierarchical where the authors suggest that discrete variables are integrated with sums. This is easily extendable to additional continuous and binary variables.

(5)

7.1.2 Integration Methods

We investigated many different integration methods in this study. It is important to get an accurate measure of the integral of the intensity function in order to accurately conduct inference. First, we find it is important to include at least one point in each areal unit (in our case, census tracts) as the intensity function depends on the evaluation of the function over each areal unit. The spatial variation of the intensity function depends only on the census tracts and the Gaussian process. Therefore, it is important to include more than one point in large census tracts to capture the variation in the Gaussian process. As in liang2008analysis, we set the number of integration points per census tracts as a minimum of one and otherwise proportional to the area. We place these points randomly in the given census tract after determining the number of points to be placed in that census tract. These integration points are shown in Figure 8 as the black crosses. We test different integration methods over the region, such as Delaunay triangulation, and we find that a quasi-Monte-Carlo estimate over the integration points produces similar estimates, and we proceed with a quasi-Monte-Carlo estimate for the integral.

7.1.3 Predictive Processes

The Gaussian process,

needs to be estimated as a finite dimensional random-variable. However, estimating a Gaussian process with a

covariance matrix is computationally intensive. Therefore, we use the predictive process approach advocated for by banerjee2008gaussian. We use a set of knots evenly distributed over the window of interest, as in liang2008analysis. These knots are shown in Figure 8 as the blue circles. If are the predictive process knots, and is a realization of the Gaussian process over the knots, then we transform this realization over the knots to be realizations over both the observed events as well as the integration points, discussed above. The multivariate predictive process realization, , is calculated as follows:

For the two-stage model, if is the number of knots in the predictive process and is the number of points in the point process, then is the covariance matrix between the two Gaussian processes. Next, is the covariance matrix of the Gaussian process over the knots, (banerjee2008gaussian; banerjee2014hierarchical). Then the resulting is the mean-zero predictive process.

For the bivariate model, if is the number of marks and again is the number of knots in the predictive process and is the number of points in the point process, then is the covariance matrix between the two Gaussian processes. The covariance matrix for the bivariate model involves both the exponential correlation function as well as the cross-covariance function. As in the two-stage model, is the covariance matrix of the Gaussian process over the knots, . Then the resulting is the mean-zero predictive process.

Figure 8: Integration Points and Predictive Process Knots

7.2 Bivariate GP Simulation Algorithm

Below, we include the detailed simulation algorithm for the bivariate marked point process (Algorithm 1 in the Supplemental Materials). The simulation algorithm for the two-stage model is shown in Section 2.1. Both algorithms are structured based on the thinning technique, where we start with a homogeneous Poisson Process and we thin to a point process with a specified intensity function. In order to do this, we need to calculate the maximum possible intensity over the region, based on our pre-specified intensity function that we are simulating from. For the two-stage model, the marks are incorporated in the last step of the simulation process. For the bivariate model, the intensity involves mark-dependent parameters, so the maximum intensity must be calculated for each mark. Another key difference between the bivariate and two-stage simulation procedures is that in the bivariate model, the marks and nonspatial variables are simulated before thinning, where in the two-stage models, these variables are assigned after thinning.

The bivariate marked point process has parameters for each mark (, the regression coefficients for the nonspatial variables, for each mark (, the coefficients for the spatial variables, and covariance parameters and . As shown in Section 3.2, we need to simulate nonspatial variables from uniform distributions/equal probability distributions for the bivariate model.

1:Define locations, . Simulate a bivariate Gaussian process, and with parameters and .
2:Decide the bounds of continuous nonspatial variables from their distributions in order to calculate the maximum contribution to the intensity function by the continuous nonspatial variables, .
3:Calculate maximum intensity for each mark, and .
4:Simulate a homogeneous Poisson Process for each mark, with parameter for mark , on , where is the area of .
5:Simulate nonspatial variables according to their distributions and assign values of the nonspatial variables to each point from the unthinned homogeneous Poisson process.
6:Affiliate each simulated point with one of . This is to assign a value of and to each simulated point.
7:Thin the points by keeping the location of mark in the homogeneous Poisson process with probability , otherwise remove. : In our approach, we found it helpful to generate based on a grid of equally spaced points in the observation window, . : Options for affiliating each simulated point with include manual affiliation with the closest point or predictive process transformations.
Algorithm 2 Bivariate Marked Point Process Simulation

7.3 Two-stage Model with cross-correlation, Dependence Tests

As discussed in Section 3.1, we simulate data with many different true values of , shown in Table 5. We find the WAIC indicates similar model fit between the model with a bivariate GP and the model with a univariate GP in both stages. The parameter is also consistently overestimated when fitting the bivariate model, regardless of the true value of .


Fitted Model BiGP () BiGP () BiGP ()
NHPP-IndNorm -22,056.83 -22,075.13 -22,099.14
UniGP-IndNorm -23,796.88 -23,807.32 -23,820.66
UniGP-UniGP -23,867.00 -23,930.88 -23,998.40
BiGP -23,835.87 -23,968.50 -23,985.42
Table 5: Model Comparison Two-Stage Model, Simulated Data vs Model Fit (WAIC). Rows represent different models that were fit to the data and columns represent different values of the parameter used to simulate data from the dependent bivariate Gaussian process model. Bold values indicate best model fit.

7.4 Posterior Mean Estimates, Models 1-4, Forest Fire Data

Table 6 includes results for models 1-4, as applied to the forest fire data from the Castilla-La Mancha region of Spain.


NHPP-IndNorm (Mod 1) UniGP-IndNorm (Mod 2)
Variable Mean (95%CI) Mean (95%CI)

Stage 1

(intercept) 3.6317 (3.527, 3.7371) 4.2412 (3.9488, 4.5101)
Forest (ind.) -0.0109 (-0.1054, 0.0862) -0.0602 (-0.1669, 0.0461)
Elevation 2e-04 (1e-04, 3e-04) -9e-04 (-0.0012, -5e-04)
Slope 0.0049 (-0.0013, 0.0112) 0.0098 (0.0017, 0.0178)
6.3267 (5.3745, 7.727)

Stage 2

(intercept) 0.5792 (0.2369, 0.9149) 0.5764 (0.2287, 0.9227)
Intentional (ind.) 0.3963 (0.221, 0.5715) 0.3973 (0.2192, 0.5706)
Summer (ind.) -0.1157 (-0.2781, 0.0435) -0.1164 (-0.2773, 0.0458)
Forest (ind.) -0.225 (-0.4555, 0.0128) -0.2259 (-0.4594, 0.0076)
Elevation -0.0016 (-0.002, -0.0013) -0.0016 (-0.002, -0.0013)
Slope -0.0053 (-0.0194, 0.009) -0.0053 (-0.0194, 0.0088)
(iid) 2.4201 (2.3647, 2.4749) 2.4198 (2.3647, 2.4759)
UniGP-UniGP (Mod 3) BivGP (Mod 4)
Variable Mean (95%CI) Mean (95%CI)

Stage 1

(intercept) 4.1115 (3.9397, 4.2862) 4.0827 (3.9061, 4.2671)
Forest (ind.) -0.0494 (-0.1531, 0.0537) -0.0611 (-0.166, 0.0401)
Elevation -7e-04 (-9e-04, -5e-04) -6e-04 (-8e-04, -4e-04)
Slope 0.0089 (1e-04, 0.0171) 0.0105 (0.003, 0.0182)
5.3444 (4.8574, 5.8351) 5.8369 (5.0705, 6.6663)

Stage 2

(intercept) 0.0571 (-0.3231, 0.4236) 0.2664 (-0.2435, 0.7655)
Intentional (ind.) 0.0318 (-0.1314, 0.2) 0.1046 (-0.0834, 0.2855)
Summer (ind.) -0.2318 (-0.3787, -0.0807) -0.2072 (-0.3604, -0.051)
Forest (ind.) 0.0479 (-0.1735, 0.2697) -0.0894 (-0.3238, 0.1393)
Elevation -8e-04 (-0.0013, -4e-04) -9e-04 (-0.0016, -2e-04)
Slope -0.0121 (-0.0276, 0.0028) -0.0157 (-0.0306, -5e-04)
(iid) 2.1898 (2.1389, 2.2415) 2.2498 (2.1709, 2.3471)
6.694 (5.6753, 7.7473) 8.1362 (6.7259, 10.0697)
-0.0395 (-0.1598, 0.0836)
Table 6: We present the posterior mean as well as the 95% Credible Intervals for all covariates in our analysis of the Spain Fire data in the Castilla-La Mancha region of Spain. Values in bold indicate coefficients whose credible intervals do not contain 0.

References