This paper concerns preferential sampling (PS), where the locations selected to monitor a spatio–temporal process , depend stochastically on the process they are measuring. PS is a special case of response–biased sampling. The space–time point is defined as . denotes the spatial domain of interest and denotes the temporal domain. Purely spatial processes (i.e. when is a singleton) is a special case.
To gain understanding of the process , a set of time points at which to observe are selected. Then, for each , a set of sampling locations are chosen. Generally, the temporal domain is a finite set, with a time–averaged quantity for practical reasons. Typically, is not observed directly and instead a noisy observation is taken instead. The noise could be due to the presence of measurement error (i.e. the nugget effect) or other factors. may represent a set of points in space (i.e. ), or a set of well-defined areal units (i.e. ). In this paper, these two cases are referred to as the geostatistical and discrete spatial settings respectively. In the latter case, observations will generally represent spatial–averages of .
Difficulties arise with the estimation ofwhen are selected in a preferential way. This is because most statistical methods for modeling spatio-temporal data, condition on the locations as fixed (diggle2010geostatistical; cressie2015statistics). Such models assume that locations were selected under complete spatial randomness. Departures from this assumption in the true sampling scheme can lead to large biases in the prediction of the process (watson2018general). Hereafter, models that consider the locations as fixed are referred to as ‘naive’.
Additionally, a set of covariates may exist that influence the choice of sampling locations . These covariates may also be associated with the underlying process being modeled . When this occurs, including the neccessary in a ‘naive’ regression model for may partially remove the deleterious effects of PS on the spatio-temporal prediction of (gelfand2012effect). This becomes a regression-adjustment approach to help correct for the departure from a complete spatial randomness sampling design for . Covariates common to both the sampling process and , are hereafter referred to as ‘informative’ as in gelfand2012effect.
Preferential sampling has been identified as a major concern across multiple fields. In ecology, PS may occur due to sightings data being comprised of opportunistic sightings or poorly-designed surveys. Observers frequently focus their efforts in areas where they expect to find the species, leading to PS (fithian2015bias; watson2019general). A consequence of this is that ‘naive’ estimates of the geographical distribution of a species may be severely biased. Species’ abundance estimates have also been shown to be affected (pennino2019accounting). PS should also be considered in the analysis of environmental data recorded from tagged animals. dinsdale2019modelling demonstrated this with a case study using sea surface temperature recordings from tags attached to Elephant Seals in the Southern Indian ocean. The seals’ preference for cooler waters led to biased ‘naive’ spatial estimates of sea surface temperature.
In environmental statistics, the deleterious impacts of PS have been highlighted. For example, pollution concentration levels throughout and are commonly estimated using noisy observations, , recorded from environmental monitoring networks (shaddick2018data). Here, the locations of the monitors in a network , may have been chosen in a preferential way to meet specified objectives (schumacher1993using). For example, urban air pollution monitoring sites are sometimes used for detecting noncompliance with air quality standards (ozone05; loperfido2008network). In such settings, observations will likely lead to overestimates of the overall levels of the air pollutant, , throughout and . These biased ‘naive’ estimates may then be unsuitable for assessing the impacts of on human health and welfare (lee2015impact).
Previous PS tests have been developed for continuous spatial data, but limitations hinder their general use. Firstly, schlather2004detecting
developed two Monte Carlo tests. Their null hypothesis assumes that the data are a realization of arandom-field model. They assume: the sampled point locations are a realization of a point process on , the recorded values (called marks) of the points are the values of a realisation of a random field on , and are independent processes. Independence here implies a non-preferential sampling mechanism. To detect departures from the null hypothesis, the authors define two characteristics of marked point process, denoted and
. These represent respectively the conditional expectation and conditional variance of a mark, given that there exists another point of the process at a distance. These are chosen since under the null hypothesis E and V should be constant. Monte Carlo tests are used to assess departures of estimates of E and V from a constant function. This approach requires the assumption of Gaussian observations and hence does not generalise to non-continuous marks.
developed an alternative simulation-free test for PS. Instead of fitting a parametric model for the marks, their approach instead divides the region
into non over-lapping subregions. These are assumed to be approximately independent, generating approximately IID replicates of the test statistic. The spatial range ofcan be thought of as representing the inter-point distance required for two observations of to be approximately independent. Finding a suitable set of subregions required for their test may prove a challenge when the spatial range of the correlation of is large relative to the size of . Furthermore, this test requires very large sample sizes; their application used a sample size of over 4000.
For modeling PS directly, it is common to take a model-based approach. Approaches often simultaneously fit a model for the observation process, , with a model for the sampling process, , within a joint-model framework (diggle2010geostatistical). Linear combinations of any spatio-temporal latent effects used to describe are shared across the linear predictors of the two processes. This sharing of latent effects helps to capture any stochastic dependence that may exist between the two processes. A nonzero effect estimate of any of these linear combinations provides evidence that PS is present (watson2018general). Whilst this approach has been successfully applied to mitigate PS, the use of this approach to test for PS may be out of reach of many researchers. These joint models are currently not implemented in many popular software packages, are computationally intensive to fit and can be difficult to design and interpret. Note that design-based approaches have also been introduced for specific scenarios (zidek2014unbiasing). Hereafter, the collection of spatio-temporal latent effects are denoted .
Due to the computational challenges of fitting joint models and the lack of generality of the current PS tests, PS appears to be often overlooked. Researchers may have non-Gaussian data, or have too small a sample size to perform either test. Consequently, without the ability to test for PS, researchers may then fit ‘naive’ models to preferentially sampled data. The potential consequences of PS on their inferences may then be ignored. Fortunately, in many situations, a sufficient set of informative covariates may be available. These can help to control for the PS, without the need to fit joint models. Verifying the the existence of would allow researchers to confidently continue to use their preferred methodologies and packages.
This paper presents a computationally fast method for detecting PS. The algorithm for implementing the test is both intuitive and easy to program. The method primarily requires that the researcher be able to predict the values of , and any latent spatio-temporal effect , throughout and . Any preferred ‘naive’ method can be used. The method is general in that it can test for PS in both the geostatistical and discrete spatial settings, and can be used when the responses (marks) are non-Gaussian and even non-continuous. A general algorithm is provided for all settings. The test can also be adjusted for covariates. This allows researchers to discover a set of relevant covariates that can control the PS.
Qualitatively, PS has a clear appearance in continuous spatial data. PS often appears as a clustering of locations chosen to observe in regions where one or more is either high or low. The test in this paper directly targets this excess clustering. First, a suitable point process is fit to the observed locations to capture the true sampling process under the null hypothesis of no PS. Then, Monte Carlo (MC) realisations of the point process under the null are generated. The magnitude of correlation between the degree of clustering and the estimated values of is computed for both the observed data and the MC realisations. If a stronger correlation is observed in the observed data compared with the MC samples, then evidence for PS has been found. The mean of the K nearest neighbours is our default recommendation to capture the degree of clustering as this quantity may also be used in the discrete spatial setting. In the discrete spatial setting, a Bernoulli sampling process is instead fit to a population of well-defined areal units under the null. A clustering of areal units chosen to observe in regions where is either high or low indicates PS.
The paper is organised as follows. Section 2 introduces the assumed marked point process data generating mechanism for the geostatistical setting. Then, the algorithm and properties of the PS test are described. Section 3 repeats the above for the discrete data setting. Section 4 demonstrates the power of the test to detect PS in a thorough simulation study. The joint effects of the: sample size, spatial smoothness of , spatio-temporal covariates and the magnitude of PS on the power of the test are discussed. Section 5 applies the test to two real datasets previously analysed in the literature. The PS test can be performed using the R package PStestR.
2 Preferential sampling in geostatistical data
In continuous spatio-temporal settings, observations are taken at a set of point locations within the study region at each time step . Standard approaches for modeling from a set of observations include variogram analysis and kriging-based methods (digglemodel). These methods fall under the umbrella term of “geostatistical methods” and require the assumption that the locations chosen to observe the process were not preferentially sampled (diggle2010geostatistical).
For the modeling of point patterns in space and time, spatio-temporal point processes are the standard statistical toolbox (illian2008statistical; baddeley2015spatial). This class of models will be used throughout our paper to explain the observed point patterns through time. Standard ‘naive’ geostatistical methods require the assumption that the sampling process generating the sampled locations be independent of the underlying spatio-temporal field . This assumption implies no PS and simplifies the analysis greatly. Here, the point pattern and the marks may be investigated separately using standard techniques. However, when this assumption is violated, the two processes must be considered together. Marked spatio-temporal point processes should be considered as a formal framework for such a data analysis (schlather2004detecting).
The PS test we are about to describe requires the following three assumptions. The final assumption describes the assumed characteristic behaviour of the PS.
The PS is driven by some or all of the spatio-temporal latent effects .
All latent effects driving the PS are spatially ‘smooth enough’ relative to both the size of the study region and the number of locations chosen to sample the process .
The density of points within at space-time point depends monotonically on the values of the components of driving the PS.
Assumptions 2.1 - 2.3 imply that preferentially sampled data will appear as point patterns that are clustered in space for each . These clusters will focus around regions where relevant elements of are especially high or low, depending on the direction of PS. The first inferential goal becomes the detection of monotonic associations between the degree of clustering throughout with the values of relevant . If PS is detected, then the second inferential objective becomes the determination of whether or not clustering can be explained by a set of informative covariates . That objective is achieved if such a set removes all of the PS-associations.
The ranked nearest neighbour distances between the sampling locations is proposed as a default choice to measure the magnitude of clustering. Following the recommendations of gignoux1999comparing, edge-corrections for these distances are not considered within the Monte Carlo algorithm. Many other quantities can be chosen to capture local clustering and may be more suited for specific generating mechanisms. The PS test developed in this paper can easily be modified to use another quantity. The ranked nearest neighbour quantity is chosen for its generalisability across both discrete and continuous spatial settings.
2.1 Assumed model for preferential sampling
Many spatio-temporal point processes have been developed, with each possessing fundamentally different properties. An appropriate choice for a given analysis depends upon the sampling protocols that generated . For example, Gibbs point processes allow for second-order effects such as inter-point attraction and repulsion to exist between points. A limiting case is seen in the Hard Core process. This process does not allow for points to exist within a distance , called the ‘range of interaction’. Cluster processes provide a class of point processes that describe the locations of ‘parent’ points with a separate process from their ‘daughter’ points (baddeley2015spatial). Many more processes exist and may prove useful in applications.
The simplest class of spatio-temporally varying point processes is the inhomogeneous Poisson process (IPP hereafter) (illian2008statistical). The IPP is completely defined by its intensity function . This is defined as the expected number of points per unit area and time immediately around . Let . Define two disjoint space-time cubes . Then the numbers of points that fall within the two space-time cubes
Gaussian processes can be added to any linear predictor used to model the natural logarithm of . then becomes a log-Gaussian random field and the process becomes known as a log-Gaussian Cox process (LGCP hereafter) (simpson2016going). LGCP models are especially useful for modelling point patterns when residual spatio-temporal correlations are expected to remain in the intensity, even after including any available covariates. In these cases, the are given spatio-temporal correlation structures. The are then referred to as spatio-temporal Gaussian random fields.
A base LGCP model is now introduced for describing the sampling process of in many geostatistical settings. This model is very general. To ease the notational burden, only one latent effect is considered and denoted . This constraint can be relaxed. Note that when the sampling protocols generating clearly deviate from the LGCP assumption seen in (1), other point processes should be considered. Details of other processes are found in baddeley2015spatial and illian2008statistical.
With a slight change of notation, removing the subscripts to improve readability, let denote the observation process at location and time . This may be of any type (e.g continuous, count, binary etc.). Let denote the target spatio-temporal process and let denote the spatio-temporal latent Gaussian random field. As before, let denote the collection of sampled points at time . The following data generating mechanism is now assumed:
Square brackets denote random variables. In equation (2),
represents the conditional probability distribution, given the target latent spatio-temporal effect , and given the location was sampled at time (i.e. ). Values of are missing at all non-sampled locations. The link function describes the relationship between the linear predictor and the target spatio-temporal process . Thus, the model contains the popular class of generalised linear geostatistical models (digglemodel). The regression equation for is specified in (4), with fixed covariates .
In equation (3), the sampling process is modeled as a LGCP with intensity function (5). Unique fixed covariates and shared covariates both describe the intensity. The shared covariates are transformed by functions that may be nonlinear. Note that, conditional on , is assumed to be a realisation from an IPP. The function allows for nonlinear transformations of the spatio-temporal Gaussian process to be included in (5). This specifies the nature of PS.
The PS test developed in our paper is highly general. When is strictly monotonic, the primary goal of the test is to detect the monotonicity of . The precise form of does not require specification. Suppose and at least one element of is non-zero. The subset of covariates
corresponding to these nonzero elements provide a sufficient set of informative covariates required to control for PS. The second goal of the test is to correctly identify this subset of covariates. Note that the covariance matrix of the vector ofvalues evaluated at is denoted . This also requires estimation. Finally,
are hyperparameters to be estimated in the model. Bothand may be estimated using a maximum likelihood approach, or, given prior distributions, and then estimated under a Bayesian approach.
Including a sufficient set of informative covariates in a model for should help to improve the prediction of across and , by reducing the deleterious impacts of PS on spatial prediction. However, it must be stressed that this approach is not a silver bullet; if the data () were instead collected via a complete spatial randomness sampling design, then the predictive accuracy of the fitted model would likely be improved (gelfand2012effect). Thus, these methods should be viewed only as a partial remedy for badly sampled data rather than as a justification for ignoring the need for good spatial design of networks and surveys.
Finally, the conditional likelihood of the LGCP given is:
with being the area of the domain and being the length of the time set.
2.2 Monte Carlo algorithm
Assume the above data generating mechanism. A Monte Carlo algorithm is now designed for testing the null hypothesis that , versus the alternative hypothesis that is a monotonic function of . Under the null hypothesis , the observation and sampling processes are conditionally independent given . Thus, given , no associations are expected to exist between computable quantities from the fitted (null) IPP and estimates of .
Conversely, suppose that the null hypothesis is false. Specifically, let be a monotonic increasing function of . Point patterns from this data generating mechanism are expected to exhibit an excess of clustering in regions of high , relative to that explained by the null model. This phenomenon is referred to as positive PS. Here, a positive association between the localised amount of clustering and estimated values would be expected. The converse holds when is a decreasing function.
The primary challenge is defining what constitutes a ‘strong’ association between estimates of and the computed quantities used to capture excess localised clustering. Positive spatio-temporal correlations are present in due to
. This leads to non-standard sampling distributions for test statistics computed to capture association. Standard hypothesis tests of association (e.g. t-tests, rank-correlation tests etc.,) will have a type 1 error above the specified level due to the positive correlations.
This is why Monte Carlo methods are used. An empirical p-value associated with any desired test statistic can be computed by sampling realisations from the assumed IPP under the null hypothesis (i.e. fixing ). The application generalises to any given dataset. Crucially, this procedure accounts for the nonstandard sampling distribution of the chosen test statistic in a natural way. The mean of the nearest neighbour distances from each observed point is our default choice of computable quantity. Small values of this quantity within a region, indicates the presence of clustering there. When , this reduces to the nearest neighbour distance. For the default choice of test statistic, the Spearman’s rank correlation coefficient between estimates of at locations and the mean nearest neighbour distances is proposed. Unlike Pearson’s correlation coefficient, Spearman’s rank correlation measures the degree of monotonicity of , instead of the degree of linearity.
We now state the probability distribution function of the (spatial) distance from any point of to its nearest neighbouring point for the IPP data model. Let ( define a reference space time point and let be the time interval of interest respectively. Next, define as the ball of radius centered at s. Let once again denote the intensity function for the assumed IPP model under the null hypothesis . The following assumption is made:
Under Assumption 2.4 and the above data generating mechanism, the probability that the nearest point of from , lies within a spatial distance , at some time is:
Equation (8) gives us the following intuitive result for IPPs. When , the expected nearest neighbour distances are lower in regions of high intensity (i.e where is high). This is to be expected - the intensity function at precisely defines the expected density of points immediately around .
The result has also been derived when , under the alternative LGCP model. coeurjolly2017palm derived the Palm distribution for LGCPs. The authors showed the remarkable result that conditional on a single point in lying at location , the remaining points of are also a LGCP. The second order joint intensity function describing this process can be thought of as representing the intensity, conditional on . Using the authors’ result, this remains a log-Gaussian random field, with the conditional process differing only in the mean.
, with a positive monotonic function.
The covariance is strictly non-negative, i.e.
is stationary and isotropic:
The correlation function decays monotonically as the distance from the conditioning point s increases: .
The correlation function decays monotonically as the distance from the conditioning time increases: .
Under Assumption 2.5 and the above data generating mechanism, the second order joint intensity function at ) is:
Then, using the law of total expectation, the probability that the nearest point from , lies within distance , at some time is:
Thus, within the linear predictor, the only change from the original LGCP intensity (5) is the addition of the term . This is the covariance function between the conditioning point and the space-time point . Under Assumptions 2.5 - 2.9, the intensity immediately around will always be higher. Contrast this with the IPP. Here, the knowledge of a point of existing at does not affect the intensity immediately around . Note that the conditional intensity of the point process monotonically increases with and .
Assumptions 2.6 - 2.9 are commonly made in practice. They imply that the latent process will be expected to be more similar at two space-time locations that are ‘close together’ than two space-time locations that are ‘far apart’. Popular choices of correlation functions include the Matern correlation function across space (digglemodel) and autoregressive correlation functions across time. Spatio-temporal correlation functions are often defined to be the products of these spatial and temporal functions for computational simplicity (blangiardo2015spatial). Under these models, the correlation often becomes negligible at spatial (temporal) distances greater than some value, often called the spatial (temporal) range.
Under Assumptions 2.5 - 2.9, equation (10) now helps to explain the suitability of our choice of nearest neighbour distance to capture the excess clustering. Firstly, suppose the conditioning point is in a region where the latent effect is above average (i.e. ). Here, we see that the expected nearest neighbour distance from decreases monotonically as: i) increases and ii) the correlation function increases. Condition ii) implies that the nearest neighbour distances will decrease as the size of the spatial and temporal ranges increase. Note that the converse results hold when is in a region of low .
Thus we have shown the result we wanted. Under Assumptions 2.6 - 2.9, with monotonic in either direction, a monotonic association is expected between the nearest neighbour distances between the observed points and the values of at . This monotonic association should be captured with our rank correlation test statistic when . Conversely, under Assumption 2.4, no association is expected, so long as the fitted null IPP is correctly specified. In this case, the observed test statistic should be no more extreme than the Monte Carlo realisations.
To define the test requires some additional notation. Let be a finite set of time intervals. Let denote the observed number of points at time . Let denote the set of K nearest indices from each point . Let denote the estimate of . Define the superscript above each of these quantities as the index of the Monte Carlo sample . Thus denotes the set of K nearest indices from point in the Monte Carlo sample . Finally, let denote the mean of the distances to the nearest points from point at time in the original data and the Monte Carlo sampled point pattern respectively. Thus:
The terms and are defined to be the vectors of length and containing the values of and respectively. When calculating (9) in the original dataset, simply drop the superscripts.
The Monte Carlo algorithm, referred to as the test hereafter, is defined above in Algorithm 1. It can now be summarised as follows. First, fit the assumed models (2) and (4) for both and . Next, estimate throughout and compute the averaged K nearest neighbour distances . Using the estimates at and , compute the Spearman’s rank correlation coefficient between them for each . This is the observed test statistic.
Next, sample realisations, , from the fitted point process model (4). For each of the realisations, repeat the procedure. Compute the distances and estimate the values at to obtain . Finally, compute the desired empirical p-value. For the pointwise tests, simply evaluate the proportion of the Monte Carlo-sampled that are more extreme than . For Monte Carlo envelope tests that do not suffer from the problems of multiple testing, refer to mrkvivcka2017multiple; myllymaki2017global.
The values of the latent field are not known and must be estimated. The power of the test to detect PS, may depend upon the suitability of the method used to produce estimates . Likelihood-based approaches for fitting the above observation model (i.e. components (1), (3) and (5)) have been shown to have many nice properties. Asymptotic consistency has been proven under the null hypothesis that for certain choices of . These include Bernoulli (ghosal2006posterior) and Gaussian (choi2007posterior) likelihoods. Such results help to justify the suitability of the method. Furthermore, under the null hypothesis
, some ‘naive’ approaches even produce unbiased estimates of. For example, under the above data generating mechanism, with
the normal distribution andthe identity function, the Gaussian process regression is the best linear unbiased predictor for (cressie1992statistics).
Other choices of a computable quantity for capturing spatial clustering can be made. These may be more suitable for certain data generating mechanisms of . However, few choices are as generalisable across both continuous and discrete spatial data. For continuous spatial data, smoothed estimates of the residual measure from the fitted (null) point process, evaluated at the points may be suitable. However, this depends upon two tuning parameters: the details of the discretisation method chosen to approximate the likelihood and the choice of the bandwidth used to smooth the estimated values.
The nearest neighbour method does not suffer these drawbacks and has many desirable properties. It generalises across both continuous and discrete spatial settings. Secondly, different choices of can lead to improved powers to detect PS under different sampling processes. For example, when the spatial scales of clusters within are very small, and hence when clustering is very localised to only a few points per cluster, smaller may lead to improvements in power. This is because larger values of may ‘smooth over’ any clustering. Conversely, when clusters are large in spatial scale, with each cluster being comprised of several points, the power may be improved with larger choices of . Here, the additional smoothing can reduce the variance of the computed test statistic. The final benefit of the nearest neighbour quantity is that the distances can be computed exactly, with values not dependent upon any choice of computational approximation.
In some applications it may be suitable to fix the sample size across the Monte Carlo samples (i.e. ). For example, regulatory standards may dictate the required number of monitoring sites . The assumption of conditional independence between the sampled locations under the null IPP makes this especially easy.
Strictly speaking, since in practice the values of and the parameters in (5) are not known and are only estimates, the plug-in test of Algorithm 1 will be invalid. This is because the null hypothesis is composite (baddeley2017two). However, tests which ignore the effects of parameter estimation will tend to be conservative in most cases. A loss of power is typically the price to pay (dao2014monte). Whilst the method introduced by dao2014monte can be used to ensure the test attains nominal type 1 error, the required nested Monte Carlo simulations dramatically slows down the implementation. In the simulation studies of Section 4, the test defined in Algorithm 1 is found to be conservative across all tested simulation settings. Thus we do not consider this matter further.
P-values have come under increasing criticism recently (see wasserstein2016asa and references within). Indeed, the computation of an empirical p-value alone to identify the binary presence/absence of PS within a dataset has its flaws. For example, it does not help to quantify the potential magnitude of the biasing effects that the PS may have on the spatial prediction of . Furthermore, as with all p-values, it is easy to fall victim to the p-value fallacy. A given p-value does not provide much information on its own. A value close to 0.05 neither provides strong evidence in favour of the alternative hypothesis vs. the null hypothesis, nor implies that the frequentist error probability is close to 0.05 (sellke2001calibration). Additional steps must be taken to make such inferences, such as the use of the calibrations introduced by sellke2001calibration. In summary, any reported p-values from the tests outlined in algorithms 1 and 2 must be used with care.
Assuming the correct data generating mechanism is specified and the true parameters are known, the test will be exact regardless of how small is chosen. For testing at the 5% significance level, could be chosen as low as 19. However, this comes at a cost of power, with the loss of power proportional to (davidson2000bootstrap). Furthermore, a small
implies a high standard error of the empirical p-value. This leads to a test whose outcome is heavily dependent on the precise sequence of random numbers used to implement the algorithm. To alleviate these concerns,should be chosen as large as is computationally feasible.
3 Preferential sampling in the discrete spatial data setting
In the discrete spatial data setting, observations are taken across a set of areal units within the study region . Examples of areal units include electoral districts and large survey transects. The sizes of these areal units may be irregular, and are assumed known. It is also assumed that the full population of all possible areal units that were available for sampling at each is known. This population is denoted . A binary process is fit to emulate the true sampling process. The choice between a Bernoulli and Binomial model depends on whether or not the constraint is implemented. Once again, a Monte Carlo approach is taken for testing for PS.
3.1 Assumed model for preferential sampling
Given the population of areal units available at time , denoted , define the site-selection indicator variables as follows. Let
denote the indicator random variable that theareal unit in is selected at time . Then the collection of sampled areal units at each time , is simply the subset of the population of areal unit whose indicator variables take value 1 (i.e. ).
Next, define to be the fixed spatio-temporal covariates for the indicator selection process at areal unit A. These will typically be areal-aggregate or areal-count values. Similarly, and will typically be areal-aggregates of the underlying spatio-temporal process and the spatio-temporal covariates respectively. In applications, will typically be modeled as a discrete spatio-temporal process on the areal unit scale instead of as a continuous process. Examples include the conditional autoregressive process and its spatio-temporal extensions (besag1974spatial; blangiardo2015spatial).
The same model form is assumed for the observation process in (1). The only change made is the spatial scale. Thus, the new sampling process is defined as:
For each time step , it is assumed that each of the areal units within the population has values sampled or not according to the outcomes of the independent Bernoulli trials defined in (12).
3.2 Monte Carlo algorithm
All of the same assumptions and issues outlined earlier carry over to the discrete spatial setting. Once again, must be spatially smooth across the areal units and estimates of must be available at each of the areal units in the population at each time . Nearest neighbour distances between areal units within can once again be used. Such distances can be defined relative to the areal unit-centroids or otherwise. Strictly speaking, the PS is no longer seen as a clustering of the point pattern around high (or low) values of . Instead, the PS takes the form of a clustering of the areal units with complete data around high (or low) values of . The procedure is defined in Algorithm 2 above.
4 Simulation Study
This section summarises the key results of an investigation into the performance of the test. The power of the test is demonstrated across a range of simulated data settings. A more thorough treatment of the simulation study is provided in the supplementary material. All computations involving point processes were performed using the spatstat package (baddeley2015spatial).
The following data generating mechanism is chosen for the Gaussian response simulation study:
The simulated data are in the purely spatial setting (i.e. —T— = 1), with the specified as noise-free observations of . is a realisation of a mean-zero Gaussian process with Matern covariance matrix . The Matern roughness parameter
is set to 1 and the standard deviation ofis fixed at 1. The spatial range of the process is adjusted. The spatial range is defined here to be the distance at which the spatial correlation drops below 0.1. A larger implies the process has a greater spatial smoothness (i.e. a lower frequency).
The sampled locations are generated from a LGCP with a single covariate . The number of points is fixed equal to , and thus the true process is a Binomial point process. The parameter determines the magnitude of PS, with corresponding to the null IPP model of no PS. Again, is an independent realisation of another Gaussian process with Matern covariance function. Both the roughness and the standard deviation are again fixed at 1, but the range parameter is varied independently from . The values of are assumed known throughout . The parameter determines the effect of the covariate on the intensity .
The test is performed at the 5% significance level using 19 Monte Carlo samples (i.e. ). implies that these results provide a lower bound on the power of the test (davidson2000bootstrap); a higher power would have been attained had a larger value been chosen. All tests are performed with the two-sided alternative hypothesis that is a monotonic function of . Each experimental setting is repeated 200 times. In the study, all combinations of the following parameters are evaluated:
Spatial range of ,
Spatial range of ,
Number of nearest neighbour distances
Along with the test outlined in Algorithm 1, a Monte Carlo test using estimates of the raw residuals of the assumed IPP under the null hypothesis is also computed. This time the rank correlation between the estimates of and the estimated residuals is the test statistic. As before, both sets of estimates are only evaluated at the point locations. The estimated residual values are simply kernel smoothed raw-residuals. An edge-corrected Gaussian kernel is used with bandwidth selected using leave-one-out cross-validation. We refer to this test as the residual test hereafter. It is interesting to assess the relative performance of the test, given its generality across all point processes and to the discrete spatial setting. We now summarise the results of the study.
First, our investigation strongly suggests that the type 1 error is bounded above by the nominal level. This is seen across all simulation settings when the null hypothesis is true. Thus, it appears that the computationally costly nested Monte Carlo approach of dao2014monte need not be used, except in the interest of improving the power of the test. Fig. 1 shows the results for in the simplest setting without covariate effects (i.e. ) or PS effects (i.e. ). The spatial range is changed and the test is performed across different value of . It is apparent that both Monte Carlo tests attain Type 1 error at or below the 5% level. For comparison, the two standard simulation-free rank correlation tests attain a Type 1 error well above the 5% level. The error of these tests increases dramatically with because they ignore the spatial correlation and hence the non-standard sampling distribution of the test statistic. We omit the simulation-free results hereafter.
Next, the power is assessed. Across all the simulation settings, the power improved with increasing spatial range . This is in agreement with the earlier results (9) and (10). must be spatially-smooth to achieve high power. Furthermore, the power of the test is found to be sensitive to the choice of value. Optimal choice of depends upon both the spatial range of and the sample size. Larger values of both implies that a larger value of should be chosen. Fig. 2 shows the results for the setting where no covariate effects exist (i.e. ), but where moderate positive preferential sampling occurs (i.e. ). For , the test has a slightly lower power than the residual test. This difference diminishes as the sample size increases. Conversely, the test outperforms when both the spatial range is very small () and when the magnitude of PS is very high (). Fig. 7 in the supplementary material demonstrates this. Under these conditions, very small clusters form. However, a different choice of bandwidth-selection method may improve the power of the residual test.
Strong covariate effects in the sampling process hurts the power of all the tests. This can be seen in Fig 8 in the supplementary material. Interestingly, the test is shown to be competitive across all settings tested, except one. When the spatial ranges of both the covariate and the field are large and similar, the power of the test is very low. Here, the residual test, with residuals computed from the fitted IPP, performs much better. The power is almost doubled that of the test. In these settings, despite the and arising from independent distributions, the empirical correlations between them in any given realisation may be large. Consequently, the test may be unable to distinguish between the clustering due to , and the clustering due to the measured covariate . This is because the test, unlike the residual test, does not directly adjust for . However, the IPP residual test is not always superior. When the is very low, the test has higher power when .
The performance of the tests are also assessed in settings where the response is non-Gaussian, and when the true sampling process is not an IPP. (14) is replaced with a Poisson distribution and the true sampling process is set equal to a Hardcore process. Different radii of interactions are compared. The Hardcore process is purposefully chosen. Here, the use of nearest-neighbour distances to capture additional clustering is poor. Since the nearest neighbour distances are lower bounded, the contrast in their observed values will decrease as the radius of interaction increases. Conversely, estimates of the smoothed residuals will not be directly affected. Figure 11 in the supplementary material clearly shows that the test based on the smoothed residuals far outperforms the test when the radius of interaction is high. This demonstrates a clear need for the researcher to choose a measure of clustering that is suitable for the true sampling process. The power remains high for Poisson .
5 Case Studies
The ability of the test to detect PS in two real case studies is now demonstrated. These two datasets are chosen since the presence of PS within them has previously been shown in published work. In the first example, it is shown how researchers can easily detect positive PS and then search for a sufficient set of informative covariates using the test. In the latter example, negative PS is detected.
5.1 Great Britain’s black smoke monitoring network
Annual concentrations of black smoke were obtained from the UK National Air Quality Information Archive (airquality.co.uk). Site locations and annual concentrations average concentrations of black smoke () were obtained from the monitoring sites. Previous analyses demonstrated that the network had been preferentially sampled, with a positive monotonic function across the years of 1966-1996 (watson2018general; shaddick2014case).
The analysis is restricted to the 1966 concentrations. For reasons outlined in (shaddick2014case), only sites that gave readings for at least 273 days of the year (75% data capture) are considered. Fig. 3 shows that it is apparent that sites were located in the industrial cities around London, the Midlands and the North West of England, with almost no sites present in the industry-free Scottish Highlands. Thus the point pattern displays clear evidence of clustering around regions expected to have high concentrations black smoke.
The R-INLA package with the SPDE approach is used to fit a standard geostatistical model (rue2009approximate; lindgren2011explicit; lindgren2015bayesian). Other methodologies, Bayesian or frequentist, could be used. As in watson2018general the log-ratios of the concentrations are the choice of response. PC priors are placed on the approximate Matern field (fuglstad2018constructing)
. A prior probability that the spatial range is below 5km is set to 0.1, and a prior probability that the standard deviation of the field is above 3 is set as 0.1. The (mean-centered) posterior means of the log-transformed black smoke levels acrosswere then used as the ‘naive’ values.
Gridded residential human population count data with a spatial resolution of 1 km x 1 km were obtained for Great Britain. This was based on 2011 Census data and 2015 Land Cover Map data from the Natural Environment Research Council Centre for Ecology & Hydrology (UKgriddedpopdata). As in watson2018general, it is assumed that the relative population density across Great Britain remained approximately stable from 1966-2011. This is used as both a covariate for the null IPP sampling process and as a covariate for the observation process. Initially, the presence of PS is tested for without the population density covariate included in either process. This test is referred to as V1. Next, population density is controlled for as an informative covariate in both the observation and the null IPP sampling processes (it is found to be strongly associated with both). It is then investigated if PS remains. This test is referred to as V2.
Estimates of in the V2 test are thus corrected for population density. Population density is included as a Bayesian spline to capture any nonlinear effects of population density on the observed black smoke . Mechanistically, including population density as an informative covariate in a model for black smoke concentration is reasonable. Localised sources of black smoke include the combustion of carbon-based fuels, with expected levels of combustion expected to increase with population density. If PS is no longer detected after this adjustment, then the population density has explained away the PS. Conversely, if PS is still detected, then standard ‘naive’ methods may be biased even after controlling for population density.
|HPP P value V1||0.28||0.01||0.01||0.01||0.00||0.00||0.01||0.01|
|IPP P value V2||0.20||0.01||0.01||0.01||0.00||0.00||0.00||0.01|
Table 1 shows the empirical pointwise p-values of the tests with changing , under both of the assumed sampling mechanisms. Results are shown for both the and tests. The empirical p-values are the proportion of rank correlations in the 1000 Monte Carlo samples, that were more negative than observed in the data. Thus this is a 1-sided test. For , strong evidence is found against the null in favour of a positive monotonic under the HPP test. This remains under the IPP sampling process. Note that the p-values have not been adjusted for multiple testing.
The insights gained from these tests are as follows. A ‘naive’ model fit to the black smoke data without adjusting for population density may be biased due to PS being present. Whilst population density may explain some of the observed PS seen in the data, residual PS still remains after controlling for population density in a model for black smoke (IPP V2 result). Thus a sufficient set of PS-removing covariates has not yet been identified. Either this iterative process of finding a sufficient set of covariates must be continued, or a joint model should be considered as in watson2018general.
5.2 Galicia lead concentrations
The second real world dataset consists of the concentrations of lead in moss samples collected across Galicia, northern Spain, in 1997 (fernandez2000extended). The concentration is measured in micrograms per gram of dry moss. The 1997 locations were previously shown to have been preferentially sampled in the landmark preferential sampling paper by diggle2010geostatistical. In fact, in this example, a significant negative linear effect was found. Thus was found to be negative monotonic.
Fig. 4 shows a clear increase in the density of sampled locations in the northern half of Galicia. This half was found to have the lowest concentrations of lead. PC priors were again placed on the approximate Matern field. A prior probability that the spatial range is below 10km was set to 0.1, and the prior probability that the standard deviation of the field is above 3 was set to 0.1. The posterior mean log-transformed lead concentrations were used as the values. Empirical p-values were computed using 1000 Monte Carlo samples. The direction of the inequality was reversed this time, to test for negative PS.
The empirical pointwise p-values of this test are shown in Table 2. Even after considering Monte Carlo error, moderately strong evidence of negative PS exists, with the strongest evidence occurring at . A researcher would now have to decide whether to pursue a sufficient set of covariates, or fit a joint model as in diggle2010geostatistical.
6 Concluding Remarks
A fast and intuitive test has been presented for detecting preferential sampling (PS) in both geostatistical and discrete spatial data settings. The test is highly general; any preferred methodology for estimating a latent spatio-temporal process may be used. This includes both Bayesian and frequentist methods. In many situations, detected PS may be adequately described by a set of available informative covariates that are associated with both the sampling process and the observation process being measured. The test presented in this paper is able to identify such an informative covariate set. In cases where residual PS remains, even after controlling for a set of informative covariates, researchers can either seek a sufficient set of informative covariates, or seek a method that directly models the PS (e.g. a joint model).
In this paper the properties and validity of the test were demonstrated through an extensive simulation study. The suitability of the test for real-world applications was then confirmed through the re-analysis of two previously published case studies. Both case studies had previously detected PS, and the test successfully replicated the findings of both. The power of the test to detect PS was shown to increase with: the spatial range (i.e. the inter-point distance required for observations to be approximately independent), the sample size, and the degree of PS. The power decreased dramatically with the inclusion of covariates, independent from the observation process, that had a strong effect on the sampling process.
The biasing effects of PS on spatial prediction has been clearly demonstrated across a wide range of fields and can be severe in magnitude (diggle2010geostatistical; watson2018general; pennino2019accounting; dinsdale2019modelling). Thus, PS should not be ignored in spatial analyses. The test proposed in this paper has been shown to be suitable for assessing the presence of PS in most spatio-temporal analyses. Therefore, in cases where the sampling protocol is either unknown or known to be preferential, reporting the results from a PS test alongside any publication of spatio-temporal analyses should become standard practice. A user-friendly R package PStestR is available for implementing the algorithm. No extra work or computation time is required. The package works seamlessly with many of the commonly used data types (e.g. from the sp, sf and spatstat libraries (sp1; sp2; sf; baddeley2015spatial)) and can perform both pointwise and rank envelope tests, to alleviate the multiple testing problem.
Three avenues of research should be pursued in the future. First, how to best capture the localised clustering in specific sampling settings should be explored. For example, in this paper it was shown that the nearest neighbour distance may be a poor choice for measuring the degree of clustering under certain sampling processes. How to optimise the power of the test in specific settings should be pursued. Next, eliminating the need for generating Monte Carlo samples from the null point process may be possible (acosta2018effective). We found that ‘effective-sample size’-adjusted rank correlation tests showed very poor performance. Convergence rates of the test could be as low as 10% for specific simulation settings. Thus, we omitted the results. Further investigation here is warranted. Finally, adjustments are required for using this test in spatio-temporal applications where sampling locations are retained from one time step to the next. Environmental monitoring networks are an example. Here, the chosen network locations from one time step to the next are not independent. Additional work should be pursued to generalise the methods in this paper to such settings.
The author would like to extend his gratitude to Professor James V. Zidek FRSC, O.C. for his lively discussion and insightful feedback throughout.
7 Supplementary material
7.1 More details on the simulation study
For computational speed-ups, the R-INLA package is used for both the simulation and estimation of [rue2009approximate, lindgren2011explicit, lindgren2015bayesian]. A high resolution triangulation mesh (triangle lengths of 0.01) is defined for the SPDE approximation over the unit squares within . For the priors on the Gaussian process, pc priors [fuglstad2018constructing] are specified with prior probability of 0.1 that the spatial range is less than 0.1 and prior probability of 0.1 that the standard deviation is greater than 3. Gaussian errors on the responses are added, with a weakly informative Gamma(1, 5e) distribution placed on the precision of the error distribution. This is done to reduce the risk of computational singularities. The test is performed at the 5% significance level using 19 Monte Carlo samples (i.e. ). Each experimental setting is repeated 200 times.
Along with the test outlined in algorithm 1, a Monte Carlo test using the rank correlations between estimates of and estimates of the raw residuals of the assumed IPP under the null hypothesis is also compared. This may provide a more suitable test, when the assumed sampling mechanism is indeed a LGCP. Furthermore, such a test does not require a choice of . To compute these residual values, an edge-corrected Gaussian kernel is used to smooth the raw residuals. The bandwidth is selected using leave-one-out cross-validation. This is performed using the spatstat package [baddeley2015spatial]. To compute the test, the values are simply replaced with the smoothed residual values, evaluated at the point locations. We refer to this test as the residual test hereafter. It is interesting to assess the relative performance of the test, given its generality across all point processes and to the discrete spatial setting.
First, the Type 1 error of the PS tests are assessed in the simplest setting without any covariate effects (i.e. ) or PS effects (i.e. ). Results from four tests are compared. The first two are residual tests. The first computes the p-value directly using a standard permutation approach under the (false) assumption that the pairs of residuals and estimates are an IID sample from some bivariate distribution. The positive spatial correlations due to the process violate this assumption, with the magnitude of violation increasing with the spatial range . The second attempts to correct for this spatial correlation. By forming realisations from the estimated sampling process, the spatial correlation in is accounted for. The third and fourth are tests. Once again, comparisons are made between the permutation-based and the Monte Carlo-based approaches.
Fig. 5 shows the results for across three increasing values of the spatial range and across different numbers of nearest neighbours . It is apparent that both Monte Carlo tests attain Type 1 error at or below the 5% level. The two standard permutation tests attain a Type 1 error above the 5% level, and this increases dramatically with . At the highest value of , equal to the length of the domain , the Type 1 error can be higher than 40%. The results for the very low value of , demonstrate that the type 1 error approaches the nominal 5% level when the spatial correlation approaches zero. This is due to the IID assumption becoming more reasonable as the
tends towards Gaussian white noise. When, the prior distribution on the range parameter would reflect the case where a researcher incorrectly assumed spatially smooth data prior to the model-fitting. Fig. 9 shows the results for . It is apparent that the Type 1 error increases with sample size for the permutation tests, while the Monte Carlo tests remain bounded above by 0.05.
Next, the power of the two Monte Carlo tests to detect a PS effect when the alternative hypothesis is true is assessed. The behaviours of the tests are first investigated in the setting where no covariate effects exist (i.e. ), but where moderate positive preferential sampling occurrs (i.e. ). All tests are performed with the two-sided alternative hypothesis, namely that is a monotonic function of in either direction.
Fig. 6 shows the results for , this time with spatial ranges of , again across . The power results for are omitted in Figure 7, since the power is consistently small (¡0.1) for both. This demonstrates the need for the term to be spatially-smooth for the test to detect PS. It is clear that the power of the test is sensitive to the choice of value, especially for smaller sample sizes. Interestingly the optimum power achieved by the test with respect to depends upon both the spatial range of and the sample size. The higher the spatial range of , and hence the smoother it is, the greater the value of that is required to optimise the power. The optimum choice of also increases with the sample size, since the number of realised points per cluster increases. For example, when and the sample size is 50, the test is optimised when . This increases to when and the sample size is 100. Finally, for it appears that the tests have a slightly lower power than the residual measure-based test.
Next, the spatial range is fixed to be very small ( = 0.02), and the magnitude of PS is fixed to be very high (). This set-up leads to very small clusters to form when . The joint effects of sample size and on the power of the test to detect PS is then demonstrated. Additionally, the power of the test is compared to the residual test. Three plots are shown to present the power vs. in Fig. 7. From left to right, these show sample sizes of and . For small sample sizes (), both tests have low power to detect PS as expected. Interestingly however, the test outperforms the smoothed residual test for all three sample sizes, achieving maximum powers of 0.21, 0.65 and 1 at compared with 0.14, 0.40 and 0.97 for the residual test. Furthermore, the power of the test attains it maximum at , before dramatically diminishing to 0 as increases. Fig. 10 shows the equivalent plots for . Here, the test is no longer more powerful, with the residual test performing better at . Note that the performance of the residual test may improve with a different choice of bandwidth-selection method.
Results are now presented for the case when a unique covariate effect exists for the sampling process. The magnitudes of the covariate effect and the PS effect are both set to 1 (thus ). The spatial range of the covariate effect is varied (). The results on the power of three different tests to detect PS are shown. As before, the first two are the kernel-smoothed residual and rank tests. The third test is a rank test using kernel-smoothed estimated residuals, but this time using residuals computed from an incorrectly specified point process fit to the points. This is chosen to be a homogeneous Poisson process (HPP hereafter). Note that the Monte Carlo realised points still come from the null IPP, fitted to the original observations . Thus the Monte Carlo sampled realisations still come from the correct data generating mechanism (correct up to parameter estimation error). Unlike the residuals from the first test, these residuals do not adjust for the covariate effect. The purpose of this comparison is to see if any improvements in the power of the test can be attained by considering computed quantities that directly adjust for any covariate effects.
The spatial range of the covariate field is changed for the following reason. When the spatial ranges of both the covariate field and the underlying spatial field are large and similar, then the magnitude of the empirical correlation of a single realisation of the two fields may be high. This is despite their realisations arising from independent distributions [hanks2015restricted]. A possible consequence of this is that the tests may be unable to distinguish between clustering due to an unknown process , and clustering due to the measured covariate . This may affect the ability of tests to detect preferential sampling, when their computed quantities are not adjusted for the effects of covariates. The rank test of the residuals from the correctly specified IPP model is the only test that directly adjusts for the covariate effects.
Fig 8 presents a complex interaction of different factors. When the covariate field has very low spatial range (i.e. ), and hence has high frequency, negligible correlation can exist between and . Consequently, no gains in power are seen when tests use covariate-adjusted measures of clustering relative to when they use unadjusted measures. However, when both the covariate and underlying field are very smooth (i.e. ), large increases in power are seen with the covariate-adjusted measure of clustering. The power increases to 0.57 compared with 0.40 and 0.33 for the HPP residual and the methods respectively. The results are similar for (see Fig. 12 in the supplementary material). In conclusion then, in cases where the spatial ranges of informative covariates are large and similar in size to the underlying , computed quantities other than should be considered to improve the power to detect PS.
Finally, the performance of the tests are assessed in settings where the response is non-Gaussian, and when the true sampling process is not an IPP. The values now take the form of counts and (13) is replaced with a Poisson distribution. The log-transformed mean at location is set equal to the random field plus a constant intercept of 2. The intercept is chosen to ensure non-zero counts occur often. The true sampling process is set equal to a Hardcore process with two different radii of interactions, denoted , compared (0.025 and 0.05). Under these two processes, points within cannot be sampled closer than a distance apart of 0.025 or 0.05 respectively. With , these two constraints enforce moderate and strict levels of regularization of the points respectively, violating the IPP assumption of no inter-point interaction.
The Hardcore process is chosen to highlight the fact that the choice of nearest-neighbour distances to capture additional clustering will be poor in some settings. Here, due to the nearest neighbour distances being lower bounded, the contrast in their observed values will decrease as is increased. Estimates of the smoothed residuals are not directly affected by an increase in and hence the residual test is expected to far outperform the NN test as increases.
After sampling the data, the tests under two scenarios are compared. The first considers the case where the researcher assumes the correct Hardcore process sampling mechanism for the Monte Carlo realisations of . The second considers the case when the researcher misspecifies it as an IPP (i.e. assumes no inter-point interaction).
The results from this simulation study, repeated 100 times, are shown in Fig. 11. As expected, the residual tests far outperform the NN test when the radius of interaction is 0.05. This is due to the lack of contrast in the nearest neighbour distances that leads to a reduction in the power of the NN test. When the radius of interaction is 0.025, the contrast in the nearest neighbour distances is restored and both methods perform very well again. In this case, the power exceeds 0.95 when the correct Hard Core process is fitted. Interestingly, the residual test that use the raw residuals from the correctly specified Hard Core processes perform no better than the residual test that use raw residuals of the incorrectly specified HPP. On the other hand, the performance of the NN test improves when the class of point process is correctly specified.