A general framework for estimating the spatio-temporal distribution of a species using multiple data types

10/31/2019
by Joe Watson, et al.

Species distribution models (SDMs) are useful tools to help ecologists quantify species-environment relationships, and they are increasingly used to help determine the impacts of climate and habitat changes on species. In practice, data are often collections of presence-only sightings from citizen scientists, whose search effort may be highly heterogeneous throughout the study region. Many standard modelling approaches ignore search effort, which can severely bias SDMs. We present a novel marked spatio-temporal point process framework for combining data of differing type and quality. We apply this to combine multiple datasets collected on the endangered Southern Resident Killer Whales, to estimate their monthly effort-corrected space-use.



1 Introduction

Accurate knowledge of the spatio-temporal distribution of a species is vital for ecologists to understand the effects of climate and habitat changes on the species, and for governments to implement successful management policies. Species distribution models (SDMs hereafter) are tools for predicting the density of a species and for relating it to measured environmental covariates. Unfortunately, most SDMs place stringent assumptions on the data collection protocols. In practice, data may severely violate these assumptions, and large biases in conclusions may follow. In response, methods that account for assumption violations have recently been developed (see miller2019recent for a recent review). This paper builds upon this work to present an integrative general framework for estimating the spatio-temporal distribution of a species.

It is crucial to consider search effort when fitting SDMs (miller2019recent). The search effort attributable to opportunistic, presence-only datasets (e.g. citizen science datasets) is often unknown. Furthermore, it is rarely of constant intensity throughout the study region, and may be highest where the density of the species under study is highest (pennino2019accounting). This problem is often referred to as preferential sampling (watson2018general). Most standard SDM implementations for presence-only data assume a constant search effort intensity across space, and violations of this assumption can severely bias conclusions (elith2007predicting; phillips2009sample; dorazio2012predicting).

Furthermore, integrating data of differing types into an SDM can be difficult. Often, multiple datasets of differing type (e.g. presence-absence, presence-only) and differing data-collection protocol (distance sampling, opportunistic, etc.) are available. Combining these offers the potential for greater precision in conclusions and thus improved management policies (fithian2015bias). Unfortunately, while many recent integrative models show promise for static species (e.g. plants), applying them to mobile species (e.g. mammals) in which individuals can be identified has largely been overlooked. Additional autocorrelations often exist in the sightings of mobile species due to animal movement, and ignoring these can bias inference (johnson2013estimating). We pay special attention to mobile species, where individual animals can be identified.

Presence-only datasets are common, and the last 15 years have seen an explosion of methodologies developed to analyse them (see hefley2016hierarchical; miller2019recent and references within). However, modeling presence-only data remains a statistical challenge, since no records are available of locations at which sightings were not made (i.e. absences). Furthermore, it is typical that little-to-no additional information is provided on the search effort (elith2007predicting).

Earlier methods typically made use of logistic regression and MaxEnt models to link noisy binary observations of occurrence to a latent 'true' occupancy state (phillips2006maximum; nichols2008multi). For the computation of both methods, a background sample of 'available' locations is required. Appropriate selection of these 'pseudo-absences' is nontrivial. The standard approach assumes uniformity of the true search effort throughout the study region (phillips2009sample). Bias can follow violations of this assumption, since areas identified as having a 'high' species density may simply be a result of high search effort in the region (dorazio2012predicting).

Various methods have been developed to extend MaxEnt and logistic regression to account for heterogeneous search effort. One approach selects 'available' locations where 'other' species of a similar "Target Group" (e.g. bird species) were found in the opportunistic sightings databases. This may account for spatially-varying observer effort if the sightings of the Target Group species share similar sources of observer bias. This approach is referred to as the Target Group Background method and has been shown to improve predictive performance when presence-only datasets are the only source available (elith2007predicting; phillips2009sample; fithian2015bias). However, output from this method must be interpreted as representing species composition, not the species' distribution (warton2013model). This is because regions inaccessible to observers and regions where the target group is rare cannot be distinguished (fithian2015bias).

Along with heterogeneous search effort, false positive and false negative recorded sightings are commonplace, and a failure to account for these can bias the understanding of spatio-temporal habitat use. Methods have been developed to extend logistic regression and MaxEnt to account for false negatives (nichols2008multi) and false positives (miller2011improving; hanks2011reconciling), as well as to account for correlations between sighting events from different observers at the same site (clare2017pairing). The joint modeling of multiple species by sharing model terms across the species' models, with the aim of improving predictive performance, was introduced by elith2007predicting. Fundamentally however, logistic regression and MaxEnt models are not scale-invariant. Consequently, combining data sources of differing type is not straightforward.

As a result, the scale-invariant inhomogeneous Poisson process (IPP hereafter) has emerged as the most promising integrative framework for jointly modeling data of varying type and quality (e.g. renner2015point; giraud2016capitalizing; koshkina2017integrated). These types include site count, presence-absence, and presence-only data, and all of the previous extensions carry over to the IPP. By sharing parameters and latent effects between the likelihoods of each data type within a joint model (e.g. giraud2016capitalizing; bedrinana2018integrating), strength can be borrowed and hence precision improved. For datasets of especially poor quality, allowing only a correlative relationship to exist between their models and the latent effects defining the SDM may be more robust (pacifici2017integrating). Finally, the IPP does not face the 'pseudo-absence' problem of previous methods (warton2010poisson).

False negatives and observer biases may be present in the data collection process. Thinned point processes extend the IPP, allowing them to be incorporated (chakraborty2011point). The IPP intensity is multiplicatively decomposed into a detection probability surface capturing observer bias, and a surface representing true intensity. The former can be linked to a set of covariates associated with the amount of search effort or detectability (e.g. distance to the nearest road or terrain ruggedness) (fithian2015bias). Use of these detection probability surfaces has been shown to improve the predictive performance of models (warton2013model). We refer to this approach as regression adjustment.

In some studies, observer effort may be known throughout the study region. Here, including covariates associated with observer effort may be unnecessary. Adjusting for known heterogeneous observer effort simply requires one to include it as an offset term in the linear predictor (merow2016improving; pacifici2017integrating; chakraborty2011point). This approach can also be used for including knowledge from expert maps (merow2017integrating).

Sightings of mobile species may be severely autocorrelated. Reported sightings of the same individuals may occur at a high frequency relative to their movement. This phenomenon may be more common in citizen science datasets due to a lack of standardized data collection protocol: observers may follow an individual, or share knowledge of sightings. To model the autocorrelated data directly, the species' movement process must be modeled. A failure to do this risks biasing estimates of the species' distribution (johnson2013estimating; mcdonald2013point). Furthermore, spatio-temporal correlations may also exist due to unmeasured spatially-smooth covariates and/or biological processes driving the true intensity of the target species (pacifici2017integrating; yuan2017point). Extending the thinned IPP framework to the log-Gaussian Cox process framework can help adjust for such autocorrelations (chakraborty2011point).

The above points are addressed within our general modeling framework. We adapt the raised incidence spatial point process model (diggle1990point) from spatial epidemiology to form the basis of our general framework. Our framework allows for the integrative modeling of mobile or static species. Datasets of differing type and data-collection protocol can be appropriately combined. We highlight the necessary assumptions required for the unbiased estimation of the intercept in the model. Observer effort can be controlled for if either detectability covariates exist, or if observer effort is known or can be emulated. Expert knowledge can also be incorporated into the analysis. Uncertainties regarding the observer effort and expert knowledge can be fully propagated through to resulting inference. Additional spatio-temporal correlations can also be captured through the inclusion of random effects. Implementation is made especially easy using the R package inlabru (R; inlabru), enabling researchers to adapt this framework for use in their applications.

The paper is structured as follows. First, we present our motivating problem: estimating the spatio-temporal distribution of the Southern Resident Killer Whale in the summer months. This species is of special conservation concern. Next, we introduce marked log-Gaussian Cox processes, and discuss their properties. After presenting our general framework, we then adapt it to our motivating problem. Finally, we design easily-interpretable maps to display the spatial distributions of the whales across the months. Computer code using the inlabru package (inlabru) is provided to enable researchers to adapt this framework for their applications.

2 Motivating Problem

2.1 An introduction to the problem

The Southern Resident Killer Whale (SRKW) population is listed as Endangered under both the Canadian Species at Risk Act (DFO_web) and the United States Endangered Species Act (NOAA_ESA_web) because of its small population size, low reproductive rate, limited prey availability, and exposure to a variety of anthropogenic threats. The range of the SRKW extends from southeastern Alaska to central California; however, between May and September, all three pods of these whales frequent the waters of both Canada and the United States, concentrating in the Haro Strait, Georgia Strait and the Strait of Juan de Fuca (DFO_web). We extend our study region beyond these areas (see Figure 1).

The development of successful and effective policies to help protect the SRKW requires accurate, high-resolution knowledge of how their space use evolves across the year. The SRKW are highly social animals, spending the majority of their time in three well-defined groups called the J, K and L pods (ford1996killer). Inter-pod variation in summer space-use exists (hauser2007summer). Due to the differing characteristics of the three pods, policy-makers may benefit from knowledge of pod-specific space-use to improve the effectiveness of management decisions. While SRKWs are known to favour the inshore waters of Washington State and British Columbia in the summer months (ford2017habitats; hauser2006evaluating), precise knowledge of their space use across the months is lacking, as is precise knowledge of the differences between the pods (hauser2007summer).

2.2 The data available

Multiple sources of SRKW sightings are available, including GPS-tracked focal follows of targeted individuals by citizen scientists, presence-only sightings from commercial whale-watching vessels, and opportunistic sightings reported by the public. However, we deem only two of these data sources to be suitable for use in an effort-corrected analysis. A source is judged to be reliable if we are able to estimate its search effort and if we can be confident that the source could accurately differentiate between the three SRKW pods.

The first source of reliable data is presence-absence data collected through a project funded by the Department of Fisheries and Oceans Canada. The data were collected between 2009 and 2016 from a GPS-tracked vessel operating on the west side of the Strait of Juan de Fuca. The movements (i.e. GPS coordinates) of the vessel were recorded at a high frequency, and all sightings of SRKWs were reported, along with the pod. The marine mammal observer on this vessel is an expert on the SRKW, and thus the reported pod classifications can be considered accurate. The motivations of the captain's data collection varied from year to year, and hence the spatial distribution of the search effort varied. We refer to this dataset as DFO hereafter.

The second reliable data source is the presence-only SRKW sightings reported by the whale-watch industry between 2009 and 2016, collected by two organisations. The B.C. Cetacean Sightings Network (BCCSN) (BCCSN_web) and The OrcaMaster database maintained by The Whale Museum (OM) both contain sightings from a vast range of observer types, but we exclusively model the whale-watch sightings, for three reasons. First, the whale-watch operators have a high degree of expertise on the SRKW, with vessels typically having a biologist or other expert onboard, resulting in accurate pod classifications. Second, the whale-watch companies are known to share the locations of sighted SRKW with each other. Thus, our dataset likely contains the majority of whale-watch sightings that were made, not just the subset made by the operators who report to the databases. Third, a vast amount of data has been collected on the activities of the whale-watching industry operating in the area. This enables us to estimate the observer effort from these companies with a high degree of accuracy and precision. We refer to this combined dataset as WW hereafter.

In fact, the majority of the data we collected on the activities of the whale watch industry came from the Soundwatch Boater Education Program (Soundwatch hereafter). Soundwatch is a vessel monitoring and public education outreach program that systematically monitors vessel activities around cetaceans during the whale-watch season (May - September) in the Haro Strait Region of the Salish Sea (seely2017soundwatch). Since 2004, Soundwatch has been using data collection protocols established in partnership with NOAA, DFO, and the Canadian Straitwatch Program. This includes detailed accounts of vessel types, whale-watch vessel numbers, and whale-watch vessel activities.

2.3 Previous work estimating the space use of SRKW

Previous work estimating the summer space use of SRKW has greatly assisted the development of critical habitat regions and has helped inform successful management initiatives. In the past 15 years, two pieces of published research have tackled the problem from different angles. hauser2007summer estimated the space use of the SRKW in the summer months in a region within the Salish Sea. The authors assumed a constant search effort across the months from the whale-watch operators within their study region, based on results from a field study (hauser2006evaluating). Pod-specific core areas were identified and SRKW hotspots were clearly displayed. However, their study region was smaller than ours and did not cover the full extent of our area of interest.

Most recently, OM estimated an effort-corrected map of SRKW summer space use. This expanded on the work of hauser2007summer by estimating the space use across a larger area, as well as by incorporating the heterogeneous search effort directly in their modeling. Regions of 'high' effort-adjusted whale density were identified and clearly presented in detailed plots. To reduce the impact of autocorrelation from the whales' movements on the analysis, they defined 'whale days' as their target metric: a whale day is any day on which SRKWs were reported in a given area, regardless of the number of times they were reported on that day. Their study region was also smaller than ours, and no environmental covariates were used. Estimation of search effort followed previous unpublished research from the Vancouver Aquarium (VanAq, pers comm).

Figure 1: A plot showing our area of interest in green, with the GPS tracklines of the DFO survey effort displayed as black lines. All DFO survey sightings are shown as a red overlay on top of the lines. All sightings from the OM and BCCSN datasets are shown in yellow. Shown are sightings and tracklines from May - October 2009 - 2016.

2.4 Goals of the analysis

We aim to build upon the previous research and establish a model that can estimate a high-resolution, effort-corrected, and temporally-changing SRKW space use across the summer months (May - October). All estimates will be conditioned upon detailed estimates of search effort from the whale-watch companies, combined with GPS tracks from the DFO dataset. Unlike previous attempts, our model will use multiple informative environmental covariates such as sea-surface temperature and various measures of primary productivity (e.g. chlorophyll-A) to improve the accuracy of predicted maps. In addition, we aim to produce a model that is capable of evaluating pod-specific space use.

By turning to statistical/probabilistic modeling, we attempt to account for all sources of uncertainty. Crucially, this includes the uncertainties associated with our estimates of the search effort from the whale-watch vessels. This has not previously been done. Finally, we demonstrate how the methodology allows for the creation of maps that simultaneously display regions of high SRKW intensity for each month, along with their corresponding uncertainties. We display these for the month of May for illustration. Habitat protection plays a major role in the protection plans for any endangered species, and we hope our work can assist with future policy decisions surrounding the protection and management of the SRKW population.

3 Building the general framework

3.1 Log-Gaussian Cox processes as a suitable base model for SDM’s

Spatial point process models are suited for modeling sets of locations in space (illian2008statistical; baddeley2015spatial). In ecology, these are often used to model the presence of a species across a study area, called the domain $\Omega$. A key quantity of spatial point processes is the intensity surface. For each location $s \in \Omega$, it defines the expected number of points per unit area immediately around $s$. In ecological applications, the estimated (normalized) intensity surface is then typically interpreted as representing the spatial distribution of the species under study. Spatio-temporal point processes allow the intensity surface to vary across time too. In this section, we describe the properties of a highly-flexible class of spatio-temporal point processes that are an ideal base model for our modeling framework.

For now, we assume perfect, uniform observer effort. We first describe an inhomogeneous spatio-temporal Poisson process (IPP hereafter) across bounded spatial and temporal domains $\Omega$ and $T$ respectively. For this model, the number of points within any subregion $A \subseteq \Omega$ and time set $B \subseteq T$ is Poisson-distributed with mean $\int_B \int_A \lambda(s,t)\,\mathrm{d}s\,\mathrm{d}t$. Here $\lambda(s,t)$ is known as the intensity surface of the point process at location $s$ and time $t$. Conditioned upon knowing $\lambda(s,t)$, and observing a point pattern (i.e. a collection of locations) $Y = \{(s_i, t_i) : i = 1, \dots, n\}$ within a time set $B$, the likelihood of a spatio-temporal IPP is:

$$\pi(Y \mid \lambda) = \exp\left( |\Omega||B| - \int_B \int_\Omega \lambda(s,t)\,\mathrm{d}s\,\mathrm{d}t \right) \prod_{i=1}^{n} \lambda(s_i, t_i), \qquad (1)$$

where $|\Omega|$ denotes the area of the domain and $|B|$ denotes the length of $B$.

Linking the species' intensity to a set of covariates (e.g. sea surface temperature) is often a key component of any ecological analysis, allowing researchers to investigate species-environment relationships. They can then use these relationships to predict the density of a species through space and time, possibly extrapolating into areas beyond the study area $\Omega$ and beyond the temporal domain $T$. As with many popular regression-based methods (linear models, GLMs, GAMs, etc.), $\lambda(s,t)$ may be a function of covariates at any location $s$ and time $t$. Nonlinear transformations of these covariates, interactions, and splines can all be linked to the intensity, leading to a highly flexible modeling framework. Let $x(s,t)$ denote the set of measured covariates at location $s$ and time $t$. The following log-linear model is a common choice:

$$\log \lambda(s,t) = \beta_0 + x(s,t)^\top \beta. \qquad (2)$$

This IPP may often be inadequate for use in ecological settings (pacifici2017integrating). The Poisson distribution assumed on the counts inside any subregion $A \subseteq \Omega$ implies the variance of the counts is equal to the mean. If the amount of environmental variability not captured by the modeled covariates is high, then the variance of the counts may far exceed the mean. This is referred to as overdispersion. Serious consequences can result from failing to capture additional variability from unmeasured environmental processes and covariates. When this additional variability is spatially correlated and not controlled for, model-based confidence intervals can become overly-narrow and suffer from poor frequentist coverage (baddeley2015spatial). Spurious 'significance' between the associations of covariates and the intensity may then be reported, unless computationally-intensive resampling methods, such as the block bootstrap, are performed (fithian2015bias).

Cox process models extend the IPP by treating the intensity surface as a realisation of a random field (baddeley2015spatial). This increases the flexibility of the point process models above, enabling their variance-mean relationships to be more flexible. The random fields from the Cox process models can be specified to capture spatial, temporal, and/or spatio-temporal correlations, helping to control for any unmeasured covariates and biological processes driving the true species’ intensity (yuan2017point).

A popular class of flexible random fields are Gaussian (Markov) random fields, letting $\lambda(s,t)$ be a realisation of a log-Gaussian process (simpson2016going). These models are called log-Gaussian Cox processes (LGCPs). Specification of a LGCP is achieved by adding Gaussian processes to the linear predictor in (2). Denote a Gaussian process as $Z(s,t)$. A LGCP with likelihood (1) and with $Z(s,t)$ added is defined:

$$\log \lambda(s,t) = \beta_0 + x(s,t)^\top \beta + Z(s,t) \qquad (3)$$
$$Z \sim \mathrm{MVN}(0, \Sigma), \qquad (4)$$

where $\Sigma$ denotes the variance-covariance matrix of the Gaussian process evaluated at the observed locations and times. Different choices of covariance structure lead to Gaussian processes with fundamentally different properties and uses. For example, placing separable Kronecker-product spatio-temporal correlation structures on the Gaussian process will help to capture any spatio-temporal correlations present. R packages such as spatstat and inlabru can fit such models (baddeley2014package; inlabru). We choose the LGCP as the base model for our framework.
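To make this concrete, the sketch below fits a spatial LGCP of the form (3) with inlabru. It is a minimal sketch, not our full analysis: `region` (the study area polygon), `sightings` (an sp points object), and `sst_data` (a gridded sea-surface temperature covariate) are hypothetical placeholder objects, and the priors are illustrative only.

```r
library(INLA)
library(inlabru)

# Triangulated mesh over the study region for the SPDE representation
mesh <- inla.mesh.2d(boundary = region, max.edge = c(5, 25))

# Matern Gaussian random field Z(s) with penalised-complexity priors
matern <- inla.spde2.pcmatern(mesh,
                              prior.range = c(50, 0.5),  # P(range < 50) = 0.5
                              prior.sigma = c(1, 0.5))   # P(sigma > 1)  = 0.5

# log lambda(s) = Intercept + beta * SST(s) + Z(s), as in equation (3)
cmp <- coordinates ~ Intercept(1) +
  sst(sst_data, model = "linear") +
  field(coordinates, model = matern)

# Fit the LGCP; `samplers` restricts integration to the searched region
fit <- lgcp(cmp, data = sightings,
            samplers = region,
            domain = list(coordinates = mesh))
summary(fit)
```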

3.2 Covariates to account for observer effort and detection probability

We now relax the assumption of uniform observer effort. Along with environmental covariates believed to explain the density of the species under study, covariates affecting the detectability of the species, or covariates associated with the amount of search effort expended throughout $\Omega$, can also be included. For example, visibility indices and/or sea state may be included to adjust for changes in detection probability (fithian2015bias). Covariates may also be included to control for heterogeneous search effort (e.g. distance from the nearest road). Under the assumption that these covariates are included in their correct functional forms, this regression adjustment approach may fully capture the heterogeneity in the observer effort and remove its biasing effects (dorazio2014accounting). For applications with strongly informative covariates, such approaches have been shown to significantly improve predictive performance (elith2007predicting; fithian2015bias).

We choose to model the detection ability surface with a log-linear model, linking it to a set of covariates believed to affect only the observers' abilities and not influence the true intensity of the species of interest. We denote such covariates $x_{\mathrm{det}}(s,t)$ and the detection ability surface $\lambda_{\mathrm{det}}(s,t)$. Sea state, visibility indices, observer ID, and vessel type are all plausible examples of covariates that could be included in $x_{\mathrm{det}}$. Next, we model the observer effort surface with a second log-linear model. We again link it to a set of covariates that do not directly impact the true intensity of the species; this time the covariates are believed to explain the heterogeneous observer effort within $\Omega$. An example of such a covariate could be distance from the nearest road. $x_{\mathrm{eff}}(s,t)$ and $\lambda_{\mathrm{eff}}(s,t)$ denote the effort covariates and effort intensity surface respectively. After model-fitting, fixing both sets of covariates equal to a constant allows the effects of variable search effort and detection probability to be removed from predictions throughout $\Omega$.

Thus, our joint model for the observed intensity becomes a product of the true species' intensity, detection probability and observer effort surfaces:

$$\lambda_{\mathrm{obs}}(s,t) = \lambda(s,t)\,\lambda_{\mathrm{det}}(s,t)\,\lambda_{\mathrm{eff}}(s,t) \qquad (5)$$
$$\log \lambda(s,t) = \beta_0 + x(s,t)^\top \beta + Z(s,t) \qquad (6)$$
$$\log \lambda_{\mathrm{det}}(s,t) = \alpha_0 + x_{\mathrm{det}}(s,t)^\top \alpha \qquad (7)$$
$$\log \lambda_{\mathrm{eff}}(s,t) = \gamma_0 + x_{\mathrm{eff}}(s,t)^\top \gamma \qquad (8)$$

Estimation of the non-intercept terms within the detection ($\alpha$), effort ($\gamma$) and environmental ($\beta$) parameters is possible, so long as the corresponding covariates $x$, $x_{\mathrm{det}}$, and $x_{\mathrm{eff}}$ are not linearly dependent and do not interact (dorazio2014accounting). Non-perfect correlation between the sets of covariates, whilst making estimation more difficult, does not affect the identifiability of these parameters (fithian2015bias). The intercept parameters are estimable if either there is independence between $x$, $x_{\mathrm{det}}$ and $x_{\mathrm{eff}}$, or if at least one accurate control dataset is included in the joint model (fithian2015bias). A control could be a survey with known search effort. Estimability implies that the species' true abundance or encounter rate can be estimated by combining a mixture of data sources. Furthermore, observer-specific intercepts (i.e. relative observer efficiencies) may require significant spatial overlap to exist between the search efforts of the different observers to avoid confounding with the latent field $Z(s,t)$.
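For example, with an inlabru fit of such a joint model, removing the detectability and effort effects from predictions amounts to omitting (equivalently, fixing at a reference constant) those components in the prediction formula. A schematic sketch, where `fit`, `pred_pixels`, and the component names (matching the earlier sketch) are hypothetical:

```r
# Posterior prediction of the *true* intensity lambda(s): the detection
# and effort components are simply left out of the predictor, which is
# equivalent to holding x_det and x_eff at a reference constant.
pred <- predict(fit, pred_pixels,
                ~ exp(Intercept + sst + field),  # no det/effort terms
                n.samples = 1000)
```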

3.3 Generalising the base model with the addition of marks

Often, the spatial distributions of multiple point types (e.g. species) and their changes through time are desired (elith2007predicting; fithian2015bias). Furthermore, modeling characteristics of the target species (e.g. group size) may be desired alongside their locations. Spatio-temporal point processes can be further generalized to marked spatio-temporal point processes to allow a greater range of research questions to be tackled (e.g. chakraborty2011point). The main idea of marked point processes is that for each point, we observe attributes in addition to its location and time. These attributes are called marks. Marks might be categorical variables such as whale pod, count variables such as group size, or continuous variables such as body mass.

Marks add dimensions to the process under study. By placing a measure (i.e. a probability distribution) on each of the additional dimensions, called the mark distribution, we obtain the joint distribution of the locations and marks. Formally, we associate a random variable $m_i$ (a mark) to each location and time $(s_i, t_i)$ of the random set $Y$. Let $\mathcal{M}$ denote the support of the distribution of marks. The mark distribution is allowed to depend upon space and time (i.e. depend upon $Y$), but is not allowed to depend on other points in $Y$. Thus, the marks $m_i$ for different $i$ are independent. Now the pair $(Y, m)$ may be viewed as a random variable in the product space $\Omega \times T \times \mathcal{M}$.

There is no limit to the number of marks that can be associated with each point. We simply need to include a probability distribution for each of the marks. The theorems in the Supplementary Material (taken from kingmanc) explain the following property of marked point processes that is especially useful for ecological inference. When the mark distribution is discrete (i.e. when $\mathcal{M} = \{1, \dots, M\}$), as in the case of species identity, we can estimate the probability that a presence at a given location and time has mark $j$. Formally, let $\lambda_j(s,t)$ denote the true intensity for mark category $j \in \mathcal{M}$. The probability that a point at $(s,t)$ has mark $j$ is:

$$P(m = j \mid s, t) = \frac{\lambda_j(s,t)}{\sum_{k \in \mathcal{M}} \lambda_k(s,t)}. \qquad (9)$$

In many cases, computing and plotting estimates of $P(m = j \mid s, t)$, the true mark-specific probabilities, will be the inferential target, as this quantity helps to establish spatial niches specific to a species or species subgroup, whilst removing any observer and detectability biases. When some or all of the parameters and/or covariates are shared between the mark-specific intensities, cancellations will occur in (9), simplifying the computation. The supplementary material discusses the implications of Equation (9) for the Target Group Background approach of elith2007predicting.
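As a worked illustration of equation (9), suppose we hold matrices of posterior intensity samples for each pod on a prediction grid (rows are grid cells, columns are posterior draws). The placeholder matrices below are simulated; in practice they would come from the fitted model.

```r
set.seed(1)
n_cells <- 100
n_samp  <- 1000

# Placeholder posterior draws of each pod-specific intensity
lambda_J <- matrix(rlnorm(n_cells * n_samp), n_cells, n_samp)
lambda_K <- matrix(rlnorm(n_cells * n_samp), n_cells, n_samp)
lambda_L <- matrix(rlnorm(n_cells * n_samp), n_cells, n_samp)

# Equation (9), applied draw-by-draw: P(mark = J | s)
p_J <- lambda_J / (lambda_J + lambda_K + lambda_L)

p_J_mean <- rowMeans(p_J)  # posterior mean pod-J probability per cell
```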

3.4 Raised incidence models for incorporating search effort

In many situations, strongly informative covariates of observer effort may not be available, but search effort may be known or directly estimable (e.g. through emulators or GPS records). In these cases, adapting the raised incidence spatial point process models used in spatial epidemiology provides an appealing alternative approach. This is especially true for mobile species, where individuals may be encountered multiple times in different locations within a study, and where the expected number of encounters scales linearly with the observer effort. The approach is similar to that proposed by merow2016improving and merow2017integrating.

Raised incidence spatial point process models (diggle1990point) were developed for estimating disease intensity surfaces after controlling for a spatially-varying population-at-risk density. The raised incidence model assumes a multiplicative decomposition of the observed intensity surface $\lambda_{\mathrm{obs}}(s)$ into a fixed term $E(s)$ representing natural spatial variation (e.g. population density) and a raised incidence surface $\lambda(s)$. We can adapt this approach for ecological applications with mobile species. By fixing $E(s)$ equal to the known or estimable intensity of search effort across $\Omega$, $\lambda(s)$ becomes the 'per-unit effort' encounter rate across $\Omega$. This is an intuitive target quantity. A flat $\lambda(s)$ now implies that a constant per-unit effort encounter rate exists across $\Omega$, and hence that the target species exhibits complete spatial randomness. Similarly, regions of high $\lambda(s)$ indicate species 'hotspots'.

Suppose we have an estimated search effort intensity surface $\hat{E}(s,t)$. The critical assumption placed on $\hat{E}$ is that doubling its value in a nonempty region $A \subseteq \Omega$ doubles the value of the observed intensity $\lambda_{\mathrm{obs}}$, and hence the expected number of observed sightings in $A$. As before, our goal is to estimate $\lambda(s,t)$. Finally, covariates affecting detectability of the species (e.g. visibility, light conditions, etc.) may act independently of the observer effort. Thus, one may wish to model a detectability surface $\lambda_{\mathrm{det}}(s,t)$ alongside an estimated search effort field $\hat{E}(s,t)$.

In many real-world scenarios, a doubling of the observer effort will not double the expected number of observed sightings. Multiple observers with overlapping fields-of-vision, and knowledge-transfer between observers, are two examples where this assumption may collapse. A failure to properly adjust for violations of the assumption on $\hat{E}$ could lead to biased estimates of $\lambda$. In cases where this assumption is severely violated, including the externally estimated search effort as a covariate within the observer effort surface $\lambda_{\mathrm{eff}}$ may be more appropriate, provided the effects of search effort are allowed to be nonlinear. For example, it may be suitable to allow the effects of estimated search effort to plateau at high values. Alternative approaches could be to remove sightings from observers with known overlapping vision, or to directly model the inter-observer autocorrelation (clare2017pairing).

3.5 The general model framework

We now combine the raised incidence spatio-temporal point process model with our earlier multiplicative LGCP model (5) for use in ecological settings. Let the spatial and temporal domains under study be $\Omega$ and $T$, and let the support of the marks be $\mathcal{M}$. Suppose we have a collection of locations, times, and marks $\{(s_i, t_i, m_i) : i = 1, \dots, n\}$. The general model then becomes:

$$\lambda_{\mathrm{obs}}(s,t,m) = \hat{E}(s,t,m)\,\lambda_{\mathrm{eff}}(s,t,m)\,\lambda_{\mathrm{det}}(s,t,m)\,\lambda(s,t,m). \qquad (10)$$

Typically, we control for heterogeneous observer effort either by using a regression adjustment approach through $\lambda_{\mathrm{eff}}$, or by including known or externally estimable effort as $\hat{E}$. In most cases it will be unwise to do both. When search effort is not known or externally estimable, we simply set $\hat{E} \equiv 1$ and the model reduces to the earlier form (5). $\hat{E}$ may vary across some of the marks $m$ (e.g. species), but be constant across others (e.g. body mass).

Autocorrelation between the sightings of mobile species must be considered, and may be especially concerning for citizen science or opportunistic datasets. If observations of the same individual or group are made too frequently in time, then the observed sighting locations will be autocorrelated due to the movement. A serious consequence of failing to adjust for autocorrelation is an under-estimation of uncertainty in the resulting inference (johnson2013estimating). A simple adjustment is to subset the sightings so that sufficient time is allowed between sightings to let the individuals 'mix' within $\Omega$ (similarly to yuan2017point). An alternative, but more demanding, approach is to model the movement process directly by constructing suitable covariates (johnson2013estimating). This may have the added benefit of improving the statistical efficiency of estimators, if the movement is modeled appropriately.

Under the above general model framework, not only are we able to combine data from different observers, but we can also combine different data types to estimate a single species' intensity surface (koshkina2017integrated). This is achievable because the intensity given by (10) can be linked to the likelihoods of several common data types. For example, the logistic regression likelihood for binary site presence-absence data, the Poisson likelihood for site count data, and the LGCP likelihood for presence-only data can all be derived from the intensity (see hefley2016hierarchical and the supplementary material). Distance sampling models have also been fit using LGCPs (see yuan2017point for details). All that is required for the suitable combination is the effort intensity $\hat{E}$, or an adequate set of strong predictors of search effort $x_{\mathrm{eff}}$. This is a clear demonstration of the unifying potential of the IPP and LGCP for ecological data.

3.6 The computational steps for combining data of different types

Computing the likelihoods of different data types requires approximating the stochastic integral of the intensity function over well-defined areal and temporal units. Thus, it is crucial to know the precise spatial definition of any 'sites' and the definition of the time periods studied. For example, if count data from transects collected in $\Omega$ are to be modeled, then knowledge of the positioning, shape, and size of the transects is required to compute the likelihood (gelfand2019preferential). Combining datasets of differing type is made especially easy in the R package inlabru (inlabru). We provide inlabru code in the supplementary material to demonstrate the ability of our general model framework to seamlessly combine datasets of different types; a condensed sketch follows below. Our chosen example combines a hypothetical distance-sampling survey with a presence-only dataset in which the search effort is estimable. We then provide a short summary of the approximate computation of the LGCP likelihood.
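A condensed version of that idea is sketched below: two likelihoods, one for site counts (a Poisson likelihood with a known sampled-area exposure) and one for presence-only points (inlabru's "cp" Cox process family), share the same intercept, covariate, and random field components. The objects `transects`, `po_points`, `region`, `mesh`, `sst_data`, and `matern` are hypothetical placeholders; the full, runnable example is in the supplementary material.

```r
library(inlabru)

# Shared model components (borrowed strength across both likelihoods)
cmp <- ~ Intercept(1) +
  sst(sst_data, model = "linear") +
  field(coordinates, model = matern)

# Likelihood 1: transect counts, Poisson with sampled area as exposure
lik_counts <- like(family = "poisson",
                   formula = count ~ Intercept + sst + field,
                   data = transects,
                   E = transects$area)

# Likelihood 2: presence-only sightings as a Cox process ("cp")
lik_po <- like(family = "cp",
               formula = coordinates ~ Intercept + sst + field,
               data = po_points,
               samplers = region,
               domain = list(coordinates = mesh))

# Joint fit: the survey data anchor the presence-only data
fit_joint <- bru(cmp, lik_counts, lik_po)
```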

The LGCP likelihood (1) is analytically intractable, as it requires the integral of the intensity surface, which typically cannot be calculated explicitly. However, various methods exist for approximating this integral. We consider the approximation method from simpson2016going, presented here in the spatial-only setting (i.e. $\lambda(s)$) for notational convenience. First, $p$ suitable integration points $\{s_i\}_{i=1}^{p}$ are chosen in $\Omega$ with known corresponding areas $\{\alpha_i\}_{i=1}^{p}$ (see simpson2016going). The first $p$ indices are defined to be the chosen integration points, with the last $n$ indices chosen as the observed locations of the sightings $\{s_i\}_{i=p+1}^{p+n}$. Then, define $\alpha_i = 0$ and $y_i = 1$ for the observed locations ($i > p$), and $y_i = 0$ for the integration points ($i \le p$). Ignoring marks for notational convenience, we define $\eta(s) = \log \lambda(s)$. Assuming $\hat{E}(s) \equiv 1$ for the time being, we obtain:

$$\log \pi(Y \mid \lambda) \approx C + \sum_{i=1}^{p+n} \left\{ y_i\,\eta(s_i) - \alpha_i \exp\left(\eta(s_i)\right) \right\}. \qquad (11)$$

We can see that the stochastic integral is only approximated across the first $p$ integration points, hence the name. The expected count around an integration point scales linearly with the area associated with it. This is under the assumption that, for fixed intensity, doubling the area of a region doubles the expected count of points falling within the region. The problem of evaluating (1) is reduced to a problem similar to evaluating $p+n$ independent Poisson random variables, conditional on $\lambda$, with means $\alpha_i \lambda(s_i)$ and 'observed' values $y_i$. This is a Riemann sum approximation to the integral. In standard software, the natural logarithm of the weights $\alpha_i$ is added as an offset in the model. It turns out that equation (11) can be fit with standard statistical software if one makes the minor modification that $\log(\alpha_i)$ is defined to be zero when $\alpha_i = 0$. This is implemented as standard in the R-INLA package (rue2009approximate; lindgren2011explicit; lindgren2015bayesian).
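The following self-contained toy example illustrates this 'Poisson trick' with standard software, using the closely related Berman-Turner weighted form that plain `glm()` accepts: every point (integration or observed) receives a quadrature weight, the pseudo-response is the sighting indicator divided by the weight, and the weighted Poisson regression recovers the log-linear intensity (2). The data are simulated placeholders on a unit-area domain.

```r
set.seed(2)
n_int <- 400  # integration (dummy) points
n_obs <- 50   # observed sighting locations

x <- c(runif(n_int), runif(n_obs))            # covariate at all points
z <- c(rep(0, n_int), rep(1, n_obs))          # 1 = observed sighting
w <- rep(1 / (n_int + n_obs), n_int + n_obs)  # toy quadrature weights (areas)

# Weighted Poisson regression with pseudo-response z/w; quasipoisson
# suppresses warnings about the non-integer response but gives the
# same point estimates as poisson
fit <- glm(z / w ~ x, family = quasipoisson, weights = w)
coef(fit)  # estimates of beta_0 and beta in equation (2)
```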

Including the estimable effort field $\hat{E}(s)$ in the model simply requires the numerical integration of $\hat{E}(s)$ around the chosen integration points $\{s_i\}_{i=1}^{p}$. We denote the chosen regions around the integration points $\{A_i\}_{i=1}^{p}$. These may correspond to regular lattice cells or, as in our example, irregular Voronoi polygons (Figure 2). The values of these approximate integrals then replace the offset terms $\alpha_i$. Thus, we form new values for the terms $\tilde{\alpha}_i$:

$$\tilde{\alpha}_i = \int_{A_i} \hat{E}(s)\,\mathrm{d}s. \qquad (12)$$

From a computational standpoint, this equation requires no additional work when the $\tilde{\alpha}_i$ are known. This is similar to the offset approach used in chakraborty2011point and merow2017integrating. Often there will be uncertainty surrounding $\hat{E}$, and hence the $\tilde{\alpha}_i$ values.
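For instance, if the effort field is available as predicted vessel positions (one point per 30 seconds, as in Section 4.2), the integral in (12) can be approximated by counting positions in each Voronoi cell and converting to boat-hours. A small sketch, where `cells` and `positions` are hypothetical `sf` objects:

```r
library(sf)

# Number of 30-second vessel positions falling in each Voronoi cell
hits <- lengths(st_intersects(cells, positions))

# Approximate integral of effort over each cell, in boat-hours
alpha_tilde <- hits * (30 / 3600)
```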

Figure 2: The computational mesh on the left and the corresponding dual mesh on the right, formed by constructing Voronoi polygons around the mesh vertices. The Voronoi polygons form the regions $A_i$ around our integration points.

4 Results

4.1 Special considerations required for our motivating problem

Now we apply the general framework to our motivating problem. Our motivating dataset contains several special features that require careful consideration first. Figure 1 shows the raw presence-only sightings reported by the whale-watch operators, along with the presence-absence data recorded from the DFO survey. Almost no spatial overlap exists between the two sources of sightings, due to their disjoint search efforts. Consequently, under our LGCP framework, any intercept term added to $\lambda_{\mathrm{det}}$ for capturing the relative observer efficiencies between the two observer types will not be estimable and will be confounded with the spatial field $Z(s)$. Since both observer types involve similarly-sized vessels, we make the assumption that the efficiencies across the two observer types are identical. Because the whale-watch sightings are not linked to a specific vessel, we further assume that the relative observer efficiencies across the whale-watch vessels are the same.

The pod identities of the sightings can be considered known due to the high degree of expertise onboard the whale-watch boats and the DFO survey vessel. Furthermore, the data are managed by the BCCSN and OM teams, both with their own quality assurance protocols. This simplifies the analysis and eliminates the need for a two-step likelihood, as seen in hanks2011reconciling. Our observable unit is at the pod level, and thus pods will serve as marks in our multi-type LGCP model. Pods are often found swimming together in 'super-pods'. We break up sightings of super-pods into their individual components. For example, if a sighting of super-pod JK is made (i.e. J and K are found together), then we record this as a sighting of pod J and a sighting of pod K, and ignore the potential interaction between the two pods.

These data are severely autocorrelated. Sightings of the same pod are often made in quick succession, and the locations of whale pod sightings are shared between whale-watch operators. In fact, once a pod has been sighted, it is rarely lost by the tour operators for the remainder of the day. Whilst this leads to autocorrelated sightings being made after the initial 'discovery', we can be confident that we have accurate estimates of the locations and times of the initial sightings.

We consider only the first sighting per day of each pod, discarding all repeated sightings made within a day. The fast speed of the whales relative to the size of the study region means an overnight window between sightings is sufficient to remove the autocorrelation between sightings. Although this procedure places an upper bound on the total number of sightings possible, violating the Poisson process assumption, we are confident this does not impact the results: the total number of sightings sampled from the posterior distribution of the fitted model rarely exceeded this upper bound. Note that subsetting the data to the initial sightings has big implications for estimating the search effort. We discuss this next.

4.2 Incorporating the search effort from the DFO data

We discretize time into months. Let the pod be indexed by $p$, the year by $y$, the month by $m$, and the day by $d$.

We use the GPS tracklines of the DFO's search data to estimate the search effort for this dataset (denoted $\hat{E}^{\mathrm{DFO}}$ hereafter). Our GPS data are not regularly spaced, although the resolution is typically around 15 seconds. To predict the vessel's location on a regular 30-second scale, we fit a continuous-time correlated random walk to each unique trip using the crawl package (crawl1; crawl2), and then count the number of predicted points that fall into the regions defined by each integration point. Importantly, for each day and for each pod, we discard all predicted GPS locations that occur after the initial sighting has been made, by either the DFO survey boat or the whale-watch boats. Next, we scale the counts to ensure the units are in hours.
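A sketch of this interpolation step is below. It assumes `trip` is a data frame for one vessel trip with POSIXct times and projected coordinates (the column names are hypothetical); real fits with crawl typically also require an error model and initial parameter values, which we omit here.

```r
library(crawl)

# Continuous-time correlated random walk fit to one GPS trip
cwfit <- crwMLE(mov.model = ~1, data = trip,
                coord = c("x", "y"), Time.name = "time")

# Predict the vessel's location on a regular 30-second grid
pred_times <- seq(min(trip$time), max(trip$time), by = 30)
pred <- crwPredict(cwfit, predTime = pred_times)
```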

Note that we have assumed that at each 30-second interval, the search effort is contained within the integration region that the vessel currently lies in. We did not attempt to estimate a detection function that decays with distance from the vessel, for three reasons. First, we only have the GPS location of the boat, not of the sighting. Second, our regions are coarse in size relative to the maximum possible detection distance. Finally, we estimate the positions of the boat over a very high-resolution time step. Thus, the approximation we take is unlikely to lead to large biases.

4.3 Estimating the whale watch observer effort

Incorporating the search effort from the whale-watching vessels is much more complex. We build a stochastic emulator to model the number of 'boat-hours' spent in a high-resolution grid of pixels by the whale-watch companies for each day, month, and year under study. We then project these estimated 'boat-hours' onto the regions defined by our chosen integration points. We are interested in modeling the changing pod-specific space use across the months, under the assumption that it does not change across the years. Under our assumption that the locations of initial sightings are independent, we add up our estimates of daily search effort across the years to form our monthly search effort. How we establish our estimates of 'boat-hours' for the whale-watch companies is the focus of this section and is discussed at greater length in the supplementary material. Throughout this section, we refer to the whale-watch search effort intensity as $\hat{E}^{\mathrm{WW}}$.

To estimate our search effort for each day and for each pod, we first record the number of hours into the operational day at which the initial discoveries were made. We denote this $h_{p,y,m,d}$. We assume that the daily operational period for the whale-watch companies is 9am - 6pm (seely2017soundwatch), thus $h_{p,y,m,d} \in [0, 9]$. As an example, suppose that on a given day, pods J and K were both sighted at 12pm and pod L was never sighted. Then $h_{p,y,m,d}$ would be recorded as 3 hours for pods J and K, and 9 hours for pod L. Soundwatch reports that whale-watch search effort intensity is nonconstant between 9am and 6pm. Using their reported numbers of vessels with whales by hour of day, we estimate a cumulative distribution function $\Phi(\cdot)$ to account for the changing effort throughout the day. For a given pod, $\Phi(h_{p,y,m,d})$ represents the fraction of total whale-watch effort spent that day prior to the initial sighting of that pod.

Let $h_{p,y,m,d}$ denote the number of hours after 9am at which the first sighting of pod $p$, in year $y$, in month $m$ and on day $d$ occurs. No sighting on a given day, or a sighting made after 6pm in either dataset, both result in $h_{p,y,m,d} = 9$. Under our independence assumptions, the fraction of total WW search effort spent prior to the initial sightings of pod $p$ in a given month/year is:

$$q_{p,y,m} = \frac{1}{D_{y,m}} \sum_{d=1}^{D_{y,m}} \Phi\left(h_{p,y,m,d}\right), \qquad (13)$$

where $D_{y,m}$ denotes the number of operational days in month $m$ of year $y$.
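A toy illustration of this calculation: the hourly vessel counts stand in for Soundwatch's reported effort-by-hour, from which the empirical CDF $\Phi$ is built, and the first-sighting hours for one pod over three days are then converted into the fraction (13). All numbers are invented for illustration.

```r
# Hypothetical vessels-on-the-water counts for each hour, 9am-6pm
hourly_boats <- c(2, 5, 9, 12, 12, 10, 7, 4, 2)

# Empirical CDF of daily effort: Phi(0) = 0 at 9am, Phi(9) = 1 at 6pm
Phi <- approxfun(0:9, c(0, cumsum(hourly_boats)) / sum(hourly_boats))

# First-sighting hours (after 9am) of one pod across three days;
# 9 encodes "not sighted that day"
h <- c(3, 5, 9)

# Fraction of the month's maximum effort spent before first sightings
q <- mean(Phi(h))
```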

Next, we need to estimate the maximum possible number of boat-hours of search effort per day, month and year. We denote it $H^{\max}_{y,m,d}$. We will then multiply the month/year sums by the fraction (13). The result will be an estimate of the search effort associated with the initial sightings. This requires some strong assumptions that are detailed in the supplementary material, including that the spatial distribution of the whale-watch boat search effort is constant throughout the operational day. This implies, amongst other things, that the average distances of the boats from port are constant throughout the day.

Soundwatch reports on the number of active whale-watch ports per year, the maximum number of trips departing each day from each port, the changing number of daily trips across the months, and the duration (in hours) of the trips from each port. We also download wind-speed data and ask various operators for their operational guidelines on cancellations due to poor weather/sea state; we remove days considered 'dangerous' based on these findings. Given the large uncertainties associated with estimating the above quantities, we formulate probability distributions to appropriately express the uncertainty in each of our estimates. These probability distributions form the backbone of our stochastic emulator of $\hat{E}^{\mathrm{WW}}$.

To estimate the spatial distribution of the search effort, we estimate how many boat hours could fall in each of our integration points per month and year. Estimates of maximum travel ranges from each port are obtained, considering land as a barrier. Next, typical vessel routes from the whale-watching companies are established through private communications with the operators, from Soundwatch reports, and by consulting the operators’ flyers and websites. Combining these together, we then formulate plausible effort fields from each port by hand using GIS tools.

We denote $E^{\max}_{y,m}(s)$ the maximum possible search effort intensity for year $y$, month $m$ and location $s$. It is subject to the following constraint:

$$\int_\Omega E^{\max}_{y,m}(s)\,\mathrm{d}s = \sum_{d=1}^{D_{y,m}} H^{\max}_{y,m,d}.$$

Now, under the assumptions laid out in the supplementary material, our estimate of the pod-specific monthly whale-watch search effort surface associated with our initial daily sightings is:

$$\hat{E}^{\mathrm{WW}}_{p,m}(s) = \sum_{y} q_{p,y,m}\, E^{\max}_{y,m}(s). \qquad (14)$$

Finally, given that we have assumed a constant observer efficiency between the whale-watch and DFO survey boats, we may simply sum the two search effort layers together to get the total search effort surface for the two data sources:

$$\hat{E}_{p,m}(s) = \hat{E}^{\mathrm{WW}}_{p,m}(s) + \hat{E}^{\mathrm{DFO}}_{p,m}(s). \qquad (15)$$

Large uncertainties surround our estimates of the whale-watch search effort, and failing to account for these uncertainties could lead to over-confident inference. We therefore produce $K$ Monte Carlo samples of the effort field $\hat{E}^{\mathrm{WW}}$. For each sampled search effort field, we fit the LGCP model and sample once from the posterior distributions of all the parameters and random effects. This new posterior distribution will account for the uncertainty in search effort, so long as $K$ is chosen sufficiently large to adequately capture the shape of the distribution and reduce the Monte Carlo error. We choose $K = 1000$.
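Schematically, the propagation loop looks as follows, where `sample_effort()` (one draw from the effort emulator) and `fit_lgcp()` (refitting the LGCP given that effort field) are hypothetical wrappers, and `generate()` is inlabru's posterior sampler:

```r
K <- 1000
draws <- vector("list", K)

for (k in seq_len(K)) {
  eff_k <- sample_effort()           # one draw from the effort emulator
  fit_k <- fit_lgcp(effort = eff_k)  # refit the LGCP given this effort

  # Keep a single posterior draw of the intensity surface per refit
  draws[[k]] <- generate(fit_k, pred_pixels,
                         ~ exp(Intercept + sst + field),
                         n.samples = 1)
}
```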

4.4 Model selection

We propose and fit several candidate models of increasing complexity. We fit the models using the R-INLA package with the SPDE approach (rue2009approximate; lindgren2011explicit; lindgren2015bayesian; R). All models use the estimated search effort field $\hat{E}$, with no detectability or search effort covariates used (i.e. $\lambda_{\mathrm{det}} \equiv 1$ and $\lambda_{\mathrm{eff}} \equiv 1$). Model candidates start from the simplest complete spatial randomness model, which assumes that, conditioned on search effort, sighting locations for each pod and month arise from a homogeneous Poisson process. They finish with models that include covariates, temporal splines, and Gaussian (Markov) random fields with separable spatio-temporal covariance structures. Note that to avoid excessive computation time, we perform model selection on a single realisation of our search effort field. Once our 'best' model is selected, we propagate the uncertainty associated with the estimation of the search effort through to the results using the Monte Carlo approach outlined earlier.

We explore two spatio-temporal covariates and one spatial covariate: sea-surface temperature (SST), chlorophyll-A (chl-A), and depth. Covariates were downloaded from the ERDDAP database (ERDDAP_data), with the monthly composite sea-surface temperature and chlorophyll-A rasters (Figure 7) extracted from satellite level-3 images from the Moderate Resolution Imaging Spectroradiometer (MODIS) sensor onboard the Aqua satellite (dataset IDs: erdMH1sstdmday and erdMH1chlamday respectively). We compare the two types of hierarchical spacetime centering seen in yuan2017point.

Next, we explore random fields and splines. Including these adds a substantial amount of complexity to the model. To avoid over-fitting the data, we turn to various Bayesian model-selection techniques. Starting with the simplest models without random effects, we iteratively increase the complexity of the model in a stepwise manner and compute the DIC value corresponding to each model. We then use DIC to assess and compare the models. DIC is similar to other information criteria such as AIC, attempting to trade off the goodness-of-fit of the model against a penalty for the model's complexity (spiegelhalter2002bayesian). Thus, simpler models are favoured unless adding complexity substantially improves the model fit. This reduces the risk of over-fitting.

We also conduct posterior predictive checks on the candidate models (gelman1996posterior). For our posterior predictive checking, we assess the ability of the models to accurately estimate the total number of first sightings of each pod, per month. We also assess the models' abilities to suitably capture the spatial trend by comparing the observed number of sightings falling within each integration region with their model-estimated credible intervals.

4.5 The final selected model

The final 'best' model, as judged by DIC and posterior predictive checks, includes a spatial random field shared across the three pods, a spatial field unique to pod L, pod-specific temporal effects (captured by second-order random walk processes), SST, and chl-A. Both covariates were spacetime centered. Depth was omitted, as it was found to be confounded with the other covariates and the spatial field. Details of all models tested are in the supplementary material (see Table 2).

The importance of including a random field unique to pod L implies that pod L exhibits different space use compared with pods J and K. This is in agreement with the results of hauser2007summer. No unique spatial field for pod J or K was found to significantly improve the model's predictive performance. The pod-specific random walks reflect the different times the pods arrive in and leave the area of interest. For example, pod J is found to remain in the area of interest across the months, whereas pods K and L are found to have lower intensity in May relative to September (Figure 8). This is in agreement with ford1996killer (p. 104).

4.6 Displaying the results

Large uncertainties surround our estimates of the SRKW intensity. Creating plots that clearly display both point estimates and uncertainty through space and time for the intensity is a challenge. Commonly, side-by-side maps of posterior mean and posterior standard deviation are presented. However, determining regions of 'high' intensity using such maps is difficult: readers must simultaneously balance the magnitudes of the posterior mean against the posterior standard deviation to establish true hotspots. Exceedance maps help bypass this problem. By using the large number of samples ($K = 1000$) from our model, we are able to compute exceedance probabilities that clearly display both point estimates and uncertainty in one map.

Exceedance maps display the posterior pointwise probabilities that the value of a random surface, evaluated across a regular lattice grid of points, exceeds a chosen threshold. For our application, we are interested in identifying regions of high whale intensity. As such, the maps display, for each month, the posterior pointwise probabilities that the pod-specific intensity at location $s$ lies above a chosen intensity threshold (e.g. the 70th percentile of that pod's intensity across the months). 'Hotspots' are identified by displaying only the points with a posterior pointwise probability above a probability threshold (e.g. 0.95, representing areas where the model predicts with 95% probability that the intensity is in the top 30% of values for that pod). Such maps simultaneously present our point (i.e. 'best') estimates whilst also reflecting the uncertainties surrounding these estimates. For example, regions predicted to have a very high mean intensity but also a large uncertainty (e.g. regions rarely visited, but where a few sightings were made) will no longer appear in the exceedance plots as the threshold is increased.

In practice, we create a regular lattice over $\Omega$ and predict, for each cell, the pointwise probability that the pod-specific intensity for a given month exceeds a chosen upper value. The upper value could be fixed across the months and pods, or be allowed to change across them. Finally, we remove pixels from the plot whose values lie below a chosen probability threshold. Note that these maps are similar to the excursions plots of bolin2015excursion, except that here the probability thresholds are pointwise.
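Computing these maps from posterior draws is direct. In the sketch below, `lambda` is a (cells x samples) matrix of posterior intensity draws for one pod and month, simulated here as a placeholder:

```r
set.seed(3)
lambda <- matrix(rgamma(100 * 1000, 2, 2), 100, 1000)  # placeholder draws

# Exceedance level: 70th percentile of the pod's intensity values
thr <- quantile(lambda, 0.70)

# Pointwise posterior probability that each cell exceeds the level
p_exceed <- rowMeans(lambda > thr)

# Mask cells below the 0.95 probability threshold before plotting
p_masked <- ifelse(p_exceed >= 0.95, p_exceed, NA)
```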

For demonstration, we explore regions that our model confidently predicts to have a high J-pod intensity in the month of May. Initially, we specify our 'high' intensity value to exceed as the 70th percentile of J-pod's intensity found across all the months (computed from the 1000 Monte Carlo samples of J-pod's intensity field). Panel A in Figure 3 shows the posterior probability that the J-pod intensity in May lies in the top 30% of values. The plot shows clear hotspots in J-pod's May intensity in the west of the region and in inshore waters. We repeat the plot, but now colour grey all pixels for which the posterior probability of exceeding the 70th percentile value is below 0.95. This helps to differentiate the regions of interest that we are most confident about (Panel B of Figure 3). If we change the upper exceedance value to be the 70th percentile value for the month of May only, rather than across all months, the regions of interest are larger (Panels C and D of Figure 3).

Figure 3: A series of plots demonstrating the different types of plots possible under our general framework. Panels A and B show the posterior probability that J pod's intensity across the region takes a value in the upper 30% in the month of May. Panel A shows the raw probabilities, while Panel B applies a minimum probability threshold of 0.95. Panels C and D are the same, except that the upper 30% exceedance value is defined separately for each month. Panels E, F, and G show the posterior probabilities that a sighting made at a given location in May contains pods J, K and L respectively. All results are shown for the 'best' model with Monte Carlo search effort error.

Pod-probability maps can identify the core areas associated with each pod and month. For a chosen month and pod, we define the pod's 'core area' to be a region such that, if a sighting is made within it, there is a 'high' probability that the sighting is of that pod. Under our multi-type LGCP framework, because we can fix the search effort, we are able to compute the posterior probabilities that a sighting made at a given location and month contains a specific pod (see equation (11)). For May, we display the posterior probabilities that a sighting made at location s contains pod J, K and L in panels E, F and G of Figure 3 respectively. It is apparent that pod J, known to be found in this region year-round (ford1996killer), is more likely to be seen in May than the other pods.
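Because the search effort and detectability surfaces are shared across the pods, they cancel in the ratio of intensities, so these probabilities can be computed directly from posterior draws of the pod-specific intensity fields. A minimal sketch under that assumption follows; lambda_J, lambda_K and lambda_L are hypothetical (n_cells x n_samples) matrices of intensity draws, and the names are ours.

## Sketch: posterior probability that a sighting at each cell is of pod J,
## assuming the shared effort/detectability surfaces cancel in the ratio.
total  <- lambda_J + lambda_K + lambda_L   # draws of the summed (total SRKW) intensity
prob_J <- rowMeans(lambda_J / total)       # posterior mean pod-J probability, per cell

The intermediate matrix total is also the quantity underlying the 'total intensity' exceedance map described next.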

When using discrete marks, one can sum the intensities across the marks to create maps of 'total intensity'. These maps may be especially useful for conservation purposes. For demonstrative purposes, we present an exceedance map for the sum of the three pods' intensities, representative of the total SRKW intensity. Figure 4 shows the exceedance probabilities for the month of May. We fix the upper value to exceed equal to the 70th percentile of the sum of the three pods' intensities across all months, and set the probability threshold to 0.95. Note that we have assumed that each pod is a single unit of identical size, and have not scaled the pod-specific intensities by their group sizes. Thus, the intensity represents the expected number of pod sightings per unit area.

Figure 4: A plot showing the posterior probability that the sum of the three pods' intensities across the region takes a value in the upper 30% for the month of May. The 30% exceedance value is computed across all the months. Shown are the probabilities of exceedance, with only probabilities greater than 0.95 displayed. Results shown are for the 'best' model, adjusted for Monte Carlo search effort error.

5 Discussion

We have presented a general framework for estimating the spatio-temporal distribution of a species or group, and demonstrated its use by identifying areas frequently used by an endangered species. Using the methodology, data from multiple observers, and data of varying quality and type, may all be combined to jointly estimate the spatio-temporal distribution. Crucially, high quality survey data can be combined with low quality opportunistic data, including presence-only data. Such data fusion can potentially improve the spatial resolution and statistical precision of estimates of the spatio-temporal distribution of the species (fithian2015bias; koshkina2017integrated). However, including presence-only data requires knowledge about the search effort, either directly (e.g. GPS records) or through a set of strong predictors; example predictors include distance from the nearest road and terrain ruggedness. Data types compatible with this framework extend beyond those seen in this motivating example: log-Gaussian Cox processes (LGCPs) have the unifying feature of providing a base model from which the likelihoods of many commonly encountered data types can be derived, including presence-only, presence-absence, site occupancy, and site count data (miller2019recent).

Autocorrelation due to animal movement may be severe when studying mobile species and can bias results. We have proposed a subsetting method for removing this autocorrelation. Such methods require a deep understanding of both the data collection methods and the species' movement, to establish the time required between sightings to ensure the autocorrelation is removed. Secondly, autocorrelation due to residual spatio-temporal correlations in the data may be present. Unmeasured biological processes (e.g. prey availability) often drive the true intensity of the target species and hence the intensity of their observed locations. Such unmeasured processes are typically spatio-temporally smooth, leaving residual spatio-temporal correlations in the data. Unlike IPPs, LGCPs can capture such correlations through the inclusion of random fields in the intensity surface (yuan2017point). This is crucial for ecological applications and is a major reason for choosing the LGCP as our base model (pacifici2017integrating). Indeed, in our case study, the LGCP models showed improved model fit compared with the equivalent IPP models with respect to both DIC and posterior predictive checks, clearly demonstrating the need to include such autocorrelations.

Furthermore, the LGCP was chosen because the Poisson process assumption imposes very strict mean-variance relationships on all likelihoods derived from it (Poisson, Binomial, etc.). In reality, many (if not most) datasets exhibit overdispersion, where the true variance of the data generating mechanism exceeds that captured by the model (chakraborty2011point). A failure to capture this leads to overly narrow confidence intervals and over-confident standard errors being reported. LGCPs, by contrast, are by definition always overdispersed relative to the IPP (baddeley2015spatial). In our case study, credible intervals from our 'best' LGCP model were far wider than those from the equivalent IPP, indicating that additional uncertainty appears to have been missed by the IPP. Note that the LGCP framework presented here can be extended to account for false positives; however, an additional layer of hierarchy is required (miller2011improving; hanks2011reconciling).

We have developed a general methodology for estimating effort-corrected maps of SRKW intensity, and hence space use, across a region surrounding the Salish Sea and beyond, from May to September. Our methodology is a first of its kind for the SRKW, as it fully propagates all known uncertainties in the estimated search effort through to the conclusions. Furthermore, unlike previous analyses, environmental covariates were used to improve predictions, demonstrating a clear relationship between the SRKW and chl-A/SST. Additional covariates such as prey availability and bathymetric gradients would likely improve the predictive power and should be considered in follow-up work. Additional residual spatio-temporal correlations were captured with Gaussian processes, another first for the SRKW; their inclusion reduces the risk of reporting over-confident conclusions. We presented our results with bespoke exceedance maps, to clearly display for each month the regions where we are confident the whale density is high. Our maps were qualitatively consistent with earlier research. For example, in the month of May, pod J was found most likely to be present in the area, in agreement with ford1996killer (p. 104).

While the mathematical theory underpinning the LGCP may appear challenging, these models are widely applicable and increasingly straightforward to fit. Recent developments in spatial point process R packages (R), such as spatstat (baddeley2014package) and inlabru (inlabru), facilitate their computation. Inlabru requires only basic knowledge of R packages such as sp (sp1; sp2), rgeos (bivand2013rgeos), and rgdal (bivand2015rgdalpackage). Pseudo-code is supplied in the supplementary material to show how a dataset with a combination of distance sampling survey data and opportunistic presence-only data could be analysed using this general framework. Joint models are fit and sampled from using only seven function calls, emphasising the applicability of the framework across a wide range of disciplines.

A biology-focused companion paper is underway, using the final model outputs to explore SRKW habitat use and how it varies across pods and summer months in this region. Importantly, it will compare and contrast habitat use based on traditional opportunistic sightings analyses and, for the first time, present relative SRKW habitat use across the entire extent of SRKW critical habitat in Canadian Pacific waters, together with estimates of confidence. Thus, the model development can be considered an important step in planning future SRKW conservation efforts and in highlighting regions of ecological significance.

6 Acknowledgements

We would like to thank the DFO, BCCSN, The Whale Museum and NOAA for access to sightings databases. We thank Jason Wood (SMRU), Jennifer Olson (The Whale Museum) and Taylor Shedd (Soundwatch) for their detailed insight into the operations of the whale-watch industry. Additionally we would like to thank Eagle Wing Whale & Wildlife Tours for their substantial help with developing the effort layer for Victoria. This message of thanks extends to various other whale-watch companies who also assisted with the process of estimating the search effort. Finally, we would like to thank Jim Zidek for his consistent support and lively discussions throughout.

References

7 Supplementary Material

7.1 Additional theory on marked point processes

Start with a Poisson process $\Pi$ on a space $S$ with intensity $\lambda(s)$, and a probability distribution $p(\cdot \mid s)$ on a mark space $M$, depending on $s$, such that for each measurable $B \subseteq M$, $s \mapsto p(B \mid s)$ is a measurable function on $S$. A marking of $\Pi$ is a random subset $\Pi^*$ of $S \times M$ such that the projection onto $S$ is $\Pi$, and such that the conditional distribution of the marks $\{m_i\}$, given the points $\{s_i\}$, makes the $m_i$ independent with respective distributions $p(\cdot \mid s_i)$. We now have the following theorems (adapted from kingmanc):

Theorem 1 (Marking Theorem).

The random subset $\Pi^*$ is a Poisson process on $S \times M$ with mean measure $\mu^*$ defined by:

$$\mu^*(C) = \iint_{(s,m) \in C} \lambda(s)\, p(dm \mid s)\, ds, \qquad C \subseteq S \times M. \qquad (16)$$

Theorem 2 (Mapping Theorem).

If the points $\{s_i\}$ form a Poisson process on $S$, then the marks $\{m_i\}$ form a Poisson process on $M$, and the mean measure is obtained by setting $C = S \times B$ in (16):

$$\mu_M(B) = \int_S \lambda(s)\, p(B \mid s)\, ds, \qquad B \subseteq M. \qquad (17)$$

If the marks take on only $K$ different values, then the theorem specializes for the $j$th mark to:

$$\mu_j = \int_S \lambda(s)\, p(m_j \mid s)\, ds. \qquad (18)$$

7.2 The link between our general framework and the Target Group Background approach

Equation (9) also helps provide insight into the assumptions required for predictions from the Target Group Background approach of elith2007predicting to successfully recover the true intensity. Let our discrete marks denote the species within the Target Group that was detected, taking values $m \in \{1, \dots, K\}$, and suppose index 1 (i.e. $m = 1$) denotes the target species. Assume also that the detectability and observer effort surfaces are identical across the species. Then, if one uses the sightings from the non-target species as a background sample in a logistic regression on the binary indicator variables $\mathbb{1}\{m = 1\}$, one ends up with the following formula for the log of the odds that a sighting made at location $s$ is of the target species. Writing $\lambda^{obs}_j(s)$ for the observed intensity of species $j$, $\lambda_j(s)$ for its true intensity, and assuming a log-linear model $\lambda_j(s) = \exp\{\beta_{0,j} + x(s)^{T}\beta_j\}$:

$$\log\left(\frac{P(m = 1 \mid s)}{1 - P(m = 1 \mid s)}\right) = \log\left(\frac{\lambda^{obs}_1(s)}{\sum_{j=2}^{K} \lambda^{obs}_j(s)}\right) \qquad (19)$$

$$= \log \lambda_1(s) - \log\left(\sum_{j=2}^{K} \lambda_j(s)\right) \qquad (20)$$

$$= \beta_{0,1} + x(s)^{T}\beta_1 - \log\left(\sum_{j=2}^{K} \lambda_j(s)\right) \qquad (21)$$

$$= \beta_{0,1} + x(s)^{T}\beta_1 - \log\left(\sum_{j=2}^{K} \exp\{\beta_{0,j} + x(s)^{T}\beta_j\}\right). \qquad (22)$$

Note that the change from $\lambda^{obs}_j$ to $\lambda_j$ in step (20) is due to the Target Group Background assumption of constant observer effort and detectability biases across the species types, which leads to cancellations in the fraction in (19). Investigating equation (20), it is revealed that the only way the target species intensity can be recovered up to a scale factor is if the second term in equation (20) equals a constant. This requires $\sum_{j=2}^{K} \lambda_j(s)$ to be constant across $s$. This relies either on the species-environment effects cancelling out across the species, or on the assumption that each of the species is spatially distributed completely at random relative to the observer and detectability surfaces. Both of these assumptions are unlikely to hold. Consequently, in most situations, the Target Group Background approach can recover species composition, but not species intensity directly. This was discussed in fithian2015bias.

7.3 Deriving site occurrence and site count likelihoods

Likelihoods for site occurrence and site count data can all be derived from the general framework if the true locations of the target species follow a log-Gaussian Cox process. With our study region $D$, a known sampled region (e.g. a transect) $A \subseteq D$, and a known or estimable observer effort surface $E(s)$ and detectability surface $b(s)$, define the following quantity:

$$\Lambda(A \mid Z) = \int_A \lambda(s)\, E(s)\, b(s)\, ds.$$

This is referred to as the integrated observed intensity function, conditional upon knowing the Gaussian process $Z(s)$. Importantly, it represents the expected number of observed sightings within $A$. Following hefley2016hierarchical, we can then derive the target likelihoods. Firstly, suppose that an observer records the number of sightings made within $A$, denoted $N_A$. Then the distribution of the counts, conditioned upon knowing $Z(s)$, is:

$$N_A \mid Z \sim \text{Poisson}\left(\Lambda(A \mid Z)\right).$$

In practice, the Gaussian process is not known and thus needs to be estimated. Consequently, the above likelihood falls into the category of a spatial generalised linear mixed effects model (SGLMM). Multiple software packages exist to fit such models (e.g. R-INLA). Next, suppose that instead of recording the number of sightings made within $A$, a binary indicator of presence (denoted $y_A$) is recorded. The distribution of this indicator variable can also be derived from the conditional Poisson distribution on the counts. In particular, let $y_A = \mathbb{1}\{N_A > 0\}$, with $\mathbb{1}$ denoting the indicator function. Then the probability statement $P(y_A = 1 \mid Z) = 1 - P(N_A = 0 \mid Z)$ implies the following conditional distribution on the indicator variables:

$$y_A \mid Z \sim \text{Bernoulli}\left(1 - \exp\{-\Lambda(A \mid Z)\}\right).$$

Note that once again the likelihood is of the SGLMM format, which can be computed using standard software packages. Note also that computing the integrated observed intensity function is critical across all the likelihoods.
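As a numerical sketch of how these pieces fit together, the R fragment below evaluates both likelihood contributions for one site, conditional on a single draw of the intensity. All objects (lambda_obs, w, n_A, y_A) are hypothetical stand-ins: the observed intensity at integration points inside $A$, their quadrature weights, and the recorded count and presence indicator.

## Sketch: count and occupancy likelihood terms for a region A, given one
## draw of the observed intensity over integration points inside A.
Lambda_A     <- sum(w * lambda_obs)                    # integrated observed intensity
loglik_count <- dpois(n_A, lambda = Lambda_A, log = TRUE)         # site count term
p_pres       <- 1 - exp(-Lambda_A)                     # P(at least one sighting in A)
loglik_occ   <- dbinom(y_A, size = 1, prob = p_pres, log = TRUE)  # occupancy term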

7.4 Comments on preferential sampling

In the setting of this paper, preferential sampling would be defined as a stochastic dependence between the observer effort and the underlying species intensity. An example would be a setting where observers focused their search effort in areas with high species density, perhaps due to some prior knowledge on their likely locations. The biasing effects of preferential sampling on spatial prediction (diggle2010geostatistical) and on the estimation of the mean intensity in ecological applications (pennino2019accounting) have been shown. In particular, in the example above, spatial predictions in the unsampled regions would be positively biased, as would estimates of the mean intensity.

In many situations, this general framework will suitably adjust inference for heterogeneous search effort across the study region, removing the biasing effects of preferential sampling. In cases where nonzero search effort exists throughout the study region (i.e. where the effort field is strictly positive everywhere), the estimation of the intensity will be unaffected by preferential sampling. However, when a subregion is never visited (i.e. when the effort is zero throughout it), the estimation of the intensity within that subregion may be biased. To highlight this fact, suppose our study region is split into a northern region $A_N$ and a southern region $A_S$. Suppose that the true intensity takes value 2 within $A_N$ and value 1 within $A_S$. If only $A_N$ is visited, then without the availability of strong covariates explaining the differences between $A_N$ and $A_S$, any model will wrongly overestimate the true intensity in $A_S$; namely, the model will predict an intensity of 2 throughout $A_S$.

To minimize the impacts of preferential sampling on any conclusions made using this general framework, extrapolating predictions into unsampled regions should be done with care, especially if it is believed that the intensity of observer effort may depend upon the underlying species’ intensity. This is standard advice in any statistical analysis and is not a limitation unique to this framework.

7.5 More notes on estimating the whale watch search effort

Two strong assumptions are required to allow us to multiply the total search effort field by the fraction of total search effort observed in a given month/year. First, we assume that the expected spatial positions of the boats are constant throughout the time period of interest (9am-6pm). We know that at the start and end of each day the boats will likely be closer to port; we assume, however, that the whale-watch boats are travelling independently in equilibrium, represented by our estimated search effort field.

Second, we assume that the boats are spread out sufficiently throughout the study region, such that the total search effort from all the vessels (assumed equal) is additive. In other words, we assume the whale-watch boats are sufficiently spread out that their observation ranges do not overlap. In reality, the whale-watch vessels often visit similar nature 'hotspots' and hence traverse similar routes. As a consequence, they may travel close together at certain times, and at these times their combined search effort may not scale linearly with the number of boats.

Given that we have chosen months to be our discretization of time, we must estimate the monthly search effort across space, adding up the contributions of effort across the years of interest (2009 - 2016).

We define boat hours to be our unit of search effort, kilometres our unit of distance, and month our unit of time. Thus, $E_j(s, t)$ denotes the number of whale-watch boat hours of search effort per unit area that occurred for pod $j$, at location $s$ and month $t$, summed over all the years of the study. In a similar flavour to the intensity surface, the effort surface is not really defined pointwise, but over regions of non-zero area as an integral. In particular, for a region $A$ and month $t$, we define the total search effort that occurred inside $A$, in boat hours, to be:

$$E_{tot,j}(A, t) = \int_A E_j(s, t)\, ds. \qquad (23)$$

For later computation of the LGCP, we approximate the stochastic integral required for the likelihood over a finite set of integration points. Thus, we are required to compute the integrals of the effort field over the integration points. This is shown in equation (12).
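For intuition, a midpoint-rule version of this computation might look as follows in R. Here cells is a hypothetical data frame of integration points with their cell areas, and effort_density() is a hypothetical function returning the estimated effort field (boat hours per unit area) at those points; neither name comes from any package.

## Sketch: integrating the effort field over integration cells (midpoint rule).
## 'cells' (columns x, y, area) and 'effort_density()' are hypothetical.
cells$effort     <- cells$area * effort_density(cells$x, cells$y, month = 5)
total_effort_May <- sum(cells$effort)   # approximates equation (23) over the region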

7.6 Testing the assumptions on the search effort

One way of assessing the suitability of our estimated search effort, and hence of our model, is to examine the association between the SRKW density and the estimated search effort. This also allows us to investigate the presence of preferential sampling.

We form nonparametric estimates of the intensity function (denoted $\hat\lambda(s)$ at location $s$) for each pod and for each month. Here, we investigate how the observed intensity changes with the estimated search effort. We estimate the trends by implementing a Gaussian kernel smoother, selecting the bandwidth according to Silverman's 'rule of thumb' to let the data 'speak for themselves'. In particular, the kernel smoother attempts to discover the following trend:

$$\mathbb{E}\left[\hat\lambda(s)\right] = f(x(s)).$$

Here $\hat\lambda(s)$ represents our observed intensity of a pod for a chosen month, and we attempt to find a function $f$ of a covariate $x(s)$ that reflects this. In our example, the estimated search effort intensity is our covariate $x(s)$. Observing the shape of the estimated function $\hat f$ plotted against the search effort allows us to assess the plausibility of our estimated search effort.

For example, given that the majority of our search effort comes from the whale-watch operators (between 30 and 50 times more than the DFO data), if the whale-watch operators focus their efforts independently of the SRKW density, then we would expect to observe a straight-line relationship between intensity and search effort. This is because the expected number of sightings per unit area at a location $s$ would be proportional to the number of hours spent looking at $s$. Put differently, $\hat f$ would be estimated to be close to a straight line.

If, in fact, whale-watch operators were focusing their efforts in regions with high SRKW intensity, then we would expect to see more SRKW found in regions estimated to have higher search effort than can be explained by a linear trend alone. Figure 5 shows precisely this trend for each month. The estimated SRKW intensity grows at a superlinear rate as a function of search effort, before plateauing at very high search effort values.
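A minimal version of this diagnostic in R might look as follows, assuming hypothetical, equal-length vectors effort (estimated search effort density at a set of locations) and intensity (the nonparametric intensity estimate at the same locations).

## Sketch: Gaussian kernel smooth of observed intensity against search effort.
## 'effort' and 'intensity' are hypothetical numeric vectors.
bw  <- bw.nrd0(effort)   # Silverman's rule-of-thumb bandwidth
fit <- ksmooth(effort, intensity, kernel = "normal", bandwidth = bw)
plot(fit, type = "l",
     xlab = "search effort density (boat hours per unit area)",
     ylab = "estimated SRKW intensity")
abline(lm(intensity ~ effort), lty = 2)   # linear trend expected absent preferential effort

A markedly superlinear departure from the dashed line is consistent with the preferential behaviour described above.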

In summary then, under our estimates of search effort, the whale-watch companies are found to be preferentially exploring and visiting regions of high SRKW intensity. This matches our beliefs regarding the operations of these companies: whale-watch operators have a vested interest in visiting the regions of highest SRKW density, to increase their chances of detection and hence provide a good service. Thus, this provides additional confidence in our estimates of search effort.

Note that some or all of the shape of $\hat f$ could be a result of the effects of accessibility, covariates and various biological processes on the SRKW intensity. We control for these effects in our LGCP models. Finally, all of the plots appear to show a second repetition of the plateauing shape. This may be due to the DFO and whale-watch search effort fields both exhibiting a similar preferential sampling mechanism, but on vastly different magnitudes of effort.

Figure 5: A plot showing the kernel estimate of the intensity function for pod J (a measure of the expected number of sightings) as a function of search effort, plotted against the search effort density (boat hours per km²) for each month, with 95% pointwise confidence intervals shown. The kernel estimate was made using Silverman's automatic 'rule of thumb' bandwidth selection. Note that the notches represent the values of search effort density at the sighting locations.

7.7 Additional details on the results and additional tables

Environmental covariates were mapped to the integration points and to the sighting locations for modeling. Where covariates were noisy with missing values, we chose the median covariate value (out of those that spatially intersect the Voronoi polygon) as the polygon's 'representative'. For missing covariates at observation locations, we mapped the non-missing value closest in distance to the observation location. Sea-surface temperature (SST) and (log) chlorophyll-A (chl-A) levels were obtained; monthly chl-A and SST were obtained for each year and averaged over the years. Log-transformed covariates were centred to have mean 0 and scaled to have unit variance. Sea-surface temperature was not scaled, for reasons of interpretation.

Next, we performed hierarchical centering of our SST and chl-A covariates, following the advice of yuan2017point, where it was shown that three unique biological insights can be obtained per covariate. In particular, we performed two types of centering: spatial and space-time centering. Centering covariates in this way can also improve the predictive performance of models. The two hierarchical centering schemes, applied to both SST and chl-A, were compared; we refer to these as covariate sets 1 and 2.
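One plausible way to construct such centred covariates in R is sketched below; df is a hypothetical data frame with one row per integration point and month, and the two-way decomposition into monthly average, spatial average, and space-time residual mirrors the three terms reported in Table 1 (this is our illustrative decomposition, not necessarily the exact scheme used).

## Sketch: hierarchical centering of SST into monthly average, spatial average,
## and space-time residual. 'df' (columns cell, month, sst) is hypothetical.
df$sst_month_avg   <- ave(df$sst, df$month)   # average over space, per month
df$sst_spatial_avg <- ave(df$sst, df$cell)    # average over months, per cell
df$sst_st_residual <- df$sst - df$sst_month_avg - df$sst_spatial_avg + mean(df$sst)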

Models including a wide range of different latent effects are compared. A unique (sum-to-zero constrained) second-order random walk for each pod is tested, alongside a spatial and/or spatio-temporal Gaussian Markov field shared across the pods, and a unique spatial field for pod L. For the random walk terms, we shared the precision parameter across the pods and placed INLA's default Gamma prior distribution on this shared precision. Finally, a unique intercept was allowed for each pod. The unique intercepts allow a different global intensity for each pod across the months, whilst the unique random walk terms allow the relative intensity of each pod to change across the months. This choice is based on previous work that found pod J to be the most likely to be present in the Salish Sea year-round (ford2017habitats).

We also fit the models without covariates in the linear predictor, hence with only spatial and spatio-temporal terms included. We also fit the models with the covariates kept in but with the spatial random fields removed; these are inhomogeneous Poisson processes. We do this to show how the variability in the data is apportioned between covariates and random effects, and to investigate whether the spatial distribution of the SRKW intensity (conditioned on the search effort) changes with month, or is spatially static across the months.

For all spatial fields, we place PC priors (fuglstad2018constructing) on the GMRF, with a prior probability of 0.01 that the ranges of the fields are less than 15 km, and a prior probability of 0.1 that the standard deviations of the fields exceed 3. Thus, our prior belief is that the fields are smooth (i.e. the ranges are not too small) and not too large in amplitude (i.e. the standard deviations are not too large). We do this to reduce the risk of over-fitting the data. The PC priors penalize departures from our prior beliefs under the Occam's razor principle, penalizing models with greater complexity than that specified in the prior.
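In R-INLA syntax, this prior specification corresponds to the sketch below, assuming a pre-built mesh object (here called mesh, a name of our choosing) and coordinates measured in kilometres, so that 15 means 15 km.

## Sketch: the PC prior described above, via R-INLA's inla.spde2.pcmatern().
library(INLA)
matern_prior <- inla.spde2.pcmatern(mesh,
  prior.range = c(15, 0.01),   # P(range < 15 km) = 0.01
  prior.sigma = c(3, 0.10))    # P(sd > 3)        = 0.10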

Now we display the table of coefficients from the ‘best’ model, and the table of DIC values of all tested candidate models. Finally, we display our model-estimated number of sightings per pod and per month, with 95% credible intervals. We also display the observed number of sightings to check the model’s calibration.

                   Mean    SD    0.025 Q  0.5 Q   0.975 Q  Change
Pod J             -3.84   0.71   -5.23    -3.83   -2.44    -
Pod K             -4.57   0.71   -5.95    -4.56   -3.20    -
Pod L             -3.95   0.98   -5.81    -3.96   -1.99    -
SST month avg      0.03   0.26   -0.49     0.04    0.54    -
SST spatial avg   -0.37   0.17   -0.70    -0.37   -0.05    -
chl-A month avg    0.31   1.11   -1.89     0.26    2.58    -
chl-A spatial avg -1.03   0.32   -1.67    -1.02   -0.38    -
SST ST residual   -0.67   0.05   -0.77    -0.67   -0.57    -
chl-A ST residual -0.23   0.07   -0.38    -0.23   -0.09    -
Table 1: A table of posterior estimates of the fixed effects, with their 95% posterior credible intervals, for the final Model 8 repeatedly fit with 1000 Monte Carlo estimated search effort fields. The symbol * denotes 'significance', such that the 95% credible interval does not cover 0 (no effect); for the pods, it represents that no difference was found between the relative pod intensities with respect to their 95% credible intervals. The 'Change' column displays the change in 'significance' of the effect size compared with the results from Model 8 without the additional MC error from the search effort; the '-' symbol denotes no change in significance. None of the directions, and hence none of the qualitative conclusions, of the effect estimates change.
Model   DIC    ΔDIC   Covariate Set   Shared Field      Field for L
0       3614   5554
1       2843   4783   1
2       2707   4642   2
3      -1633    307                   Spatial
4      -1730    210                   Spatio-temporal
5      -1842     98   1               Spatial
6      -1851     89   2               Spatial
7      -1931      9   1               Spatial           Spatial
8      -1940      0   2               Spatial           Spatial
9         NA     NA   1               Spatio-temporal
10        NA     NA   2               Spatio-temporal
11        NA     NA   1               Spatio-temporal   Spatial
12        NA     NA   2               Spatio-temporal   Spatial
Table 2: A table showing the DIC values of all the models tested, with the model formulations summarised in the columns. ΔDIC is the difference in DIC relative to the best model (Model 8). A value of NA implies that model convergence issues occurred.
Figure 6: A plot showing the total observed number of sightings made per month, with posterior 95% credible intervals shown. Results shown are for Model 8 with MC search effort error. The posterior predictions are made given a search effort identical to that estimated for the observed data. Also shown are horizontal lines indicating the maximum possible number of sightings in months with 30 and 31 days respectively. Posterior credible intervals extending above this upper bound would imply that the Poisson model is severely misspecified.

7.8 Pseudo-code for computing the general framework in inlabru

Fitting log-Gaussian Cox process models is greatly simplified by the R package inlabru (inlabru). Furthermore, inlabru fits joint models containing many (possibly different) likelihoods, and is able to share parameters and latent effects between them with ease. Numerous other features exist, and helper functions are provided to produce publication-quality plots. For full information and free tutorials, visit the inlabru website at inlabru.org. The following pseudo-code is largely based on the available tutorials.

In this section we demonstrate the simplicity of fitting a joint model, using inlabru's syntax, to a dataset comprised of a distance sampling survey and a presence-only dataset with a corresponding search effort field. For simplicity, suppose we have one continuous environmental covariate, called covar1 (e.g. SST), stored as an object of class 'SpatialGridDataFrame' or 'SpatialPixelsDataFrame'. Next, suppose we have an estimate of the natural logarithm of the search effort for the presence-only data, called logeffort_po, also of class 'SpatialGridDataFrame' or 'SpatialPixelsDataFrame'. We assume that the effort took values strictly greater than 0 everywhere before taking the logarithm (or that we added a small constant to enforce this).

Suppose we have the observed sighting locations of the species of interest as two separate objects of class 'SpatialPointsDataFrame', one for the survey sightings and one for the presence-only sightings. Call these surv_points and po_points, and suppose we have thinned the data to ensure any autocorrelation has been removed. Suppose also that we have the transect lines from the survey as an object called surveylines of class 'SpatialLinesDataFrame', and that we know the transect strip half-width (denoted W, as used in the code below). Finally, suppose that our spatial domain of interest is described by an object called boundary of class 'SpatialPolygonsDataFrame'. All 'Spatial' objects from the sp package (sp1; sp2) must be in the same coordinate reference system.

Suppose we wish to estimate a half-normal detection probability function, as a function of distance. To program this in inlabru, we must first define the half-normal detection probability function in R (R). Let 'logsigma' denote the natural logarithm of the standard deviation, and 'distance' the perpendicular distance from the transect to the observed point. Then our function is:

halfnorm <- function(distance, logsigma){
  # Half-normal detection probability at perpendicular distance 'distance'
  exp(-0.5 * (distance / exp(logsigma))^2)
}

Next, given a well-constructed Delaunay triangulation mesh called 'mesh', we construct the spatial random field for the LGCP. Helper functions exist in inlabru for creating appropriate meshes. The code for creating the spatial field, with a Matern covariance structure, is:

matern <- inla.spde2.pcmatern(mesh,
  prior.sigma = c(upper_sigma, prior_probs),   # P(sd > upper_sigma) = prior_probs
  prior.range = c(lower_range, prior_probr))   # P(range < lower_range) = prior_probr

Here, upper_sigma, prior_probs, lower_range and prior_probr define the parameters of the PC prior on the random field (fuglstad2018constructing). Once again, the tutorials assist with the choice of prior. Now we define all the parameters and terms in the model that must be estimated:

mod_components <- ~ mySpatialField(map = coordinates, model = matern) +
  beta.covar1(map = covar1, model = "linear") +
  po_search_effort(map = logeffort_po, model = "linear",
                   mean.linear = 1, prec.linear = 1e20) +  # effectively fixed at 1
  logsigma + Intercept_Survey + Intercept_PO

Note here that we choose the prior mean and precision of the 'po_search_effort' term to force it to enter the model as an offset. Now we can create the likelihood objects for both data types, each with its own formula, but sharing components.

lik_surv <- like("cp",
             formula = coordinates ~ Intercept_Survey + mySpatialField +
                       beta.covar1 + log(halfnorm(distance, logsigma)) +
                       log(1/W),
             data = surv_points,
             components = mod_components,
             samplers = surveylines,
             domain = list(coordinates = mesh))
lik_po <- like("cp",
             formula = coordinates ~ Intercept_PO + mySpatialField +
                       beta.covar1 + po_search_effort,
             data = po_points,
             components = mod_components,
             samplers = boundary,
             domain = list(coordinates = mesh))

And then we can fit the joint model and simulate samples of all of the parameters and latent effects from the posterior distribution.

fit_joint <- bru(mod_components, lik_surv, lik_po)       # fit the joint model
posterior_samples <- generate(fit_joint, n.samples = M)  # draw M posterior samples

Note that once the model object (fit_joint) is created, the estimated field can easily be plotted, and predictions can easily be made on new datasets and at new locations. Stochastic integration of the field to estimate abundance (for suitable datasets) is also possible using inlabru helper functions; details can be found in the inlabru tutorials. The above code scales up to include multiple environmental covariates (including categorical predictors), spatio-temporal fields, and/or temporal effects. Likelihoods of differing types (e.g. Bernoulli, Poisson, Gaussian, etc.) can all be included, a feature that becomes especially useful when the joint estimation of data of differing types is desired.

7.9 Additional figures

7.9.1 Covariate plots

Figure 7: Plots showing the average monthly sea-surface temperatures (top 6) and the natural logarithm of chlorophyll-A concentrations (bottom 6). The averages have been taken over the years 2009-2016.

7.9.2 Plot of the pod-specific random walk effects

Figure 8: A plot showing the posterior mean and posterior 95% credible intervals of the pod-specific (sum-to-zero constrained) random walk monthly effect from the ‘best’ model with Monte Carlo search effort error included.

7.9.3 Plot of model standard deviation

Figure 9: A plot showing the posterior standard deviation of the sum of the SRKW intensities for the three pods, for the month of May. The qualitative behaviour is almost identical across all pods and across all months, so we omit them. Results shown are for Model 8 with MC search effort error.