1.1 Background and motivation
Robust measures of biodiversity change are vital for monitoring the varying state of species’ populations and evaluating progress of conservation actions, for example towards national and international targets (butchart2010global)
. Data from standardized, long-running monitoring schemes are used to produce estimates of species’ status and trends, particularly in terms of changes in abundance. However such data sources are limited taxonomically and geographically. By their nature of intensive, formal sampling they may be limited in spatial coverage and therefore cannot always be used to appropriately measure changes in species’ distributions over time.
Conversely, opportunistic records of occurrence are often and increasingly available in large quantities for extensive geographic areas and time periods, and for a wide variety of taxa. However opportunistic data are inherently biased (isaac2015bias). Data are typically presence-only, where records only indicate where and when a species is seen rather than including information on non-detection, unless complete lists are recorded. Data recording the distribution of animals and plants are frequently analysed using occupancy models (mackenzie2018occupancy), as they allow for imperfect detection. Applying such models to presence-only data requires non-detections to be inferred from the observations of other species (kery2010site). Data of this nature are not standardised, and result from the submission and collation of records by citizen scientists who choose where, when and what to record, but are often available in large quantities. For example the Global Biodiversity Information Facility (GBIF) consists of more than 1.6 billion occurrence records for at least one million species (www.gbif.org). In the United Kingdom, extensive occurrence data are available for many taxonomic groups, and the Biological Recording Centre oversees more than 80 recording schemes (brc.ac.uk, pocock2015biological). Such data are commonly used to produce atlases for various taxa (e.g. blockeel2014atlas, randle2019atlas) and contribute to national biodiversity assessments, for example the State of Nature report (hayhow2019state) and government biodiversity indicators (defra2020).
Efficient model-fitting is increasingly important with the ongoing growth in the volume of biological recording data, partly due to increasing participation through new technologies and platforms for data submission (august2015emerging). Fast inference is also motivated by the increasing desire to update species’ trend estimates frequently, in order to inform the measuring and reporting of biodiversity change. The new model of this paper responds to the need for a computationally efficient approach to analyse presence-absence data arising from a large number of sites, whilst accounting for spatial autocorrelation and which accommodates species with sparse records. The key of our approach to achieve speed and efficiency is to cast the detection and occupancy process as multivariate logistic regressions, which enables us to rely on a well established literature for posterior inference in multivariate regression models. Moreover, we model the autocorrelation in spatial and temporal effects using the GP prior, which is a well known framework for which specific approximations are available if a large number of points is considered.
1.2 Current models
One popular form of model describes dynamic occupancy; see for example royle2008hierarchical and Strien2013. This model is designed for data from several years and incorporates parameters representing colonisation and extinction. It is therefore mechanistic, with parameters which may assist in the understanding of spatial and temporal changes in distribution. The basic model may be extended, for example to allow temporal development to depend upon the status of neighbouring sites; see Broms2016.
Typically, these informative, complex models are designed for relatively short studies with small numbers of sites. Both Bayesian and classical inference methods have been used, in the latter case using unmarked in R (unmarked); Bayesian inference is discussed in kery2021AHM2. However for large numbers of sites and occasions computing times can be excessive (see Strien2013), and two other approaches are in current use in these cases.
The first, due to outhwaite2018prior, uses a random walk to describe the changes in a temporal effect of a logistic model; see chandler2011 and link2005. The random-walk model is fitted using Bayesian inference, and is implemented in the Sparta package, available from
This model has been extensively used to describe distribution data for large numbers of species. In particular, long-term measures of occupancy change were made for typically less-well studied taxa in the UK, including invertebrates, bryophytes and lichens (outhwaite2019annual; Outhwaite2020) and pollinators including wild bee and hoverfly species (Powney2019). For such wide-scale data sets the model can be extremely slow to fit, even when powerful clusters of computers are employed. For example, outhwaite2019annual comment that analyses of “large data sets with many species took several weeks”, despite using a large supercomputer. This issue is non-trivial, as it severely limits the wide-scale use of the method, and hinders model checks such as goodness of fit and prior sensitivity, and model comparison. It might also constrain the numbers of simulations run in MCMC. For example outhwaite2019annual reported that it was not possible to run all models to complete convergence due to time constraints.
The second approach is to use static models, in which a simple occupancy model is fitted to the distribution data for each year separately and the site occupancy probability for each year is described by a logistic function of site-specific covariates. This approach was proposed byDennis:2017, and is fitted using unmarked (unmarked) and classical inference. By contrast with dynamic and random-walk models, the static model is appreciably faster in execution, as the data from each new year of study can be fitted without reference to data from earlier years. The static model has been used for example in randle2019atlas and fox2021state.
However, a drawback of analysing the data from each year separately arises regarding records from early years, which may not be sufficiently numerous to allow the fitting of a static model in those cases. Similarly producing occupancy trends for rare or less well-recorded species may not be possible using the static model. As discussed in Dennis:2017, one could in this case fit the same “static” model to the data from all years, though that would increase the computational load.
1.3 Paper Outline
The new model of the paper is presented in Section 2. Section 3 describes inference, including the Pólya-Gamma scheme for model fitting. Section 4 evaluates the model through simulation, including comparison with the method of outhwaite2018prior implemented in Sparta. Section 5 applies the new model to two illustrative data sets on UK butterflies. Section 6 discusses possible extensions and the paper ends with Discussion in Section 7. Technical MCMC details are covered in the Appendix and additional results are provided in Supplementary materials.
2 Model: Bayesian framework and Gaussian processes
The proposed model builds on the existing approaches of outhwaite2018prior and rushing2019 and extends them by modelling autocorrelation across both time and space using the GP framework. The advantage of using a GP is that particular computational techniques are available to perform joint inference and adopt specific approximations suited for our case, which are well developed in the literature (quinonero2005unifying). Additionally, since we cast the occupancy and detection process within a logistic regression framework, we can take advantage of the efficient Pólya-Gamma augmentation scheme (polson2013bayesian) for inference, which is well-established in the Bayesian literature (linderman2015dependent; holsclaw2017bayesian) but not in the ecological modelling literature, with some recent exceptions (clark2019efficient; griffin2020). We have implemented our model in the package FastOccupancy, available on Github
In this section we introduce a slightly different notation from that generally employed for existing occupancy models, since it enables us to express the data naturally within a regression framework. For any species, observations are collected at sites and across years. A number of observations may be collected at each site and year. This number, which does not need to be defined for the purposes of the model, does not have to be the same for all sites or years and can be equal to 0 for particular pairs of sites and years. We refer to the unique pairs of sites and years with at least one observation as sampling units and we index them by . If all sites are sampled in all years, then . We assume that the occupancy status of a site can change between years but not within, which is a standard assumption of similar models for multi-season occupancy data.
We introduce latent variables , indicating the occupancy status of sampling units, with if sampling unit is occupied and otherwise. We assume that each sampling unit is occupied with probability , that is, . We index the site and year of sampling unit by and , respectively. Finally, we denote by the location of site and by the time point of year . For example, if the data were collected in years , , and , and if sampling unit belongs to years , respectively.
We denote by the total number of observations across all sampling units and we define , , to be the outcome of observation , that is, if the species is detected at observation , and otherwise. Finally, we introduce , which indexes the sampling unit of observation so that if observation corresponds to sampling unit , then . Therefore, if sampling unit is occupied then and otherwise . We account for the probability of a false negative observation but assume that false positive observations do not occur and hence assume that with the probability of detecting the species given presence.
We model the probability that sampling unit is occupied,
, on the logit scale and as a function of both fixed effects, such as covariates, and random effects (r.e.), and specifically r.e. that account for correlation between years, spatial autocorrelation between sites and individual variation of sites:
where is an intercept, is a r.e. for year , and are r.e. for site , and is the set of covariates for sampling unit . The site-specific random effects
are modelled as independent random variables, while the rest of the r.e. are defined below using GPs.
2.1 Gaussian processes
To define a distribution for the r.e. and , we introduce the concept of Gaussian processes (williams1996gaussian). We say that a function has a GP prior distribution if, for every combination of values , it holds that , where , , . Parameter tunes the overall variability of the GP, while parameter tunes the correlation between points. The points are called support points
. Although, in general, the GP is defined for a function with an infinite number of support points, in our case, we apply the GP on a function defined on a finite number of points, as we explain below, and hence this is simply equivalent to assuming a multivariate normal distribution on.
To account for correlation between years, we assume that the vector of year-specific r.e. (YRE)is distributed according to a GP with parameters and support points , which corresponds to assuming that ), with defined above. Similarly, the autocorrelated site-specific r.e. are distributed according to a GP with parameters and support points the locations of the sites, which corresponds to assuming that )
Finally, we model the probability of detection as
where is a year-specific r.e., on which we assume independent normal prior distributions, is the set of covariates for observation , and we note that according to the notation introduced above is the index of the year in which observation is collected.
The model used in outhwaite2018prior is a special case of our model when the autocorrelated spatial random effect is absent. Moreover, outhwaite2018prior use a random walk prior on of the form and . To show the differences in the prior specification between the random walk and the GP, we show the prior covariance matrices of in the two cases for . Assuming for simplicity we obtain
where is the covariance matrix using the GP formulation and
is the covariance matrix in the random walk case. We can see that the random walk prior has two undesirable properties. The first is that increasing variance is assigned to later years, since. The second is that the correlation between consecutive years is not constant across time, since
and hence the random walk prior does not have the stationarity property . Instead, as can be seen in , our GP formulation overcomes both of these issues.
2.2 Hierarchical structure
The following hierarchical structure completes the definition of our model, including the prior distributions of all parameters.
3.1 Pólya-Gamma scheme
We base our inference on the Pólya-Gamma (PG) scheme for logistic regression models (polson2013bayesian), which we briefly describe here. A random variable has a PG distribution, if , where . According to the PG scheme, given a set of observations , where , a Gibbs sampler scheme for is available by introducing a set of random variables , such that . More specifically, assuming prior distribution , the full conditional distributions used for the Gibbs sampler are:
where and . polson2013bayesian describe an efficient algorithm to sample a PG r.v. that does not require truncating the infinite sum in the definition of the PG distribution. Therefore, the PG sampling scheme enables us to employ a Gibbs sampling approach within a logistic regression framework, which is orders of magnitude more efficient than standard Metropolis-Hastings approaches.
3.2 Spatial approximation
Sampling at each iteration from the posterior distribution of the coefficients requires a computational complexity of at least because of the presence of the autocorrelated spatial random effects. When there is a large number of sites, such as in the case studies of this paper ( sites) it is computationally unfeasible to perform this operation at each iteration. Hence, we approximate the initial GP on locations by introducing another GP computed on a smaller number of support points , where , with respective values , and approximate each original value with the value of its closest support point . This approximation is known as the Subset of Data (SoD) approximation (quinonero2005unifying). In addition to gains in computational speed, if the points are chosen on a uniform grid, which divides the study area into squares of equal size, the matrix is less ill-conditioned, since site locations in near proximity tend to drastically decrease the conditioning number of the covariance matrix of a GP (rasmussen2006gaussian). When using this spatial approximation, each site spatial effect is approximated by the effect of the square in which it belongs.
4 Simulation studies
4.1 Computational time
Model performance is compared with that of the package Sparta. We have simulated data sets with number of years, with the number of observations per sampling unit , varying number of sites and we have compared the computational time required to run iterations with each package. The values chosen for the model parameters are given in the Appendix. The model formulation in Sparta does not include the autocorrelated spatial random effects, and hence we did not consider models with spatial autocorrelation in this simulation study. Results are presented in Table 1.
|Number of sites||FastOccupancy||Sparta|
As can be seen in Table 1, for sites, the algorithm implemented in Sparta is around 3.4 times slower than our algorithm. Computation time using our model increases roughly linearly with the number of sites, whereas in Sparta, the increase is steeper, and so for example when we consider sites, computation time with Sparta is almost 8 times longer than with FastOccupancy. To compare the quality of inference, we have compared the occupancy index estimated by the true model against the true occupancy index and the two models give very similar results. The comparison is reported in Table of the Supplementary material. Our results demonstrate that computational gains will be considerable for large-scale citizen science studies with tens or hundreds of thousands of sites and longer time series, with FastOccupancy making it possible to analyse very large occupancy data sets in reasonable time.
4.2 Spatial comparison
We performed an additional simulation study to demonstrate the importance of accounting for spatial autocorrelation when it is present in the data. We simulated data for sites in a sparse setting where each site is not visited every year and fitted two models: a model that accounts for spatial autocorrelation and individual r.e. and a model that only accounts for individual r.e.
Figure 1 presents the results of three different simulations, with different degrees of autocorrelation. The complete simulation settings are given in the Appendix. The true spatial autocorrelation is shown in the first column, whereas the estimates obtained from the models with and without spatial autocorrelation structure are shown in the second and third column, respectively. As can be seen in Figure 1, the first model is able to identify the areas where the spatial effect is higher/lower and provides a smooth representation of the spatial variation across sites. On the other hand, the second model results in a very patchy spatial pattern that does not identify areas of high spatial effects. Moreover, in Table of the Supplementary material we present the coverage of the estimated occupancy probability in each site according to the two models, where it can be seen that the model without autocorrelation has lower coverage than the full model ( for the model without autocorrelation and for the model with autocorrelation).
5 Case studies
We applied our new model to data for two UK butterfly species: Ringlet Aphantopus hyperantus and Duke of Burgundy Hamearis lucina. In doing so we demonstrate the performance of the new model for both a common, widespread species (Ringlet) and a rare, range-restricted species (Duke of Burgundy).
Butterfly data were collated through the Butterflies for the New Millennium (BNM) recording scheme run by Butterfly Conservation, using records collected between and , during which the database consisted of over 11 million records of UK butterflies (fox2015state). BNM data were restricted to records with an exact date and location of 1 km resolution or finer. For each of the two species, records were then filtered to months within which records of the focal species had been recorded, and observations of other species used to form detection histories (kery2010site). Thus for Ringlet, the data set featured 2 million records from unique 1 km squares (defined as sites), of which Ringlet had been recorded at sites from detections. Conversely the data set for Duke of Burgundy consisted of approximately 1.5 million records from sites (1 km squares), of which Duke of Burgundy had been recorded at sites from detections. On a machine equipped with an Intel Core firstname.lastname@example.orgGhz with GB of RAM, the model took hours to run on each dataset for burn-in iterations and iterations.
For both species, we considered the interactions between year and easting and between year and northing as covariates for occupancy probability, and for the detection probability we considered as covariates the relative list length and the first three powers of the Julian date. The relative list length is obtained by dividing the list length, which is the number of species recorded for a given site/date (szabo2010regional; Strien2013), by the maximum recorded list length in a neighbouring area of km. All covariates were standardised to have zero mean and unit variance. We do not consider the main effects for year or easting/northing, since the effects of year and space on the probability of occupancy are already accounted for in the processes and and therefore adding these main effects would lead to an overparameterised model. Finally, we employ the spatial approximation defined in Section 3.2 with squares of 20 km width. We report the results obtained using km wide squares in the Supplementary material, where it is seen that the estimates are robust to the choice of the approximation step size.
For each species, we calculate the yearly occupancy index (Dennis:2017) at each MCMC iteration using , where is the occupancy probability at site and year for iteration . Posterior summaries of the occupancy index for both species are shown in Figure 2, and support previous findings suggesting that Ringlet has increased in occurrence since 1970, whereas Duke of Burgundy has seen a reduction in occurrence (fox2015state). The indices for both species show increasing precision with time, reflecting an increase in underlying recording effort (Dennis:2017), which is also a feature for other taxa (isaac2015bias).
Top row: 95 % posterior credible interval (PCI) of the occupancy index for (a) Ringlet and (b) Duke of Burgundy. Bottom row: 95% PCIs of the year-specific detection probabilities at the average value of the list length covariate for (c) Ringlet and (d) Duke of Burgundy. The dots represent the posterior medians. Note that different scales are used for the two species.
The estimated occupancy probabilities for the two species are mapped over space for selected years in Figure 3. Note that the map for the Duke of Burgundy has been zoomed in to the part of the country where the species can be found, due to its restricted range. These patterns are consistent with what is known, namely that Ringlet has been expanding in range and now occupies most of the UK, with the exception of Northern Scotland and a small portion of northern England, whereas Duke of Burgundy has been contracting in range and can now only be found at a very small number of locations. Figure
of the Supplementary material presents maps of posterior standard deviations.
Ringlet has been shown to have increased in both range and abundance (fox2015state), which is a likely response to recent climate change (mason2015geographical). Duke of Burgundy is one of the UK’s most threatened species (fox2011new), with long-term declines in both abundance and distribution (fox2015state), but as seen in Figure 2 the decline in occurrence appears to have stabilised in more recent years, which may be due to conservation efforts (ellis2012landscape).
Relative list length has a positive effect on detection probability with 95% PCI and for Ringlet and Duke of Burgundy, respectively. The PCIs of the year-specific detection probabilities are shown in Figure 2. Interestingly, detection probabilities for Ringlet appear relatively stable over time, whereas estimated detection probabilities for Duke of Burgundy may have increased slightly, possibly due to increases in recorder effort to observe this rare, but also diminutive, species. In Figure 4 we show the posterior summaries of detection probability at each time of the year, , for both species, where it can be seen that the detection probability is extremely low outside the summer months. However, it is important to consider that in our model we assume that occupancy status of sites does not change during a year, even though butterflies obviously do not fly throughout the year. Therefore, the probability of detection at time in our model can be interpreted instead as the product , where is the probability of butterflies of the species flying at time and is the probability of detecting at least one butterfly of that species, conditional on the species flying at time , with the latter usually considered as the detection probability.
Convergence has been checked by monitoring traceplots from single chains, which we have reported in the Supplementary material.
5.1 Goodness of fit
The resulting goodness of fit plots for both data sets are shown in Figures 5 and 6. Figure 5 shows that the model properly accounts for the variation across years for both species. It is worth noting that we also ran the model with a constant detection probability across years and the PCIs of did not always contain the true values, suggesting that the fit of the model is not as good in that case. We show plots of the goodness of fit for the model with constant detection probability in Figure of the Supplementary material. The lack of fit of is likely a suggestion that detection probability exhibits variation across space as well as time. However as commented earlier, we do not model variation of the detection probability across space since we already model spatial variation of the occupancy probability, and modelling the spatial variation of both quantities could lead to unidentifiability issues between the two. We note that using list length instead of relative list length causes a bias in the goodness of fit and leads to the the number of detections in the north being consistently underestimated. The cause of the bias is that since fewer butterfly species inhabit the North of the UK, observers in the North are penalized with respect to the ones in the South as it is more difficult further north to detect a large number of species, and hence their capabilities are underestimated compared to observers in the South.
6 Potential extensions
We model temporal and spatial r.e. as additive independent effects, as shown in Eq. (1). To allow for interaction between time and space a possible approach would be to define a GP prior jointly over time and space in the following way. Formally, we can introduce r.e., , where is the r.e. for year and site , and assume a GP prior distribution with support points , such that , where depends on the distance between the time-space points and . We note that our additive modelling approach arises if . We do not make this choice in our application since we model an extremely large number of sites and this would require us to infer coefficients, but in the case of a smaller number of sites, or if a further approximation is used, this type of modelling approach can be used to obtain a joint model over space and time.
In addition to the optimal mixing properties, another advantage of the PG scheme is that it allows efficient variable selection, as performed in griffin2020
, since the logistic regression equations for the detection and presence processes can be cast in the linear regression framework using the PG augmentation. Although not considered in this paper, which aims to introduce the new modelling framework, we note that this approach can be used to perform variable selection on the occupancy and detection probabilities if a number of covariates are available as potential predictors for either of the two processes.
Finally, we note that the PG scheme is easily parallelisable with respect to the variables in (3), which would bring further computational advantages for large data sets.
The new model improves upon Sparta, which is a special case, in several ways: it is appreciably faster, making occupancy modelling for large data sets feasible for scientists lacking access to high speed computing. As shown in Section 2.1, the new model has a better serial covariance structure, compared with Sparta. As we see from the Supplementary material simulation comparison between the two models, this feature is apparently relatively unimportant when there is a substantial amount of data. However it would be relevant for smaller data sets. The new model incorporates both time and space, and the results for the two case studies are in accord with what is known for the species involved. The spatial maps of Figure 3 possess the same attractive feature as the spatial maps of Dennis:2017, in demonstrating how distribution of species changes over time. We have illustrated how goodness-of-fit can be routinely studied. It was interesting to note the differences between the magnitudes of detection probability for the two species, and this highlights the potential of using the model for further investigation of this poorly understood aspect of citizen-science occupancy modelling. Features such as this can now be explored which has not been possible previously, and there is now much potential for increased understanding and hypothesis generation.
As with all models, several assumptions are made on how the probabilities of species’ presence and the probability of detection vary across sites or years. The validity of results will depend on how realistic these assumptions are and the general appropriateness of the model for the data at hand.
We are very grateful to all of the volunteers who have contributed to the Butterflies for the New Millennium project, which is run by Butterfly Conservation with support from Natural England. We thank Richard Fox for his comments on the manuscript. Byron Morgan was supported by a Leverhulme research fellowship. This work was partly funded by NERC grant NE/T010045/1.
Appendix A Mcmc
To make use of the PG scheme, we augment the parameter space with variable and and we sample from the posterior distribution of the parameters
alternating the following steps:
Update conditional on
We can express the model (1) for as a multivariate logistic regression, by writing
where has a in position and in the rest of the row and has a in position and in the rest of the row. Therefore, we can perform inference jointly across the parameters conditional on using (4).
We note that the matrix multiplication required in (4) is performed in principle in operations, where is the number of columns of . However, we can perform this calculation in operations by taking into account the sparse structure of and . For example,
and hence , and can be computed in respectively , and , where is the number of covariates in . Similar considerations hold for rest of the matrix product products appearing in .
Update conditional on
Even though the PG scheme allow us to sample also jointly with the other parameters, this computation requires operations because of the matrix inversion, which is unfeasible as . Therefore, we sample each independently conditional on the rest of the model parameters. Conditioning on , the posterior distribution for follows a normal distribution.
Although it is straightforward to sample the hyperparametersby performing a Metropolis-Hastings step using the full conditional
, we have noticed that this naive procedure causes extremely slow mixing and in some cases it prevents the Markov chain of the hyperparameters from converging. Instead, taking advantage of the PG scheme, we can sample from the posterior of the hyperparametersby integrating out the time random effects b thanks to the introduction of the PG random variables, and therefore sampling conditional on the occupancy states and the rest of the regression parameters . The marginal distribution is presented in the Appendix.
The posterior distribution of can be written as . The calculation of the density requires the inverse of the matrix , which is computationally intensive to perform at each iteration. For this reason, we perform inference on by performing a grid search on a finite number of values. This allows us to precompute the inverse of the matrix at the specified values of , considerably speeding up calculations.
can be sampled from the full conditional as
Similarly to , can be updated using the PG scheme expressing (2) as a multivariate logistic regression.