Modeling the marked presence-only data: a case study of estimating the female sex worker size in Malawi

by   Ian Laga, et al.

Continued fine-scale mapping of HIV/AIDS populations is needed to meet goals set by global organizations like UNAIDS and WHO. Since key populations like female sex workers (FSW), men who have sex with men (MSM), and people who inject drugs often have higher prevalence of HIV, it is of interest to map these specific key populations. However, key populations are typically difficult (or impossible) to map directly due to social stigma and legal issues, among other reasons. Instead, targeted surveys coupled with auxiliary data can be used to estimate population size with reasonable confidence intervals. One such study collected FSW data by interviewing a subset of venues believed to house FSW and obtained accurate estimates of district-level venue counts. These venues represent a spatial marked presence-only data set. No existing method can estimate the FSW abundance at a fine resolution. Thus, a calibrated-Bayesian model was developed to estimate the prevalence at a fine-cell-level, extending the presence-only literature. The number of FSW at approximately 1.5 × 1.5-km resolution and the corresponding credible intervals are estimated, providing a method for treating HIV in these key populations.



There are no comments yet.


page 16


A Joint Spatial Conditional Auto-Regressive Model for Estimating HIV Prevalence Rates Among Key Populations

Ending the HIV/AIDS pandemic is among the Sustainable Development Goals ...

A Bayesian hierarchical modeling approach to combining multiple data sources: A case study in size estimation

To combat the HIV/AIDS pandemic effectively, certain key populations pla...

Partial identification and dependence-robust confidence intervals for capture-recapture surveys

Capture-recapture (CRC) surveys are widely used to estimate the size of ...

A Bayesian Evidence Synthesis Approach to Estimate Disease Prevalence in Hard-To-Reach Populations: Hepatitis C in New York City

Existing methods to estimate the prevalence of chronic hepatitis C (HCV)...

Multilevel and hierarchical Bayesian modeling of cosmic populations

Demographic studies of cosmic populations must contend with measurement ...

Estimation of the size of informal employment based on administrative records with non-ignorable selection mechanism

In this study we used company level administrative data from the Nationa...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The work in this paper is motivated by the study of the HIV/AIDS epidemic, which is especially a public health threat in Sub-Saharan African countries. Eastern and Southern Africa account for 45% of HIV infections and 53% of individuals living with HIV globally. The 90-90-90 treatment target set by UNAIDS aims to achieve three goals by the year 2020: 90% of all people living with HIV know their HIV status, 90% of all people with diagnosed HIV infection will receive sustained antiretroviral therapy, and 90% of all people receiving antiretroviral therapy will have viral suppression (joint201490). While saving lives has been very successful, the number of new HIV infections was not falling fast enough to meet the 2020 goal (unaids2018data). New HIV infections were reduced by only 18% from 2010 to 2017, much lower than the 2020 target of 75%. In order to decrease new HIV infections, it is first necessary to understand where those with AIDS live. A very large portion of new HIV infections occur in key populations like sex workers, men who have sex with men, and intravenous drug users. Female sex workers (FSW) is one of the largest groups and are 12 times more likely to have HIV than the general population (unaids2014gap). One study of 16 sub-Saharan African countries found the HIV prevalence for FSW above 37% (unaids2014gap).

FSW populations are difficult to map directly because they represent a relatively small part of the population and have an incentive to hide their status. Small area estimates often suffer from large uncertainty due to small sample sizes. The University of North Carolina performed on-site surveying of venues believed to house FSW to estimate district-level FSW size in Malawi, which is by far the most detailed venue level data (PLACEreport)

. Their team published the Priorities for Local AIDS Control Efforts (PLACE) report, which documents how venues were discovered and visited and how FSW size estimates were calculated using probability weights. Prior to surveying venues, the PLACE team obtained accurate counts of the total number of venues in each district. By mapping venues instead of FSW directly, the PLACE team can reach a much larger portion of the FSW population.

In an ideal case, regression models to map population sizes are built on presences and absences: records of both where people live and where no people live. Records that include both pieces of information are known as “presence-absence” or “used-unused” data (pearce2006modelling). Regression methods can typically be applied directly to presence-absence data without much trouble. However, recording absences can be difficult, leading to presence-only data, which is data where only coordinates of observed sites are recorded, offering no information of presence for other locations. One of the most popular methods for fitting presence-only data is the Maxent model (phillips2006maximum), but there are many different approaches to modeling presence-only data, so we recommend pearce2006modelling for a nice introduction. Most of the original presence-only literature used a “pseudo-sampling” approach, which samples background observations and assumes they are absences. Combining the pseudo-absences with the observed presences created a pseudo-presence-absence data set that could be modeled with more traditional techniques. Furthermore, the models all limit inference to binary presence-absence estimates, rather than any measure of cell-level abundance. Later researchers developed point process models for the data which overcame many of the existing problems. See renner2015point and banerjee2014hierarchical for some examples. Given the lack of information about absences, modeling abundance given presence-only data is another challenge. To the best of their knowledge, pearce2006modelling knew of no application modeling abundance given presence only. ward2009presence later developed an EM approach which can estimate abundance given presence only when an estimate of population prevalence is available.

While most of the above methods consider only the presence or absence of an observation, the observations may be marked points. For marked point processes, the observations contain additional covariate information. For example, the locations of trees constitute a point process, and the height or width of the trees are “marks.” banerjee2014hierarchical offers a nice chapter on marked point processes.

We propose a Bayesian model-based approach that can map abundance given the total number of observations. Our proposed model is flexible and can account for the following data properties: cells can contain more than one observation, abundance can be estimated given the total number of observations, random effects can be included to reduce heterogeneity and account for systematic differences between areas, and each observation can be a marked point. Very few of the existing methods can estimate abundance. The EM algorithm in ward2009presence can estimate abundance, but cannot easily be extended to include random effects, and perhaps more importantly, can only handle binary presence-absence observations. None of the point process approaches have a way to incorporate the total number of observations or any measure of abundance. Our proposed model relies on pseudo-absences and is similar to the hierarchical models because the cell-level approach is a approximation to a point process model. The approximation both alleviates computational burden as well as makes it easier to incorporate the district-level venue counts. The Bayesian formulation is also desirable. First, informative priors based on previous studies can be included. Second, Bayesian computation allows us to include a calibration step which adjusts the intercept in the abundance model while still obtaining reasonable interval estimates for parameters and cell-level estimates.

In this article, we map FSW size at fine-scale cells by estimating the number of venues at the cell-level (venue count) and the average FSW size at the venue-level (venue size). The estimated number of FSW at the cell-level (FSW size) is the product of the estimated venue count and the estimated venue size for a given cell. For the PLACE data, the locations of venues constitute the point process, and the venue size represent the marks. The venue count form the presence-only data set, where cells with no venues are unknown. The PLACE report provides the total number of venues in each district.

The paper is organized as follows. Section 2 introduces the data and their special features, and how to handle the spatial misalignment issue. In Section 3, we introduce our choices of distributions and a calibrated Bayesian presence-only approach for modeling the FSW size at the cell-level. We apply the method on the entire PLACE data in Section 4, providing cell-level and district level estimates of FSW size, including credible intervals. We validate our approach on a complete presence-absence subset of the PLACE data in Section 4.3. Finally, Section 5 covers concluding remarks and future work.

2 Data and Special Features

We combine multiple data sources in order to estimate the FSW size at a fine-scale grid. Our responses (FSW size at visited venues and the venue locations) comes from PLACE, while the predictors come from a combination of environmental, demographic, and health data. Subsection 2.1 further introduce the PLACE data. In subsection 2.2 we explore the auxiliary variables we use as predictors of FSW size. Data pre-processing and spatial alignment is handled in subsection 2.3.

2.1 PLACE Data

The Priorities for Local AIDS Control Efforst (PLACE) is funded by the United States Agency for International Development (USAID) and the United States President’s Emergency Plan for AIDS Relief (PEPFAR) to understand the HIV epidemic and help reach the 90-90-90 target (measuresource). PLACE was developed to address the local behavior of HIV transmission. In 2016, The University of North Carolina implemented the PLACE I study in Malawi, surveying five districts (Lilongwe, Blantyre, Mangochi, Machinga, and Zomba) and one city (Mzuzu). Through additional funding, the PLACE II study was implemented in 15 additional districts, including the rest of Mzimba, the district for Mzuzu. The four main objectives enumerated in the PLACE report (2018) are: (1) to conduct programmatic mapping in selected district to identify venues where key populations can be reached, (2) to estimate the size of key population in each district who can be reached at venues, (3) to characterize HIV service coverage indicators for HIV programs reaching key populations, (4) in a subset of district, to survey and test members of key population groups. Figure 1 is a map of the visited venues in the Lilongwe district (PLACEreport). We can see that venues are highly clustered and typically lie along major roads or near the city center. In order to achieve the first goal of PLACE, the team interviewed community informants to form a list of venues which were believed to host FSW, cleaned an extensive list of venues, and visited a subset of these venues. Interviews with knowledgeable personnel at each venue provided estimates of the number of FSW present at each venue. Only visited venue locations were recorded, forming the presence-only data. In PLACE I, the original plan was to visit all identified venues. However, due to time constraints, a convenience sample of venues was visited. In PLACE II, venues were divided into groups based on priority, and venues were sampled from each group via a predetermined probability structure. The complete methodology for both PLACE reports is recorded in the PLACE Report Malawi (PLACEreport).

Figure 1: An examples of visited sites in Lilongwe from the PLACE report. Colors indicate whether any HIV prevention was available at the site in the last 6 months. HIV prevention activities include condom availability, HIV testing, and similar activities. Green sites reported always, orange reported sometimes, and red reported never.

2.2 Auxiliary Variables

Using the auxiliary variables enables us to map a more complete picture of Malawi key populations. In the PLACE report, venue weights were used to extrapolate the FSW size to the district level (PLACEreport). The proportion of venues visited and operational were used to calculate the venue weights. This venue weighting assumes that unvisited venues behaved similarly to visited venues, ignoring the location and local characteristics of unvisited venues. We can improve FSW size estimates by employing a regression-based approach with auxiliary data from multiple sources which accounts for the local characteristics of all venues. The auxiliary variables that we use are the demographic and health survey (DHS) data, night-time lights (nightlight), and WorldPop.

2.2.1 DHS Data

The DHS data are the results of a survey of 26,361 household in Malawi where respondents were asked questions like annual income, age at first sex, and number of sexual partners. Respondents were surveyed between October 2015 and February 2016. DHS divided households into 850 “clusters,” which represents a group of household that are close in proximity. DHS perturbed the cluster locations to ensure the privacy of respondents. We treat the recorded cluster locations as the truth since the effect of the perturbations will have a much smaller effect on the estimates than the variability of the venue counts and the FSW sizes. As the data contains thousands of questions, after consulting with relevant experts, we created an initial list of covariates that we believed would be useful in modeling either the venue count or venue size. DHS clusters (as well as WorldPop and nightlight data discussed later) are shown in Figure 2.

Figure 2: Illustrative maps of the auxiliary covariates. (a) shows which clusters are urban and rural. (b) shows the distribution of the mv191 (combined wealth index factor score). (c) and (d) show the scaled WorldPop and nightlight cell estimates, respectively.

2.2.2 Nightlight

The most popular FSW venues likely produce light during the night, so a measure of nightlight intensity should be a helpful predictor of both venue count and FSW size. Thus, as a predictor, we use nightlight data recorded by the Defence Meteorological Satellite Program (DMSP), now a part of the National Oceanic and Atmospheric Administration (nightlight). The nightlight data measures the low levels of visible-near infrared radiance via satellites. The data was recorded at a very fine scale, resulting in locations of recordings from polar orbiting satellites twice a day. We use 2016 data, aggregated by month. Month averages are then averaged, yielding annual average nighttime values at the locations across Malawi. The scaled log-cell-level estimates are shown in Figure 2.

2.2.3 WorldPop

Areas with higher populations should intuitively have more venues, and potentially more FSW per venue. In order to capture this relationship, we use the 2015 adjusted WorldPop data set, which adjusts the population estimates to match UN national estimates. WorldPop estimates the number of persons per grid square at approximately 100 square meters cells (worldpop). The scaled log-cell-level estimates are also shown in Figure 2.

2.3 Combining the Data

The venues in Malawi represent a marked point process since the observations contains the longitude, latitude, and venue size estimate of visited venues in Malawi (see banerjee2014hierarchical for an introduction to point patterns). However, the response and predictors exist on different spatial levels. We approximate the venue count point process by dividing Malawi into cells and counting the number of venues in each cell. The cell-level approximation makes computation easier and helps incorporate district-level venue counts in the model. For the venue size, the data is modeled at the venue locations. Using the actual venue locations requires only aligning the predictors to the venue locations rather than aligning both the predictors and the venues to a new cell-level location. This subsection discusses our approach for aligning the responses and the covariates.

2.3.1 Cells

We model the venue count at the cell-level and must choose the size of the cells. The cell size should be chosen so that most cells with zero venues are actually zeros, but not so large that homogeneity within the cells is violated too much. To balance these two factors, we divided Malawi into cells of approximately km. The sizes of the cells vary between 2.318 km and 2.392 km, which has a negligible effect on the estimates. The small differences in area arise because the distance between each degree of latitude depends on the distance to the equator. We discarded cells that had zero population according the WorldPop data since these cells would necessarily have zero FSW size and we want to limit our inference to those cells with positive populations. This process leaves 39,312 cells, of which 380 contained PLACE II venues. We describe our spatial kriging approach for estimating covariates at these cells and venues in Section 2.3.2.

2.3.2 Spatial Misalignment

Our goal is to align the all predictors and responses to the same spatial level. For the venue counts and venue size, the data is aligned to the cell-level and venue-level, respectively. This issue is known as spatial misalignment, and has been discussed in detail. See mugglin2000fully, gelfand2001change, young2009linking, carlin1999spatio, banerjee2014hierarchical.

The WorldPop and nightlight data are recorded at a much finer scale than our Malawi cells, thus cell-level values are calculated by averaging the WorldPop and nightlight points which lie inside the cell and are considered to have no measurement error. For the DHS data, many methods are available which account for the spatial variability of the covariates over the cell (e.g. gotway2007geostatistical). Since our cells are relatively small, accounting for variability will have a negligible effect on the estimates. We instead follow the simpler approach suggested by carlin1999spatio.

We aligned the DHS averages to the cell-centroids using ordinary kriging. Kriging provides variance estimates at new locations which quantify the uncertainty of measurement error in our model.

carlin1999spatio assume a conditionally autoregressive prior on the true covariates , i.e. , and , where are the cell-level kriging estimates and are the variance estimates in cell from the kriging procedure. The CAR prior is natural in their paper because they are working with large ZIP code areas. We are modeling small, sometimes isolated cells, so we instead assume a Gaussian Process prior, i.e.



represents the vector of parameters defining the Gaussian Process. We impose a Matérn covariance function. The vector

is estimated via restricted maximum likelihood individually for each DHS covariate.

For the venue size, we krige the covariates to the individual venues locations, rather than cell centroids. For WorldPop and nightlight, the venues inherit the values nearest to in which they lie. The spatial resolution of WorldPop and nightlight are very precise, so these values accurately represent the venue-level values. Once all covariates are on the same spatial scale, the models can be fit using regression, incorporating the measurement error of the kriged covariates.

3 Methodology

In this section, we introduce the approach for modeling the venue size (number of FSW at the venue-level) and venue count (number of venue at the cell-level). The venue locations comprise a presence-only data set. There are many different approaches to modeling presence-only data, so we recommend pearce2006modelling for a nice introduction. Our model is an adaption of the pseudo-absence approaches to model presence-only data, which sample new locations in the domain and assume these locations are absences. In this paper, we sample pseudo-absences according to the strategy in ferrier2002extended.

We make the assumption that given the covariates for a cell, the venue count and the venue size are conditionally independent. We show the validity of this assumption in Subsection 4.4. The details of our approach to modeling presence-only data when given the total number of venues in each district are introduced next.

Predicting cell-level venue size is done in two steps: (1) fit a zero-inflated log-normal model on the observed venue sizes and (2) predict the venue size to all cells using the fitted model. Predicting cell-level venue count is done in five steps: (1) scale the observed cell-level venue counts by the proportion of sampled venues in each district to account for district-level sampling bias, (2) sample the pseudo-absences, (3) fit a zero-inflated negative binomial model on the scaled venue counts, (4) calibrate the model, and (5) predict the venue count to all cells using the fitted model. After modeling venue size and count, cell-level FSW size is estimated by multiplying the venue size and venue count predictions. Further details are found in Section 3.2 for venue size and Section 3.3 for venue count.

3.1 Models

Here we summarize the modeling choices and further specify our choices of priors in the Bayesian framework. Both the venue counts and the venue sizes seem to follow zero-inflated distributions. The general idea for zero-inflated models is that one model is responsible for determining the presence of the response and another model determines the abundance of the response, given presence. One approach to modeling zero-inflation is to separate the zeros and the counts entirely, imposing a strictly positive count distribution for the positive counts. This hurdle model was first discussed in Cragg’s seminal paper cragg1971some. lambert1992zero then introduced a model where the zero could be a result of both the Bernoulli model and the count model. We use the zero-inflated model described in lambert1992zero to model both the venue counts and the FSW size for theoretical reasons. The zero-inflated distribution is given by


where is the probability of excess zeros and the choices of are motivated by predictive performance and are specified in Sections 3.2 and 3.3. Note that for continuous , there is no distinction between zero-inflated and hurdle distributions.

3.2 Venue Size

We can directly model the average number of FSW at the venue-level (venue size) using both PLACE I and PLACE II data. We chose a hurdle log-normal model for the average number of FSW at the venue-level, i.e. in Equation (2) is given by , where represents the observed venue sizes with corresponding covariate matrix

. We use a logit link for

. Given the covariates, we assume the linear regression models:


where denotes the cell number, denotes the district for cell , and are subset of , and are the random district effects. We impose weak priors on the regression coefficients, where , , , , . For the district random effects, we impose normal priors, i.e. , where the argument is either for or . Additionally, .

We chose to model venue size via a hurdle log-normal distribution based on both visual diagnostics and leave-one-out (loo) cross-validation. The hurdle log-normal distribution offered much better predictive capabilities and visual diagnostics than a zero-inflated negative binomial distribution, despite the venue sizes being counts.

3.3 Venue Count

For the venue count, the number of venues at the cell-level are first scaled up by the sampling probability of each district since the proportion of sampled venues is different for each district, i.e.


where represents the observed number of sampled venues in cell , represents the proportion of venues sampled in district , and denotes the nearest integer function. Rounding allows the counts to be modeled via count distributions, which outperform the log-normal distribution with respect to predictive diagnostics. This approach assumes unvisited venues lie in the same cells as visited venues. Because the locations of unvisited venues are unknown, there are no method to exactly recover the true venue counts. The scaling approach in Equation (5) is straightforward and offers a reasonable initial estimate of venue counts.

Next, pseudo-absences are sampled. Let and be the number of observed cells with venues and the number of pseudo-absence cells, respectively. Let represent the matrix of covariates corresponding to . is the covariate matrix for the pseudo-absences corresponding to , a vector of zeros. We form and by stacking with and with , i.e.


Following the procedure in ferrier2002extended

, the pseudo-absences are down-weighted to help estimate approximate degrees of freedom and significance levels and represent the predictions in terms of relative likelihoods. If

is constructed as above, the vector of weights is given by , where 1 is repeated times and is repeated times. We refer to this the combination of the observed count data and the pseudo-absences as the count data.

Using the count data, we can construct an initial model for the number of venues at the cell-level. We model via the zero-inflated negative binomial model in Equation (2), where


represents the density of the negative binomial family. We use a logit link for and log links for both and , the shape and mean of the negative binomial distribution, respectively. We model the probability of presence and the mean by


where denotes the district for cell and . Based on the data and loo cross-validation, a spatial error term is placed on the negative binomial model but not on the Bernoulli model. We impose the same priors on the regression parameters and the random effects. For the shape of the negative binomial distribution, we impose a gamma prior, i.e. . For the spatial term , we choose the exponentiated-quadratic kernel and impose a flat prior on and an informative inverse-gamma prior on as

, where the hyperparameters were estimated from the data.

We initially estimate all model parameters using the models described above, noting that the estimates for and are biased due to oversampling absences. prentice1979logistic showed that the estimates of the slope coefficients in a case-control study are consistent with the true coefficients, so we assume the estimates of the slope parameters are consistent. The intercept and random effects of the Bernoulli model are then calibrated for each posterior sample so that the predicted venue count equals the known venue count at the district-level. By performing this calibration, we convert the relative likelihoods to actual abundance estimates.

We provide the general procedure in Algorithm 1, but ignore the measurement error for readability. Let be the number of districts with presence-only data and represent the known venue count in district .

Set to the desired number of posterior samples after burn-in.
Sample posterior samples, after burn-in.
for each MCMC sample  do
       for each district  do
             Let , be the predicted venue count for all cells in district .
             Solve for
       end for
end for
Algorithm 1 Bayesian presence-only calibration

The superscripts specify that the parameter is from the ’th posterior sample. Thus, assuming conditional independence between venue size and venue count, the fitted value of the ’th cell-level FSW size is given by


and similarly for the standard error.

3.4 Variable Selection

Variable selection for zero-inflated models is nontrivial. There has been recent work in variable selection methods, including LASSO and one-step SCAD techniques (buu2011new) and a stochastic variable selection strategy (cantoni2018stochstic). A simpler strategy, suggested by NCSSnb is to start with a saturated occurrence and abundance model and remove one variable at a time, according to the largest p-value, until all p-values are below some pre-specified threshold. For Bayesian models, we instead use 95% credible intervals. This process creates a list of possible models which were then compared using loo cross-validation and visual diagnostics in order to choose the best model.

3.5 Predicting to New Districts

The random effects in the venue size model contain both PLACE I and PLACE II districts, which cover 19 of the 28 Malawi districts. The venue count model contains only 14 PLACE II districts. In order to predict both venue count and venue size, we need to obtain the random effects for missing districts.

The procedures for the venue size and the venue count random effects are identical and are broken up into two cases: (1) An estimate of the number of operational venues in the district is available, or (2) this estimate is not available. In case (1), the random effects can be found by calibrating the model so that the predicted venue count is equal to these estimates. In case (2), random effects are simulated using the estimates of the random effect variance, i.e. , where represents the estimated variance for the district.

4 Estimation of FSW in Malawi

Here we map the distribution of FSW in Malawi using our calibrated model. For the venue size model, we used all visited venues from both PLACE I and II studies. Due to the convenience sampling for venue locations in PLACE I, only PLACE II venues are used to model venue count, aside from Mzimba. Mzimba was removed because the largest city of Mzimba was visited in PLACE I, while only the rest of Mzimba was visited in PLACE II. Since there are estimated to be only about 2,500 operational venues in Malawi, less than 93.64% of the cells are true absences.

WorldPop values were considered on the log-scale. Since some nightlight values were negative, the values were first scaled by subtracting the minimum nightlight value and then taking the log of the values. The cell with the smallest nightlight value (now negative infinity on the log-scale) was assumed to contain zero FSW and was removed from the data before modeling. All covariates were then standardized to have zero mean and a standard deviation of one. For prediction and calibration, if cells were predicted to have less than 1/2 FSW, the prediction was forced to be zero. This threshold helps handle the thousands of cells that should realistically have zero FSW. All analysis was done using

R (rlang). Spatial kriging is done via the fields package (fields). Bayesian modeling is implemented using the Stan probabilistic programming language (stan) via the brms package (brms). Since the number of samples was fairly large, we fit an approximate Gaussian Process with 5 basis functions and a multiplicative constant of , as suggested in the brms package.

Table 1 summarizes the variables selected in each model, where represents the mean of the negative binomial distribution for the venue size, the mean of the log-normal for the venue count, and and

for mean of the Bernoulli distribution for the venue size and count models, respectively.

Predictor Label Meaning
mv035 Number of wives/partners N/A + N/A N/A
mv167 Times away from home in last 12 months - N/A N/A N/A
mv168 Away for more than one month in last 12 months - N/A N/A N/A
v155 Literacy N/A N/A + N/A
mv191 Wealth index factor score combined - N/A N/A N/A
mv201 Total children ever born N/A - N/A N/A
mv766a Number of sex partners in last 12 months + N/A N/A N/A
hivpos Percent of adults that are HIV positive - N/A N/A N/A
Built Built-up index. Low values represent rural areas N/A + N/A N/A
Travel Travel time to nearest city of 50,000 or more people N/A - N/A N/A
WorldPop 2015 population estimate - - + -
nightlight Night-time light activity + - + -
Table 1: Table of predictors in the final models. Districts are included as random effects in all models. The and indicate the direction of the parameter coefficient for each model, while N/A indicates that the parameter was not included.

4.1 Diagnostics

In this subsection, we include a few diagnostic plots and further justification for modeling choices. Diagnostics for mixture of distributions are inherently difficult. Some work has been done on diagnostic measures for the fit, including half-normal QQ-plots (moral2017half). Since we are in a Bayesian framework, we consider posterior diagnostics. Diagnostic plots for both venue size and venue count are shown in Figure 3. The scatter plots show the average predicted value over all posterior samples against the true observed responses. The structure in the scatter plots are typical of predictions from zero-inflated models. The venue size is more difficult to predict than venue count, especially for large venue sizes. This suggests the venue counts have high variability. Given similar covariates, some venues have very large FSW estimates while some venues are estimated to have zero FSW. The scatter plot for the venue count shows that the model does a good job of predicting venue count.

The hanging rootogram is helpful for diagnosing excess zeros and overdispersion. See kleiber2016visualizing for more details regarding rootograms. For the venue size, the rootogram shows the log-normal distribution is a good model choice, but the large negative bars clearly show that the response values suffered from respondents rounding their venue size estimates. However, the positive values balance out the rounded estimates. For the venue count, the observed bar for is hanging below zero, suggesting there are still excess zeros and the bar for is above zero, suggesting some overfitting to the lower counts. However, the observed counts are generally close to zero, suggesting that the zero-inflated negative binomial distribution does a reasonable job of modeling the data. Overall, the diagnostics look reasonable and confirm our modeling choices.

Figure 3:

Diagnostic plots for venue size and count. Subplots (a) and (b) concern the venue size while subplots (c) and (d) concern the venue count. (a) and (c) are scatter plots of average predicted values against the observed predicted counts. (b) and (c) are hanging rootograms and plot the square roots of the counts against the response variable. The observed counts are shown as light blue bars while the expected counts and corresponding uncertainty are shown as a black line with dark blue shadow.

4.2 Results

The district-level FSW size estimates are shown in Figure 4 as posterior densities. Stars indicate the PLACE Report estimates, but are not available for districts not surveyed in the study. Clearly, the estimated number of operational venues is integral to controlling the variance of the posterior distribution. Not all 95% credible intervals cover the PLACE report district estimates, although most estimates are very close. Note, however, that the PLACE estimates are not gold standards for the district-level FSW estimates.

For privacy reasons, we omit a cell-level mapping of the FSW in Malawi at the resolution used in the model fitting. Instead, we aggregate to much larger cells, but still smaller than the size of a district. Maps of predicted FSW, on the log-scale, and variance of FSW estimates on the log-scale are shown in Figure 5. Districts that were neither in the PLACE I or II study, like Ntchisi and Thyolo, have the largest variance estimates because both random effects had to be simulated.

Figure 4: FSW size estimate credible intervals for all districts in Malawi. Stars indicate the PLACE Report estimate, if available.
Figure 5: Maps of FSW counts in each cell in Malawi (a) and on the log-scale (b). The log of the variance estimates for FSW counts are shown in (c). Cells with estimated zero FSW are shown as gray.

4.3 Model Validation

In this section, we further explore the performance of our calibrated model. As noted in ward2009presence, it is difficult to test presence-only models because the truth is unknown. The PLACE team visited almost every venue in three districts, Mchinji, Mwanza, and Neno, which provides us with a test data set. We refer to these three districts as the complete districts. The number of observed venues against the number of operational venues for the complete districts are 98/102, 75/78, and 64/65, respectively. While the complete districts are not true presence-absence samples, they are nearly complete and can be used to validate the procedure.

We randomly remove 50% of the venues from the complete districts, build a model on the remaining venues and the venues from the other districts using the same distributions and covariates as before, and compare the predictions from the model to the true venue counts and size. We monitor the coverage of both the venue count estimates from the calibrated model and the FSW size estimates. For 95% credible intervals, we except to have 95% coverage. The venue count estimates achieve 98.55% coverage and the FSW size estimates achieve 97.59% coverage. The larger than expected coverage suggests that the credible intervals are mildly conservative for the complete districts, but are reasonably close to the expected coverage. Using the fitted values, we also calculated the Root Mean Square Error (RMSE) for the venue count and FSW size as 0.7219 and 9.4913, respectively.

4.4 Conditional Independence of Venue Size and Count

In order to show the independence of the venue size and venue count conditional on the covariates, we compare the residuals from the venue size and venue count models. zhang2018measuring

proved that under some weak conditions, the random variables

and are conditionally independent given if is independent of . However, since true-data is generally unknown, we limit our analysis to only the three complete districts and only the cells with at least one venue. Cells with zero venues are excluded because we cannot assume that the average venue count in that cell is zero and have no way of estimating it.

Residuals are calculated by subtracting the fitted values from the true values. The scatter plot of the venue size residuals and venue count residuals is shown in Figure 6. The residuals do not appear to have any strong correlation structure, suggesting the conditional independence assumption is reasonable.

Figure 6: A scatter plot of venue size residuals and venue count residuals.

5 Conclusion

In this paper, we have introduced a fully model-based approach for modeling marked presence-only data. The Bayesian implementation allows us to easily incorporate other data-related issues aside from presence-only, like measurement error and spatial dependency, while also seamlessly producing credible intervals. The idea of calibrating the Bernoulli intercept can, however, extend to frequentist models, although confidence intervals and other considerations may be more difficult to calculate.

The original purpose of this paper was to create a method for analyzing the PLACE data in a regression setting since no other method was available. However, the proposed methodology extends not only to other size-estimation problems but also generalizes to binary presence-only data where the total number of occurrences is known, as explored in ward2009presence. The model is fit in an almost identical way as in this paper, except we assume the data follows only a Bernoulli model, as opposed to the zero-inflated models. While the ward2009presence model likely outperforms our model for simple data, our model can include information in the form of priors, model random effects, and account for more complicated data properties, like spatial trends, measurement error, and random effects.

One potential extension is to add uncertainty to the total number of venues in each district. Instead of solving for the random effects such that the summation of estimated venues in the district equals a constant, the constant value could be simulated during each iteration. This approach does require prior information about the certainty of the total number of venues in each district, which may or may not be available to the researcher. The model could also be extended to model venue size and venue count jointly. However, there does appear to be very little correlation between venue size and venue count after accounting for the covariates and this would greatly increase the computation time. Another question that arises is whether interviewing more venues or recording the locations of unvisited venues offers better predictive power. For example, if the survey team can skip an interview at one venue in order to record the location of ten more unvisited venues, would the size of prediction confidence intervals increase or decrease? This is an interesting follow-up question that has real-world application for future sampling procedures. Lastly, improving the venue count scaling in Equation (5) could be the most important area of focus. Being able to more accurately recover the true venue counts in observed cells or removing the need for scaled venues would likely improve the FSW estimates.