Harmonizing Child Mortality Data at Disparate Geographic Levels

by   Neal Marquez, et al.
University of Washington

There is an increasing focus on reducing inequalities in health outcomes in developing countries. Subnational variation is of particular interest, with geographic data used to understand the spatial risk of detrimental outcomes and to identify who is at greatest risk. While some health surveys provide observations with associated geographic coordinates, many others provide data that have their locations masked and instead only report the strata within which the data resides. How to harmonize these data sources for spatial analysis has seen previously considered though no method has been agreed upon and comparison of the validity of methods are lacking. In this paper, we present a new method for analyzing masked survey data alongside traditional geolocated data, using a method that is consistent with the data generating process. In addition, we critique two proposed approaches to analyzing masked data and illustrate that they are fundamentally flawed methodologically. To validate our method, we compare our approach with previously formulated solutions in several realistic simulation environments in which the underlying structure of the risk field is known. We simulate samples from spatial fields in a way that mimics the sampling frame implemented in the most common health surveys in low and middle income countries, the DHS and MICS. In simulations, the newly proposed approach outperforms previously proposed approaches in terms of minimizing error while increasing the precision of estimates. The approaches are subsequently compared using child mortality data from the Dominican Republic where our findings are reinforced. Accurately increasing precision of child mortality estimates, and health estimates in general, by leveraging various types of data improves our ability to implement precision public health initiatives and better understand the landscape of geographic health inequalities.



page 9

page 10

page 12

page 14

page 16

page 17

page 24


Explaining the Decline of Child Mortality in 44 Developing Countries: A Bayesian Extension of Oaxaca Decomposition Methods

We investigate the decline of infant mortality in 42 low and middle inco...

Accounting for Spatial Anonymization in DHS Household Surveys

The household surveys conducted by the Demographic and Health Surveys (D...

A Probabilistic Model for Analyzing Summary Birth History Data

BACKGROUND There is an increasing demand for high quality subnational ...

Prediction of neonatal mortality in Sub-Saharan African countries using data-level linkage of multiple surveys

Existing datasets available to address crucial problems, such as child m...

Geographic and Racial Disparities in the Incidence of Low Birthweight in Pennsylvania

Babies born with low and very low birthweights – i.e., birthweights belo...

An Efficient Production Process for Extracting Salivary Glands from Mosquitoes

Malaria is the one of the leading causes of morbidity and mortality in m...

Estimation of Health and Demographic Indicators with Incomplete Geographic Information

In low and middle income countries, household surveys are a valuable sou...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

While many have touted the importance of creating more precise metrics for directed health policies in low and middle income countries (LMIC)(Bhutta2016, ; Desmond-Hellmann2016, ), the appropriate approach to assess health variation and hot spots is still not agreed upon. Recently there has been an uptick in methods for estimating continuous risk fields for health and demographic outcomes(Wakefield2019, ; Diggle2019, ). Estimating continuous risk fields is appealing in a number of respects. Once inference is available for the field, one can aggregate to arbitrary administrative areas, making it an extremely flexible product for policy makers. In addition, data that differ in the amount of geographical information that are available can be combined, by building on the common latent spatial field. To date, however, there is not complete consensus on the best approach for continuous spatial risk estimation, in particular, how to combine disparate data types. This lack of consensus is especially problematic for LMICs where vital registration systems are rare. For these countries, the most representative and reliable, albeit sparse, source of health outcomes data often comes from household surveys(Alkema2014, ). For example, while many of the previous undertakings that have created continuous spatial risk fields for health outcomes have relied on the Demographic and Health Survey (DHS) for inference, whether and how to include other data sources into the analysis is debated. While the DHS provides a reliable dataset of over 300 surveys constructed with a multi-stage cluster household sampling design in over 90 countries, no one country has been surveyed more than five times in the 16 year period between 2000 and 2015, with many countries only being surveyed on two or fewer occasions(Burgert-Brucker2016, ; Gething2015, ). This limitation has led to several studies that go beyond the inclusion of solely DHS data, and incorporate other types of survey and census data that include measures of the same health outcomes. Including this data, however, has required researchers to come up with alternative modeling approaches as these data does not meet the traditional requirements for geospatial modeling(Golding2017, ; Reiner2018, ).

Even when we restrict ourselves to the domain of survey data collected by DHS and the comparable Multiple Indicator Cluster Survey (MICS), the traditional criterion for inclusion in geospatial analysis, and therefore estimation of continuous spatial risk fields, is that data must be geolocated. This usually means that data are accompanied by a set of coordinates, most often latitude and longitude. Henceforth, we refer to this type of data as point data. This limits the number of data points for analysis even further as the DHS and MICS do not always include information of a surveyed cluster’s coordinates, meaning that survey responses can not be associated with particular locations. Instead, only the stratification from which a particular cluster resides is given, usually an administrative region divided into its urban and rural sectors and henceforth referred to as masked data. Again, several studies have attempted to address this issue by introducing novel methods of dealing with estimating continuous spatial risk fields from data that are collected via cluster sampling but for which only the strata is reported(Golding2017, ; Reiner2018, ; Utazi2018a, ).

In this paper, we present a new analytic approach for dealing with masked data in the context of estimating a continuous spatial risk surface. The method we develop is consistent with the data-generating process, which is not true of other methods that have been proposed in the literature. We begin our discussion with a brief review of the data which will be used for the analysis in this paper, both for anchoring our simulation testing framework and also for our substantive application, which is estimating the under-5 mortality rate (U5MR) in the Dominican Republic (DR). We then present the new method for incorporating masked data into a geospatial analysis. We follow up with a review of other approaches that have attempted to combine point and masked data for this type of analysis and describe why they are flawed. We test the new method in a simulation framework where the true underlying continuous spatial risk field is known, sampling from the field in a number of different ways, and comparing the proposed and existing methods in their ability to recreate the underlying field. Lastly, we apply our method to estimating subnational variation in U5MR for the years from 2000–2014 in the DR using three DHS and MICS surveys.

2 Data

Data on child mortality in the DR are available from two DHS conducted in 2007 and 2013 as well as a MICS conducted in 2014. While the DR has had a functioning Vital Registration system for some time, its coverage has not been comprehensive (Mahapatra2007, ; Mikkelsen2015, ). Furthermore, while the census and the several in country surveys, including Encuesta Nacional de Hogares de Propositos Multiples 2009, have been conducted, only the DHS and MICS surveys have consistent subnational geographic information. DHS data for both years was collected to be representative at the province level, stratified by urban and rural areas, with a total of 31 provinces and one Distrito Nacional which functions as an independent district. Data was collected using women aged 15 to 49 as the target population. The 2002 and the 2010 census were used to create the sampling frames in terms of enumeration areas (clusters) for the 2007 and 2013 surveys, respectively. From these frames, clusters were selected within each of the strata. In each survey, a total of 63 strata were used, one for each of the 31 provinces urban and rural divisions and for the Distrito Nacional which only covers an urban population. From the potential 35,700 and 39,111 enumeration areas (EAs) in the country, 1428 and 524 clusters were selected for the 2007 and 2013 DHS, respectively(DHS2007, ; DHS2013, ). GPS coordinates are given for each cluster sampled in the surveys. Urban/rural cluster locations are displaced by up to 2 km/5 km with the locations of a further 1% random sample of rural clusters being jittered by up to 10 km. Figure 1 shows the spatial distribution of the clusters at the reported locations. As with other studies we will not consider the jittering further.

Figure 1: The points represent the (jittered) cluster locations for the 2007 and 2013 DHS. The labeled counts are of the number of masked clusters in each region, 10 regions in total, for the 2014 MICS.

The data extracted from MICS 2014 were collected to be representative of the 15–49 female population for the 10 administrative regions of the DR, stratified by urban and rural areas, for a total of 20 strata. EAs were used from the 2010 census and 2083 clusters were selected with 33,328 households targeted for surveys(MICS2014, ). For the MICS survey, all cluster locations were removed before the data was made publicly available. Only the information regarding the strata from which a cluster resided was provided as well as which households came from a particular cluster. We emphasize at this point that we are not considering the situation in which the risk (or total count) of a variable is available over a complete area (such as might be available from a census). Methods for this scenario have been considered previously(Wilson2018, ).

Our target of inference is the U5MR, and complete birth histories are collected from women aged 15–49 who reside within households surveyed for both MICS and DHS. Birth histories include all births for the woman being interviewed, the date of birth of the child, as well as the age in months of a child who died under the age of five. The retrospective nature of the data allow one to construct historical rates since older surveyed women provide information over the complete period in which they gave birth. For this analysis, only children born between 2000 and 2014 were retained for the final dataset. Each child’s birth data is divided up into 7 age periods, 0 to 1 month (neonatal), 1 to 6 months (post-neonatal period 1), 6 to 12 months (post-neonatal period 2), and 4 annual periods. Each child contributes a number of observations up to all the age groups through which they are observed. A child does not contribute to all age groups if they die before entering the final age group or if at the time of the survey the child has not reached the final age group. For further details see Wakefield et al. 2019(Wakefield2019, ). In our final data set the DHS 2007 and DHS 2013, contain 503,205 and 310,616 person periods, respectively, and the MICS 2014 contains 1,220,781 person periods.

3 Methods

3.1 Mixture Method For Spatially Masked Data

The data we consider are available as a mix of both point located and masked data. We briefly review the method used for point located data and then introduce our masked data model. For point data, we use a space-time model that has been previously described(Wakefield2019, ). The approach falls under the popular model-based geostatistics approach to estimating a spatial risk surface(Diggle2019, ). Under this approach there is an assumed latent continuous spatial field, which is taken as a Gaussian Process (GP), and the parameters of this field are estimated using likelihood or Bayesian methods. There are many ways to carry out computation for a GP model(Heaton2019, )

and here we follow the stochastic partial differential equations (SPDE) approach

(Lindgren2011, )

, in which the GP is approximated on a triangular mesh to give an efficient Gaussian Markov Random Field (GMRF) representation. To implement this approach one could use Markov Chain Monte Carlo (MCMC), but for spatial models this is challenging

(Filippone2013, ). Instead we use computational approaches that are based on Laplace approximation, namely the integrated nested Laplace approximation (INLA)(Rue2009, ) and Template Model Builder (TMB), both of which have R implementations(Lindgren2015, ; Kristensen2016, ). TMB is less familiar to a statistical audience, but has been successfully used in ecological settings(Thygesen2017, ; Thorson2015, ). We first specify a space-time model for a generic binary indicator for which

is the probability of an event at a cluster with location

and at time . The spatio-temporal model is:



are covariates with associated log odds ratios

and is the space-time model, which is a GP with a covariance function that is a separable process. Specifically, the covariance function is a combination of a spatial dependence structure, , and a temporal dependence structure, , of the form

The multiplicative structure is beneficial because it is easy to construct a valid spatio-temporal covariance function by combining valid spatial and temporal covariance functions. We choose the spatial component of the separable spatio-temporal model to have Matérn covariance function

where is the spatial range corresponding to the distance at which the correlation is approximately 0.1,

is the marginal standard deviation,

is the smoothness, and is a modified Bessel function of the second kind, of order . In our model, the Matérn spatial structure is approximated via a SPDE and combined with an AR(1) process in time. Inference may be carried out for this model using INLA or TMB. In both approaches, samples can be drawn from an approximation to the posterior distribution, in order to make inference about functions of interest. Suppose we observe counts, out of , with the location of data collection , known. Then, the above space-time model can be combined with likelihood, .

Now suppose we have data that has been sampled from a cluster with an unknown location, and label such masked data for cluster in strata , , for (so that is the number of strata). To summarize, within area we have clusters that do not have geolocations available. In this case, under the above binomial sampling model, the implied form is a mixture model, where we mix (average) over all possible unknown locations, say in strata (typically ). The likelihood is:


for clusters without location information in strata , and where is the probability of the cluster falling at location at time . To implement this method, the surface that we are estimating should be sufficiently discretized by what we refer to as cells, from which the cluster could have been drawn. Guidelines for how fine of a discrete surface should be created for this estimation process can be based on what population density data is available, the number of clusters that are known to exist in the strata (this information is often available in the survey reports) and on the guidelines for mesh creation in the SPDE process(Lindgren2011, ). The can be based on population density, which is available, for example, from World Pop(Tatem2014, ; Stevens2015, ).

For our simulation studies as well as our analysis of U5MR we will use values of generated from population rasters made by WorldPop for females age 15–49, which is the age group used for the MICS 2014 dataset.

In an ideal world, we would have the locations of all the clusters in the sampling frame available, and we could then choose to mimic the actual selection probabilities in the survey. This information will rarely be available, and so we instead, effectively, try to approximate the sampling frame locations. Using the population density at the cells also makes sense, since many household surveys use probability proportional to size sampling for the clusters.

3.2 Previous Approaches

There have been two previous approaches to incorporating point masked data into spatial analysis in a health setting. In a recent analysis by Utazi et al(Utazi2018a, ), the authors presented a method for use with DHS data in which all locations are masked. The method, henceforth referred to as the Ecological approach, was applied to the estimation of vaccination rates in Afghanistan and Pakistan for a single year where only masked data are available. They then make predictions on a grid, based on the masked data only. Let and represent the sums of the numerators and denominators in strata (i.e., summed over all masked clusters in ). The key observation here is that these values do not represent the population totals in strata , but rather the sums over the sampled clusters. However, the Ecological method that is fitted is in the spirit of observing the totals, and so attempts to obtain the averaged risk. Unfortunately, a second, and serious problem with the approach is that the method is susceptible to ecological bias(Wakefield2008, ) and the change of support problem (COSP)(Gotway2002, ; Bradley2016, ), otherwise known as the modifiable areal unit problem (MAUP)(Wong2009, ). The Ecological model is,


for , where is the size of strata . The reason that this model is susceptible to ecological bias is that, if the data were a total from the area, then should be evaluated as the average of the risk function, and should not be the risk function evaluated at the averages of the covariates and the average of the spatial field. We note in passing a far smaller problem, that the integrals in (3.2) should be weighted by population density. For almost all geographies, populations are not evenly distributed in space and the uneven distribution should be accounted for, as has been done for many continuous spatial modeling post-estimation aggregation endeavors(Wakefield2019, ; Golding2017, ; Burstein2019, ). Predictions on a grid , are obtained via:


for , where , and indicates that we pick up the random effect for the strata that grid point lies within. In general, when one moves between geographical levels, with a nonlinear model, the regression model changes form, i.e., if one averages the expit function (at the point level), one does not obtain an expit function.

The authors state that they are following on from previous work(Moraga2017, ), but in this paper, the totals over areas (block averages) were being modeled, and the likelihood was normal with a linear model. For this likelihood, the aggregation is straightforward. The extension to modeling the total over areas for Poisson data has also been previously considered(Wilson2018, ). We emphasize that this is all in the context of modeling a total, but that is not the situation for the DHS example considered in the paper that introduces the Ecological method(Utazi2018a, ).

Another approach which has been applied in several different research applications (Burstein2019, ) is the method first described by Golding et al.(Golding2017, ) for child mortality estimation. This method, which we will refer to as the Resample method, can be thought of as a two-step process, that first restructures the data so that it can be used in a more traditional spatio-temporal modeling approach. The pre-processing takes data that is without a known point location and distributes those points randomly in the given area from which the data was collected. A -means clustering algorithm is used to decide on a set of point locations at which to redistribute the masked data points. Specifically, 10,000 point are randomly assigned to cells within an area relative to the population density in that area and then clustered in space relative to a threshold which can be thought of as the number of people represented at each point. Once the points are -means clustered, the individual observations are assigned to each -means cluster relative to the population that each cluster represents. The threshold for how many -mean clusters are created can be changed depending on the setting, however, most analysis using this approach use a similar threshold for distributing points in an area.

The Resample method restructures the input data so that it meets the criteria for inclusion into a traditional spatio-temporal models. Unfortunately, however, the method (like the Ecological method) is more suited to data that are totals over areas (or strata), rather than a subset of data with unknown locations. In the Mixture method that we propose, we are effectively averaging over the potential locations of the mnasked clusters, rather than spreading out the signal from those clusters. The Resample method ignores the uncertainty of attributing data to a location when that location is unknown. Because data is randomly assigned to a location, model uncertainty is artificially deflated for these locations making estimates look more certain than they actually should be, given that it is not known if data were actually collected from that location. In addition, the relationship between covariates and the outcome could become obfuscated since those variables could also be, and most likely are, wrongly attributed to observations.

3.3 Model Fitting Process and the Laplace Approximation

MCMC is notoriously inefficient for GP models(Filippone2013, ) and so instead we rely on analytic approximations(Lindgren2011, ). The R implementation of the INLA approach(Rue2009, ), however, is unfortunately unavailable for the Mixture method that is required for the analysis of masked data. A similar Laplace approximation is available in the Template Model Builder (TMB) R package (Kristensen2016, )

, however. Informally, the Laplace approximation relies on the spatial random effects having posteriors that are not too far from being normally distributed. In general, Laplace type methods are not conducive to being applied to Mixture models. To examine the accuracy of the Laplace approximation for our Mixture method, we perform simulations to test whether the Laplace approximation provides similar results to a more robust (albeit computationally expensive) inference method, MCMC. In both simulation scenarios MCMC and the Laplace approximation provided similar results, particularly for the final estimated field (figures shown in the appendix). In our scenario, there is little information in each cluster, so that the prior (which is normal) dominates. The time difference between the two methods of inference is striking. While the models estimated using the Laplace approximation took several minutes to run, MCMC models ran on the order of days until chains reached acceptable levels of convergence. The time difference was close to a 500 times speed increase when using the Laplace approximation for the same model fit using MCMC. When running the Ecological method we use the INLA package and base our code on the sample coded provided in the original paper

(Utazi2018a, ). For the Resample method, we worked closely with methodological contributors of the original paper in order to implement the method using TMB.

4 Simulation Studies

4.1 Model Comparison in Unit Square Simulations

To test the performance of our approach, we simulate a wide variety of randomly-generated risk fields. Since we know the true parameters in each simulation, we will be able to measure how closely our model was able to recover the true risk field. For reference, we compare our method to the Ecological and Resample methods. This is the first comparative evaluation of the performance of models for continuous spatial estimation from point masked data in a known simulation environment. In addition, we also include two additional approaches to calibrate model performance. First, we ignore the masked data to see how much information is provided when it is included in the model; we refer to this approach as the Ignore model. Second, we fit a model where the true point location of the masked detail is known to the analyst and we may use a spatial-temporal model on the totality of data. This scenario can only be implemented in a simulation environment where the masking of locations is done only for the sake of testing masked data modeling approaches. This approach will be referred to as Unmasked, which other methods should strive to approach in terms of accuracy.

For our first simulation we consider a single spatial field and take the risk of an event at location as


where is a spatial covariate that will be generated from three different distributions.

Figure 2: Examples of three simulated risk fields using each of the three covariate types and the same latent spatial field. The risk field on the left is found by adding the components shown on the right, cell by cell, into the inverse logit function shown in Equation (4).

We lean heavily on the simulation design described by Utazi et al.(Utazi2018a, ). To generate the 45 risk fields we will use in our simulations, we begin with a 6060 cell grid. We assume that the risk for each cell in the grid is generated by an inverse logit of an intercept, one covariate, and a latent spatial field. We create variety in the fields, by varying the following: covariate type, covariate coefficient value, and spatial range. The intercept we hold constant ( for all fields). Since in practice, a variety of covariates generate real-life risk fields, we consider three types: an independent and identically distributed covariate, a covariate correlated across space, and a covariate constant across large geographic regions. Figure 2 illustrates.

For each of these covariate types, we generate fields with covariate values, , in . We also vary the spatial range (distance at which correlation drops to 0.1), over the values . This value dictates the spatial correlation of the spatial components in our model (either the latent spatial field alone or both the latent spatial field and the spatially-correlated covariate). The components that go into these 45 risk fields are shown in Figure 3.

Figure 3: A visual aid to understanding how we generate each the 45 risk fields and 1620 datasets in our simulation.

We sample from each of these generated risk fields to create 1,620 datasets for model testing. Each generation is a combination of a spatial field, one of two sampling types, one of three sampling grids, and one of six sample sizes. The sampling grids divide the spatial field into smaller sub-units that are analogous to strata in the DHS and MICS sampling context. The sampling grids are either 33, 55, or 1010 generating 9, 25 and 100 sub-units respectively. By sampling types we mean the relationship between sub-units and whether data is point masked or not. In the overlap scenario, data that is masked and data with point known locations are simulated from the same sub-units while in the non-overlapping scenario, sub units have data points that are either point known or point masked but not both. The varying of sample size, {50, 100, 150, 200, 250, 300} observations per sampling grid unit, reflects the actual variation in samples collected from the DHS and MICS.

After creating this simulation environment, we fit our five candidate approaches on each of the 1,620 datasets. Model performance is assessed on the known 60

60 cell level probability field by a number of validity metrics. Root mean squared error (RMSE), coverage of the 95% credible/confidence interval at the cell level, and bias of the risk estimates are calculated to compare models. For presentation, we divide our results by covariate type and spatial range to inspect whether there are different compositions of the underlying probability field that favor one model over another. We repeat this entire process 10 times for each simulation scenario – this is a small number in the context of simulation studies, but the whole exercise is computationally expensive and we would rather examine a range of scenarios. As we will see from the results, though this number is small, the pattern that emerges is unambiguous.

In all testing environments, we find that our proposed Mixture method out performs previously presented methods in all of the evaluation diagnostics and is a significant improvement over ignoring the spatially masked data. Results in Figure 4 show the relative improvement in RMSE from each of the methods as compared to the Ecological method. The Mixture method improves on the Ecological method by 4% to 14% depending on the covariate type and spatial range of the underlying probability field. While ignoring the data is consistently the worst performing option presented, the Resample method also fails to perform as well as either the Ecological or the Mixture methods. In terms of coverage, the Mixture method also the most accurate (see Appendix), closest to 95%, as well as the smallest variation around the true 95% coverage, when compared to the candidate models (apart from the Unmasked case).

Figure 4: Improvement in models, relative to the Ecological method, in terms of RMSE. Each column block corresponds to a different covariate type, and each row block to a different range parameter. Ignore model removed because of poor performance.

4.2 Dominican Republic Simulations

In addition to running simulations on a controlled spatial and sampling environment, we are also concerned with how performance differs in a environment where the sampling frame is fixed for an outcome. In the case of sampling child mortality data in the DR, we simulate a spatial risk field over the country and sample in a manner that mimics the sampling frame of the MICS and DHS surveys that took place in the country. By doing so we are able to see which model is able to best reconstruct simulated fields, conditional on the spatial distribution of the population and the locations of samples either up to a given geolocation (point data) as specified in the DHS or to a given stratified administrative areal unit such as in the MICS (masked data). Our simulated field across the DR contains 4532 cells where each cell is approximately a 55 kilometer unit. Because birth history data also provides historical information on child mortality we simulate a spatio-temporal process over the period 2000–2015, using a spatial GP crossed with an AR1 temporal model, as described in equation (1). Death can occur in one of 6 intervals, namely, the first month of life, months 2–6, 7–12, 13–24, 25–36, 37–48, 49–60. In our simulation we have a combination of point and masked data for the first five years of the simulation while the last year has only masked data available for analysis, mimicking the data structure of the 2013 DHS and MICS 2014 modules.

We plot the geolocations of clusters that were sampled across the DR from the 2013 DHS survey as an example of a simulation field in Figure 5. Complete birth histories give us retrospective data for each mother interviewed; we generate our data to have geolocated point observations for years from 2000 to 2013. As stated, MICS data is masked so one only knows the geographical strata in which the sampled cluster lies and not the true cluster location. For each simulation in this experiment we simulate true cluster locations for clusters from the MICS survey, and generate samples at the locations and time points . Cells with a greater population of women ages 15–60 have a higher probability of being selected for cluster placement. The probability that a cluster is placed in any given 55 kilometer cell is where is the population of women age 15–60 in cell (based on the WorldPop estimate) and is the population of women age 15–60 in the strata within which cell is located. Cells may contain more than one cluster, as clusters need not be more than 5 kilometers apart.

Because the surveys used stratified sampling by region and urban/rural, we extend the method described by Utazi et al(Utazi2018a, ). In particular, the Ecological method we use is a conditional auto-regressive (CAR) model with an adjacency matrix for the regions represented in the MICS survey, but including a fixed effect for urban/rural status.

Figure 5: One example simulation risk field on the DR over time, along with cluster sample locations, which are directly pulled from the 2013 DHS.

The underlying spatio-temporal field is simulated using a formulation similar to the previous grid simulation and equation (4

). The difference is that the underlying latent field changes over space and time (over 6 hypothetical years of data) with a precision matrix that is the Kronecker product of an AR1 time series model and a Matérn spatial GP, similar to the model we use to estimate U5MR. The variance for the process is fixed at 1 and the temporal (AR1) correlation is set to 0.95 for all simulations of the spatio-temporal field. Random values for the covariate effects, covariate types, spatial range of the geographic correlation process, and the placement of the clusters in our simulation, are altered for each simulation in the same manor as the grid simulation. We do not, however, change the sample size, grid, or types, as done in the previous simulation, as these values are dependent on the clusters sampled in the DHS and MICS modules. In total we create 45 risk fields, and repeat this process 10 times for additional samples. We compare models using the same performance metrics as for the unit square simulation at the cell level. In addition, we also aggregate risk fields to the 31 province level averages and compare how population weighted aggregated estimates compare with true known values. Typically, small area aggregate measures are more useful for evaluating progress and policy making decisions

(Wakefield2019, ; Burstein2019, ) and thus comparing how models fair in their ability to recover the true value of health risks at a administrative level is of great importance. We use the same metrics as for the cell level validation in order to assess the performance at the regional level.

We find similar results for the conditional simulation scenario compared to the unit square simulation. The Mixture method outperforms the Resample method, the Ecological method, and ignoring the point masked data, for both the coverage and RMSE metrics. Interestingly, the Resample method appears to be somewhat improved over the Ecological method, in contrast to the first simulation.

Model improvement over the Ecological method ranged from 7% to 16% average improvement over the combinations of spatial range and covariate type (figures for cell level comparison may be found in the Appendix). When aggregated to the province level we find that discrepancies between the Ecological method and the Mixture method are exacerbated. We find that the RMSE of province level risk is on average more improved upon than at the cell level when comparing the Ecological method to the Mixture method, with results improving up to 25% on average under different combinations of spatial range and covariate type. Comparisons at the province level of RMSE for a single simulation across all time periods are shown in Figure 6, with the columns corresponding to years of data, labeled 0 through 5. Coverage estimates at both the cell level and the province level are more reliable and unbiased for the Mixture method, compared to both the Resample method and ignoring the masked polygon data. Comparisons of bias of estimates and bias of dissimilarity measures at both province and the cell level may be found in the Appendix.

Figure 6: Comparison of Absolute Error at the Province level for DR simulation. Columns represent 6 different years of analysis and predictions while rows represent the different methodological approaches. Higher values indicate worse absolute error.

5 Results for the Dominican Republic

For the U5MR simulation and application, we do not have a single binary indicator, but rather a series of binary indicators corresponding to a discrete hazards formulation. We use a modeling approach similar to previous work(Wakefield2019, ), though now including data that is location masked and we do not include an additional adjustment for HIV bias correction as prevalence among women in the child bearing age groups has been less than 1.2% in the DR (Frank2019, ). An additional difference is the binning of age groups and the associated parameter effects. We also used the age groups of Burstein et al.(Burstein2019, ) with random walks across time.

For each of the Mixture, the Resample and the Ignore approaches we fit 3 models of increasing complexity, which account for different factors being important for child mortality. We take as our aim estimating U5MR at the region level over the period 2000–2014. In the model we define, the mortality risk as a function of location , time , age , survey , cluster and the urban/rural status of the cluster. To simplify notation we do not explicitly make the probabilities a function of all of these parameters, but for age it is important, since the responses recorded are a function of and so we write and to denote the number of deaths and number of exposure months in age group at location and in year , with associated hazard . The models we use are summarized as:


where is an indicator of whether the cluster lies in an urban region, with the associated odds ratio. Model 1 acts as our baseline model and includes, in addition to the urban/rural effect, a fixed effect for each age group, a survey random effect and a spatio-temporal (GP) effect . Model 2 accounts for possible differences in the trajectories of age specific mortality by adding independent random walks of order 2 for each age group , with a common variance . Model 3 builds on Model 2 and accounts for variation within clusters (excess-binomial variation) through an independent and identically distributed random effect for each cluster . The cluster label is a function of . More details on the model specifications may be found in the Appendix.

For each parameter configurations we can test the different methods for handling point masked data either by ignoring it, implementing the Resample method, or using the Mixture method giving us a total of 9 fitted models. For each of these candidate models, we assess performance by fitting the model to all data except for one year of data which is held out from model fitting and only used for model validation. This leaves us with 14 different trails to pick a best model from the 9 candidates. We evaluate models by assessing their likelihood value on point data that was held out from the model fitting process.

Overall, the Mixture method again outperformed the other models that were tested. For the 14 years of out of sample data available for comparison the best performing models were: Mixture model 2 six times, Resample model 3 two times, Resample model 1 two times, and Mixture model 3 one time, Ignore model 1 one time, Ignore model 2 one time, and Resample model 2 one time. This out of sample testing allows us to put more faith in choosing the Mixture method for the DR data context. For the remainder of the paper we will use Mixture model 2 when referring to the Mixture model and Resample model 3 when referring to the Resample model in the final results.

Figure 7: Province level estimates of U5MR per 1000 births, for the Mixture Model.

Figure 7 shows estimates of province level changes of U5MR from 2000 to 2015 for the Mixture model. The DR has seen dramatic changes in U5MR since the early 1960’s however more recently changes are less striking across the country, as reported by a number of sources(Wang2012, ; Dicker2018, ). Our estimates are consistent with previous findings, as there is not strong temporal variation in the estimates. However, there is persistent strong geographic variation. The west of the DR has higher levels of U5MR and the central north has the lowest levels. If we compare the results to the Resample model, we see that there is a different temporal result, with a significant decline in U5MR since 2000, despite having similar spatial variation in mortality. This very different trend in later years despite using the same data brings to question which model we should place more value in.

In order to get a better point of comparison we show results for both models, aggregated to the country level, in Figure 8, along with other studies and publications which estimate child mortality within the DR at this level. We include for comparison the Global Burden of Disease (GBD) 2017 national level results(Dicker2018, ). In the appendix we include as well comparisons to the direct estimates from the 2013 DHS and 2014 MICS using Horvitz Thompson (HT) weighted estimators(Horvitz1952, ) used in other spatial child mortality contexts(Mercer2015, ) as well as smoothed national level estimates(Mercer2015, ). We see that all models, except for the Resample model show no indication of a strong temporal change in the U5MR between 2000 and 2014. The potential reasons for these differences are many-fold. However, the Resample method can not be statistically justified from a methodological viewpoint and it did not perform as well as the Mixture method in either of the simulation scenarios. Further, it was also inferior in the sample validation tests. So we do not believe it advisable to accept the Resample results, which differ from every other model of child mortality that we have presented. All code used to run these analysis may be found on the authors github page, https://github.com/nmmarquez/PointPolygon, or upon request.

Figure 8: National level estimates of the U5MR over time, from different models.

6 Discussion

Precise estimates of epidemiological and public health outcomes are vital for targeted policy making to improve the standing of those that lie at the margins of good health. In an attempt to provide better fine scale geographic estimates of measures such as the child mortality rate(Golding2017, ; Burstein2019, ) and vaccination coverage(Utazi2018a, ), researchers have proposed methods to incorporate not only point located data but also masked data. To date, however, these methods have not been compared for accuracy and validity, and neither the Ecological nor the Resample methods accurately reflect any feasible data generating process.

In this paper we present an alternative method for incorporating polygon data into continuous spatial analysis. We build two spatial simulation scenarios, one generalized scenario simulating continuous spatial fields on a 11 grid that is common for simulations in the spatial literature and another reflecting a typical scenario in global health and epidemiology research which use DHS and MICS surveys. For each simulation scenario we build various spatial fields and compare our new Mixture method to the Ecological and Resample methods. We found in all simulations, via a number of validation techniques, that our newly presented Mixture method outperformed both the other methods and approaches which neglect to incorporate the masked data and use point located data only.

In addition, we also fit a variety of increasingly complex versions of the new Mixture method, the Resample method, and a traditional continuous spatial field model, to child mortality data collected from MICS and DHS surveys. We perform a number of out of sample validation tests by withholding one year of data from the model fitting process and assessing a model based on how well it was able to predict that out of sample data by assessing the probability of the left out data using model predictions. For a majority of annual holdout tests we find that variations of the Mixture method outperform other methods tested in terms of out of sample predictive validity. Furthermore, when using the entire data set, the Mixture method produces results at the national level that are more consistent with previous estimates of the U5MR in the DR in comparison to the next best performing model, which uses the Resample method.

Where we do find similarities in all models is the spatial variation that exists in any given year. While the degree of these differences vary from model to model, all results show especially pronounced differences between eastern and western provinces. This finding is corroborated by the results in a publication by Burstein et al(Burstein2019, ). Though their method of spatial analysis follows the Resample paradigm, their final results are adjusted such that they are consistent with the GBD 2017 DR child mortality results, and thus show little change over time in the country level estimates of child mortality.

Ensuring that we are accurately incorporating data into our modeling process is vital to producing valid estimates of public health phenomena, especially when the results are to be used for targeted policy making. In our analysis we show that our newly presented Mixture method is able to better leverage data that is not typically suited for continuous spatial analysis, to improve the accuracy of our results. We acknowledge that the scenario considered only represents one specific type of areal data, data that was originally collected at the cluster level, but with the locations masked from the end user. Many other types of polygon data do exist, such as average risk over administrative data found from vital registration records or censuses. This situation has been considered previously(Wilson2018, ). We also do not consider the effect of jittering. Nonetheless, the type of masked data presented in this paper represents a substantial portion of traditionally used data sources for health outcomes such as the DHS and MICS.


Wakefield was supported by National Institutes of Health grants R01CA095994 and R01AI029168. Nat Henry, Aaron Osgood-Zimmerman, and Katie Wilson have provided valuable feedback during the course of this research.


  • (1) Bhutta ZA. Mapping the geography of child mortality: a key step in addressing disparities. The Lancet Global Health 2016; 4: e877–e878.
  • (2) Desmond-Hellmann S. Progress lies in precision. Science 2016; 353: 731.
  • (3) Wakefield J, Fuglstad GA, Riebler A et al. Estimating under-five mortality in space and time in a developing world context. Statistical Methods in Medical Research 2019; 8: 2614–2634.
  • (4) Diggle P and Giorgi E. Model-based geostatistics for global public health : methods and applications. First ed. Chapman and Hall/CRC, 2019.
  • (5) Alkema L and New JR. Global estimation of child mortality using a Bayesian B-spline bias-reduction model. The Annals of Applied Statistics 2014; 8: 2122–2149.
  • (6) Burgert-Brucker CR, Dontamsetti T, Mashall A et al. Guidance for use of the DHS program modeled map surfaces. Technical report, ICF International, Rockville, MD, 2016.
  • (7) Gething P, Tatem A, Bird T et al.

    Creating spatial interpolation surfaces with DHS data.

    Technical report, ICF International, Rockville, MD, 2015.
  • (8) Golding N, Burstein R, Longbottom J et al. Mapping under-5 and neonatal mortality in Africa, 2000-15: a baseline analysis for the Sustainable Development Goals. The Lancet 2017; 390: 2171–2182.
  • (9) Reiner RC, Graetz N, Casey DC et al. Variation in childhood diarrheal morbidity and mortality in africa, 2000–2015. New England Journal of Medicine 2018; 379: 1128–1138.
  • (10) Utazi C, Thorley J, Alegana V et al. A spatial regression model for the disaggregation of areal unit based data to high-resolution grids with application to vaccination coverage mapping. Statistical Methods in Medical Research 2019; 28: 3226–3241.
  • (11) Mahapatra P, Shibuya K, Lopez AD et al. Civil registration systems and vital statistics: successes and missed opportunities. The Lancet 2007; 370: 1653–1663.
  • (12) Mikkelsen L, Phillips DE, AbouZahr C et al. A global assessment of civil registration and vital statistics systems: monitoring data quality and progress. The Lancet 2015; 386: 1395–1406.
  • (13) Encuesta demográfica y de salud 2007. Technical report, Centro de Estudios Sociales y Demográficos - CESDEM/República Dominicana and Macro International, Santo Domingo, República Dominicana, 2008.
  • (14) Encuesta demográfica y de salud 2013. Technical report, Centro de Estudios Sociales y Demográficos - CESDEM/República Dominicana and ICF International, Santo Domingo, República Dominicana, 2014.
  • (15) Encuesta de indicadores múltiples por conglomerados año del trabajo de campo: 2014. Technical report, Oficina Nacional de Estadistica, 2016.
  • (16) Wilson K and Wakefield J. Pointless spatial modeling. Biostatistics 2018; Published Online 06 September 2018.
  • (17) Heaton MJ, Datta A, Finley AO et al. A case study competition among methods for analyzing large spatial data. Journal of Agricultural, Biological and Environmental Statistics 2019; 24: 398–425.
  • (18) Lindgren F, Rue H and Lindström J. An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society, Series B 2011; 73: 423–498.
  • (19) Filippone M, Zhong M and Girolami M. A comparative evaluation of stochastic-based inference methods for Gaussian process models. Machine Learning 2013; 93: 93–114.
  • (20) Rue H, Martino S and Chopin N. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society, Series B 2009; 71: 319–392.
  • (21) Lindgren F and Rue H. Bayesian spatial modelling with R - INLA. Journal of Statistical Software 2015; 63: 1–25.
  • (22) Kristensen K, Nielsen A, Berg CW et al. TMB : automatic differentiation and Laplace approximation. Journal of Statistical Software 2016; 70: 1–21.
  • (23) Thygesen UH, Albertsen CM, Berg CW et al. Validation of ecological state space models using the Laplace approximation. Environmental and Ecological Statistics 2017; 24: 317–339.
  • (24) Thorson JT, Skaug HJ, Kristensen K et al. The importance of spatial models for estimating the strength of density dependence. Ecology 2015; 96: 1202–1212.
  • (25) Tatem AJ. Mapping the denominator: spatial demography in the measurement of progress. International Health 2014; 6: 153–155.
  • (26) Stevens FR, Gaughan AE, Linard C et al.

    Disaggregating census data for population mapping using random forests with remotely-sensed and ancillary data.

    PLoS One 2015; 10: e0107042.
  • (27) Wakefield J. Ecologic studies revisited. Annual Review of Public Health 2008; 29: 75–90.
  • (28) Gotway CA and Young LJ. Combining incompatible spatial data. Journal of the American Statistical Association 2002; 97: 632–648.
  • (29) Bradley JR, Wikle CK and Holan SH. Bayesian spatial change of support for count-valued survey data with application to the American Community Survey. Journal of the American Statistical Association 2016; 111: 472–487.
  • (30) Wong D. The modifiable areal unit problem (MAUP). In The SAGE Handbook of Spatial Analysis. London, United Kingdom: SAGE Publications, 2009. pp. 104–123.
  • (31) Burstein R, Henry NJ, Collison ML et al. Mapping 123 million neonatal, infant and child deaths between 2000 and 2017. Nature 2019; 574: 353–358.
  • (32) Moraga P, Cramb SM, Mengersen KL et al. A geostatistical model for combined analysis of point-level and area-level data using INLA and SPDE. Spatial Statistics 2017; 21: 27–41.
  • (33) Frank TD, Carter A, Jahagirdar D et al. Global, regional, and national incidence, prevalence, and mortality of HIV, 1980-2017, and forecasts to 2030, for 195 countries and territories: a systematic analysis for the Global Burden of Diseases, Injuries, and Risk Factors Study 2017. The Lancet 2019; 6: e831–e859.
  • (34) Wang H, Dwyer-Lindgren L, Lofgren KT et al. Age-specific and sex-specific mortality in 187 countries, 1970-2010: a systematic analysis for the Global Burden of Disease Study 2010. The Lancet 2012; 380: 2071–2094.
  • (35) Dicker D, Nguyen G, Abate D et al. Global, regional, and national age-sex-specific mortality and life expectancy, 1950-2017: a systematic analysis for the Global Burden of Disease Study 2017. Lancet 2018; 392: 1684–1735.
  • (36) Horvitz DG and Thompson DJ. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 1952; 47: 663–685.
  • (37) Mercer LD, Wakefield J, Pantazis A et al. Space-time smoothing of complex survey data: small area estimation for child mortality. The Annals of Applied Statistics 2015; 9: 1889–1905.

Appendix A Appendix

a.1 Appendix A: Model Specifications

a.1.1 Grid Square Simulation

a.1.2 Dominican Republic Simulation


a.1.3 Dominican Republic Child Mortality Estimation


a.2 Appendix B: MCMC and Laplace Approximation

To assess the validity of using a Lpalace Approximation for the mixture model estimation process we took a single dataset from the unit square simulation scenario and estimated the model using both a Laplace approximation and using MCMC. We use the same model fitting process for the Laplace Approximation explained previously. For the MCMC portion we fit the model using STAN and a no U-turn sampler. Four chains were fit and the mean posterior of the field probability estimates for each chain were compared to the mean estimated probability of the fitted Laplace Approximation fit. In addition we visually compared the posterior standard deviation of the probability field to that of the Laplace fit and found all estimates to be similar. Despite the models producing similar results the run times are drastically different. While the Laplace Approximation took less than 2 minutes to fit each MCMC chain took 4 days to run 2000 steps and sensible convergence was reached. Fitting models using MCMC is infeasible in our entire simulation framework where we are comparing thousands of fit models, however, with the nearly 4000 times speed up from the Laplace approximation we are able to test a wider array of simulation scenarios to assess model performance.

Figure 9:

Mean and 95% credible interval for posterior of estimated probability field for one MCMC chain compared to mean estimate of Laplace Approximation fit.

Figure 10:

Standard deviation of posterior and standard error of probability for MCMC and Laplace Approximation using Template Model Builder respectively.

a.3 Appendix C: Grid Square Simulation Results

Figure 11:

Mean and 95% quantile range for 95% confidence/credible intervals for cell level estimates of risk field across all simulations.

Figure 12: Mean and 95% quantile range for bias measures for cell level estimates of risk field across all simulations.

a.4 Appendix D: Dominican Republic Simulation Results

Figure 13: Mean and 95% quantile range for RMSE for cell level estimates of risk field across all simulations.
Figure 14: Mean and 95% quantile range for 95% confidence/credible intervals for cell level estimates of risk field across all simulations.
Figure 15: Mean and 95% quantile range for bias measures for cell level estimates of risk field across all simulations.
Figure 16: Mean and 95% quantile range for RMSE for province level estimates of risk field across all simulations.
Figure 17: Mean and 95% quantile range for 95% confidence/credible intervals for province level estimates of risk field across all simulations, for provinces.

a.5 Appendix E: Comparison of U5MR Results

Figure 18: National level estimates of the U5MR over time, from candidate models and direct estimates.
Figure 19: National level estimates of the U5MR over time, from candidate models and smoothed direct estimates.