Intercensal updating using structure-preserving methods and satellite imagery

by   Till Koebe, et al.
Freie Universität Berlin

Censuses are fundamental building blocks of most modern-day societies, yet collected every ten years at best. We propose an extension of the widely popular census updating technique Structure Preserving Estimation by incorporating auxiliary information in order to take ongoing subnational population shifts into account. We apply our method by incorporating satellite imagery as additional source to derive annual small-area updates of multidimensional poverty indicators from 2013 to 2020 for a population at risk: female-headed households in Senegal. We evaluate the performance of our proposal using data from two different census periods.



There are no comments yet.


page 1

page 8

page 10

page 12

page 22


Mapping Missing Population in Rural India: A Deep Learning Approach with Satellite Imagery

Millions of people worldwide are absent from their country's census. Acc...

Deep Built-Structure Counting in Satellite Imagery Using Attention Based Re-Weighting

In this paper, we attempt to address the challenging problem of counting...

Creation of digital elevation models for river floodplains

A procedure for constructing a digital elevation model (DEM) of the nort...

The SpaceNet Multi-Temporal Urban Development Challenge

Building footprints provide a useful proxy for a great many humanitarian...

Wheat Crop Yield Prediction Using Deep LSTM Model

An in-season early crop yield forecast before harvest can benefit the fa...

Using satellite imagery to understand and promote sustainable development

Accurate and comprehensive measurements of a range of sustainable develo...

BUDD: Multi-modal Bayesian Updating Deforestation Detections

The global phenomenon of forest degradation is a pressing issue with sev...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The estimation and monitoring of key statistical indicators are essential for the definition and adequate execution of public policies. How many young women without educational attainment currently live in district A? Answers to questions like this inform local policy-making on a daily basis. On a global scale, the 2030 Agenda for Sustainable Development highlights the need for information to be disaggregated not only geographically, but also along other dimensions such as sex, age, and employment status to identify and reach the most vulnerable in a society ([UnitedNationsGeneralAssembly2015ResDevelopment]). Statistical systems around the world increasingly face growing demand for statistical information with specific quality characteristics such as disaggregation, periodicity and relevance in order to guide policy decisions, since these require a more nuanced and up-to-date picture than national-level indicators can provide. However, serving this demand appropriately remains a challenge in developing countries: censuses are collected every ten years at best for a limited scope of indicators. Household surveys are usually conducted more frequently covering a broad range of indicators, although they may not provide representative results at higher levels of disaggregation. In addition, data collection costs for both instruments scale unfavourably with the sample size. While some countries can link statistical data with high-quality administrative records in order to fill gaps, e.g. using small area estimation methods ([Pfe13, RaoMol15, Tzavidis2018FromStatistics]), many others lack appropriate administrative data systems to do so.

To overcome the shortage of geographically disaggregated data, academic literature has seen a rise of approaches in recent years that use non-traditional (big) data sources. In particular, night-time lights and land use information captured by satellites, as well as mobility and social network insights via mobile phone metadata, are increasingly used for estimating disaggregated key statistical indicators related to economic growth ([Chen2011UsingStatistics, Henderson2012MeasuringSpaceb, Pinkovskiy2016LightsDebate]), literacy ([Sundsy2016CanCountry, Schmid2017ConstructingSenegalb]), population density ([Harvey2002EstimatingLimitations, Bonafilia2019MappingLearning, Leyk2019AllocatingUseb, Steinnocher2019EstimatingImagery]) or poverty ([Jean2016CombiningPoverty, Pokhriyal2017, Weidmann2017UsingWealth]). The United Nations (UN) recommends using satellite imagery to prioritize and check geospatial processes such as the delineation of enumeration areas during census preparation ([DepartmentofEconomicandSocialAffairsUN2009HandbookActivitiesc]). It further supports the construction of population grids as a common spatial reference system as proposed by [Stevens2015DisaggregatingData, Freire2016DevelopmentResolution, Boo2020ASurveys].

While these approaches can increase spatial granularity by assuming that correlations captured for large areas also hold for small areas, they suffer of five major shortcomings: First, they still require disaggregated statistical data from surveys or censuses for training and model fitting, ideally captured at the same point in time. Second, only variables of interest that correlate with covariates from the novel data source can be modelled appropriately. Third, little research has been done so far to investigate whether these inferential relationships also hold over time. Fourth, some of the novel data sources are privately owned, e.g. high resolution satellite imagery and mobile phone metadata. Creating reliable long-term access options, a precondition for embedding them in statistical business processes, is an institutional and legal challenge on its own. Fifth, many new approaches use sophisticated machine learning in combination with data generated in processes beyond the control of statistical offices. Adopting them would require a major paradigm shift in the work of official statisticians. However, and most likely given the capacity constraints many statistical offices in the world face, it discourages adoption altogether.

Another strain of research, usually known as updating, intercensal or postcensal estimation methods, focuses on the periodicity of statistical data by tackling the problem of the time-gap between two sources of information. Traditional demographic techniques such the component method ([united1956manual]) and the vital rates method or regression symptomatic procedures ([Rao03]) build on long-term trends around fertility, mortality, and migration and have been extensively used to obtain postcensal population estimates, often called population projections. Most of these techniques rely on survey information, therefore limiting the level of spatial granularity for which population updates can be provided. Large-area population estimates are then distributed to small areas via the proportion method ([united1956manual]), i.e. fixed population shares captured in the latest census. Population shifts within these large areas, e.g. induced by urbanization, desertification or internal conflicts are thus not accounted for. Hence, the lack of disaggregated, frequent, relevant, reliable, and timely statistics is still largely unsolved.

The structure preserving estimation (SPREE) method has been widely used to produce intercensal estimates for disaggregated population counts ([Purcell1980PostcensalDomains]). Unemployment ([luna2015small]), occupational classification ([Hid09]) and poverty estimation ([Isidro2016ExtendedPovertyb]) are examples of the application of this method and its extensions. In SPREE, disaggregated postcensal estimates are generated by updating the margins (called the allocation structure) of a census composition (called the association structure). The updated margins, i.e. the distribution of the variable of interest (e.g. poverty status by sex) and the population sizes for each domain, are usually provided by recent survey data. In case population sizes by domains cannot be sufficiently generated due to sampling design constraints, population projections as described above can be used instead ([luna2016multivariate]).

In this paper, we provide a unified framework for the two aforementioned, so far mainly disparate, fields of research - using novel data sources to add spatial granularity and updating existing census structures with long-term trends and recent data to add disaggregation, periodicity, and reliability. Specifically, we study the potential benefit of including auxiliary data in the SPREE process by strengthening the population estimates in local areas. The aim is relaxing rigid assumptions on the subnational population allocation between census rounds by tracking population changes across small areas over time using auxiliary information without jeopardizing the compelling simplicity of the SPREE approach. We also provide uncertainty estimates for the postcensal predictions based on our approach following a semiparametric bootstrap procedure. We apply the proposed methodology in a case study in Senegal. Since its latest census in 2013, Senegal has started to roll out a nation-wide social safety net program to lift families out of poverty including cash transfers for up to every fourth Senegalese household ([GouvernementduSenegal2017ProgrammePNBSF, ANSD2015EnqueteVulnerables1]). Therefore, we expect that the only source of commune-level data on multidimensional poverty, the census 2013 data, has become quickly outdated and therefore is ill-suited for further local policy guidance. In a first step, we validate the value added of our proposed methodology by updating the census 2002 to 2013 and comparing the results with the actual census 2013. Finally, using data from census 2013, multiple rounds of major household surveys, population projections, and satellite imagery-based population estimates we retrieve annual point predictions including corresponding mean squared error (MSE) estimates of the multidimensional headcount ratio for the 552 communes in Senegal up to 2020.

The document has the following structure: a summary of SPREE is included in Section 2 as well as the proposed updating strategy which incorporates auxiliary information such as satellite imagery. MSE estimates for corresponding point estimates are also provided. Section 3 is dedicated to introduce the case study and the data sources used in both the validation and the application. In Section 4 we present the simulation exercise to validate our proposal. Section 5 summarises the application results in which we provide small-area-level updates of the multidimensional headcount ratio for several postcensal years in Senegal. Conclusions and further research questions are presented in Section 6.

2 Methodology

The method proposed to obtain updated indicators in small areas by including auxiliary information is described in this section. After a brief introduction to SPREE, we present the main scientific contribution of this paper: incorporating auxiliary information to improve the accuracy of small-area census updates by relaxing the assumption of fixed population margins between census years.

2.1 Structure Preserving Estimation (SPREE)

SPREE is an updating technique for intercensal population compositions. In general, population compositions can be thought of as multiway contingency tables. We denote the composition of interest as

with representing the target areas i.e. small areas and representing the categories of interest, e.g. poor/non-poor, at time . Consequently, represents the row margins, i.e. population sizes for each area, and the column margins, i.e. distribution of the categories of interest at national level (global totals) since the last census. Here, we assume the initial population composition to be generated at time . Since that is usually the year of the latest census, we simply call it the census composition. For better legibility, we denote the census year without a time index, e.g. instead of . Further, we assume that smaller areas are disjoint subsets of larger areas (as it is commonly the case for administrative areas) which we denote with .

In order to illustrate how a census composition can be updated via the margins, we describe it as a log-linear model (here for a two-way case such as poverty status by area):

where represents the overall mean, the main effect of areas, the main effect of categories, and the interaction term also known as the association structure. The definition of these parameters can be found in detail in most of the SPREE literature (see e.g. [luna2016multivariate]). Following the original proposal of [Purcell1980PostcensalDomains], two basic structures are required to apply this methodology: an association structure which provides information about the interaction between areas (rows) and categories (columns), and an allocation structure represented by , and that provides information about area and category terms. Generally, in order to capture the association structure appropriately, each combination of areas and categories (called cells) must be backed by sufficient data. Especially for countries lacking effective administrative data systems, censuses are the only source of information at that level of detail as surveys usually suffer of sample size constraints. Consequently, the census composition is only available every ten years at best for many countries in the world and therefore the association structure is assumed to be constant between census years . Usually the allocation structure is gathered from direct survey estimates, however, spatial disaggregation here is often limited by the lack of reliable subnational population margins.

The fitting process in the original SPREE model is achieved via the iterative proportional fitting (IPF) algorithm ([Deming1940]), where reliable survey row and column margins are fitted to a census composition (). Then, the iterative process to obtain the updated point estimates is indicated as in [luna2016multivariate] as:


There are some basic limitations to this methodology:

  1. [label=()]

  2. the survey(s) used to generate and must contain the same variable definition as the census or a high correlate of it ([Green1998]),

  3. the survey margins must follow the same disaggregation scheme as defined in the census (areas/categories),

  4. the margin totals and need to add up perfectly,

  5. only categorical variables are allowed (e.g. poverty status = poor/not poor).

The literature on SPREE methodology can be roughly summarised into three groups: a) original SPREE as defined by [Purcell1980PostcensalDomains] and the proposal of [Noble2002] to formulate SPREE under the generalized linear model (GLM) framework, b) SPREE extensions focused on bias reduction as suggested by [Zhang2004SmallCross-classifications] and later extended by [luna2016multivariate] through incorporating cell-specific random effects and informative sampling design and c) SPREE extensions to account for situations where the variable of interest is not available in the census ([Isidro2016ExtendedPovertyb]). As far as we know, the only work that uses additional sources of information besides one census, survey data, and population projections is [luna2015small] who consider a second census composition for updating ethnicity categories by age groups in England in 2011. While the population census provided information for all age groups, a scholar census offered better information for children between five and 15 years old. The authors studied the possibilities of updating the estimates considering that one of the age groups has two different data sources available. Nonetheless, the authors did not observe improvements by using this additional source of information. However, they point to the need for further research to better understand the potential value added of auxiliary information in SPREE. Moreover, to the best of our knowledge, all studies in this field assume the presence of reliable and up-to-date population margins for the areas of interest, which may not be ubiquitously available in many small-area settings.

2.2 SPREE with auxiliary information

Our contribution here is straight-forward: as demographic projections and surveys usually provide reliable population totals only for large areas (), a common-place method uses fixed population proportions captured in for distributing large-area counts to small areas:


We relax this obviously strong assumption that subnational population shares remain fixed between censuses by using auxiliary information to track population changes dynamically for small areas over time. Consequently, we replace with , so Equation 2 becomes:


where and represent the population sizes of small areas and large areas at time estimated using auxiliary data. Notice that it may be possible to rely on only to estimate . However, demographic projections and surveys capture information on population whereabouts for large areas in a statistically rigorous and reliable manner and are therefore used as regional-level benchmarks through dasymetric mapping ([Stevens2015DisaggregatingData]). This allows us to combine the benefits from both auxiliary data and statistical data. First, reliable counts are distributed to smaller areas not covered so far. Second, only information on the within-area distribution of counts are used from auxiliary information, not on the totals. This makes SPREE for small areas less prone to quality issues in the auxiliary data.

Besides administrative records, big data sources such as mobile phone metadata and satellite imagery appear to be promising auxiliary information in SPREE. For satellite imagery, five characteristics make it a suitable candidate for that: Satellite imagery provides virtually 1) global coverage at 2) high spatial resolutions for 3) frequent time intervals on 4) human-made impact in a 5) structured and - at least to some extent - harmonised way. In regard to population changes, built settlement growth, and changes in night-time lights may constitute the potentially most relevant indicators on human impact from satellite imagery. Data products from remote sensing are currently rapidly evolving in terms of quality and user-friendliness. Examples for open-access data products from remote sensing are the Copernicus program ([Union2020Copernicus:Earth]) and [].

2.3 Assessment of uncertainty

The original SPREE approach assumes that the association structure is fixed and therefore only the allocation structure provides a variation component. While [Purcell1980PostcensalDomains] propose linearization or replication techniques to account for the introduced uncertainty, [rao1986synthetic], [isidro2010intercensal] and [luna2016multivariate]

provide also analytical approximations for the variance of the estimator and other measures of uncertainty for SPREE and its extensions. In contrast, the data generating process of auxiliary data in general and specifically the population margins derived from satellite imagery is usually not known and it has to be expected that uncertainty in these data sources follows complex, non-trivial patterns. For example, satellite imagery run through multiple stages of data processing with dependencies to other data sources such as OpenStreetMaps (

[Stevens2015DisaggregatingData]) before providing population estimates. To some extent, this complexity also holds true for uncertainty in demographic projections. As we assume the uncertainty stemming from external sources like satellite imagery to be non-negligible, but also non-trivial, we opt to follow the replication idea proposed by [Purcell1980PostcensalDomains] which is also the basis of some MSE alternatives suggested in [luna2016multivariate] instead of pursuing an analytical strategy that may require strong assumptions on some of the sources of uncertainty. Therefore, we propose an empirical approximation of the uncertainty based on a semiparametric bootstrap in order to give some indication on the trustworthiness of a point estimate while acknowledging that it most likely captures only a fraction of the total variation. Since the allocation structure is constructed based on auxiliary data, population projections and survey data, a fully parametric approach is not straightforward. For this reason and to avoid misspecification issues, the column and row margin are resampled in an non-parametric way, where the column margin is provided by direct survey estimates and the row margin is derived from auxiliary information. Following existing literature (see e.g. [luna2016multivariate]

), census data is replicated following a multinomial distribution. In addition, we account for the stochastic nature of the overall population by resampling the population of each area using a Poisson distribution. This step is complementary and intended for use cases where census counts are subject to uncertainty, e.g. due to quality issues or the availability of only a census sample. To summarize, the steps to obtain the MSE of the updated estimates

via a mixed semiparametric bootstrap are as follows:

  1. [label=()]

  2. obtain the point estimate from Equation 1 using the row margin as defined in Equation 3 and compute ,

  3. for :

    1. [label=()]

    2. create population composition following a multinomial distribution, denoted by , where the row margins are obtained assuming a Poisson distribution, defined by ,

    3. generate auxiliary-based row margin by resampling from ,

    4. generate survey-based column margin by resampling from ,

    5. apply SPREE (Equation 1) for each replicate, generating updated compositions ,

  4. the MSE of is estimated as:

3 Case study: multidimensional poverty in Senegal

In 2013, the Senegalese government launched a nation-wide family security benefits program as part of a wider social safety net initiative. Officially termed Programme National de Bourses de Sécurité Familiale (PNBSF), the goal of the ongoing program is ‘to contribute to the fight against the vulnerability and social exclusion of families through integrated social protection in order to promote their access to social transfers and to strengthen, among other things, their educational, productive and technical capacities’ ([GouvernementduSenegal2017ProgrammePNBSF]). Its main component are cash transfers to disadvantaged families that in turn commit to schooling, keeping vaccination records and registering newborns and children in civil status ([ANSD2015EnqueteVulnerables1]). The cash transfers (100,000 FCFA, i.e. approximately 180 USD per year and household) are accompanied by coordination, communication, and evaluation activities. After a one-year pilot phase with 50,000 poor or vulnerable Senegalese households in 2013, the program was scaled up to 250,000 households in 2014 and onwards. A further extension to a total of 400,000 households - that is approximately every fourth household in Senegal - is currently underway. Consequently, it can be expected that poverty-related measures for small areas in Senegal will have become quickly outdated since the latest census in 2013. Even though the program efficiency is monitored including a large-scale baseline survey considering treatment and control group, this does not provide a representative view on socio-economic development in Senegal in general and for sub-groups in the 552 communes of Senegal, specifically.

The aim of the case study is to provide annual census updates of multidimensional poverty for small areas in Senegal until 2020 to inform about main drivers of multidimensional poverty for a specific sub-group: individuals living in female-headed households. To make sure our methodology adds value vis-à-vis existing approaches, we validate our proposal using censuses from 2002 and 2013. As we combine multiple sources of data in this case study, we first introduce them together with the target indicator - the headcount ratio of the multidimensional poverty index - in this section before presenting the validation results in Section 4 and the application results in Section 5. The full code for this paper is available on GitHub:

3.1 Geography of Senegal

Senegal is administratively divided in 14 régions (NUTS1), 45 départements (NUTS2), 103 arrondissements (NUTS3) and 552 communes (NUTS4). The total population was at 13.5 million in 2013 and projected to be at 16.2 million in 2020. We choose the 552 communes as target domains in the application as they represent the level of administration responsible for the provision of public services in the decentralization efforts set out in the national development plan of Senegal (Plan Sénégal Émergent) ([GouvernementduSenegal2014PlanEmergent]). Since a number of land reforms have taken place in Senegal between 2002 and 2013, the smallest common denominator in terms of administrative boundaries for analysis between 2002, and 2013 is the NUTS3 level (103 arrondissement) of the census 2002 Recensement Général de la Population et de l’Habitat (RGPH 2002). For evaluation, we consequently aggregate the data of the census 2013 Recensement Général de la Population et de l’Habitat, de l’Agriculture et de l’Elevage (RGPHAE 2013) from the 552 communes to the 103 arrondissements. Figure 1 shows an overlay of the two administrative structures.

Figure 1: Overlay of RGPH 2002 (grey) and the RGPHAE 2013 (green).
The 552 communes from the RGPHAE 2013 are perfect subsets of the 103 arrondissements of the RGPH 2002.

3.2 Multidimensional poverty in Senegal

The global multidimensional poverty index (MPI) for Senegal based on [Alkire2019The2019] is calculated considering the incidence of poverty and the average deprivation share :

The MPI varies from 0 to 1 on a continuous scale, where 1 represents the highest level of MPI poverty. As SPREE requires the target indicator to be categorical, we focus on the incidence (also called MPI headcount ratio) in this case study. The MPI headcount ratio is defined as the proportion of people identified as multidimensional poor. It is calculated based on the (weighted) number of deprivations that they experience (also called deprivation score). The MPI is composed of ten indicators measuring different dimensions of deprivation. A household with a deprivation score of at least 1/3 is considered multidimensional poor. The deprivation indicators with the respective weight are summarized in Table 1.

Dimension Indicator () Weight () Available
Health Nutrition 1/6 No
Child mortality* 1/6 Yes
Education Years of schooling 1/6 Yes
School attendance 1/6 Yes
Living standards Cooking fuel 1/18 Yes
Sanitation 1/18 Yes
Drinking water 1/18 Yes
Electricity 1/18 Yes
Housing 1/18 Yes
Assets 1/18 Yes
Table 1: Dimensions and weights for measuring the MPI.
*As data on ‘nutrition’ is not available, ‘child mortality’ receives the full weight of the ‘health’ dimension (1/3).

The percentage contribution of indicator to the overall MPI headcount ratio is defined as:

Indicator-specific contributions to the overall MPI headcount ratio are particularly interesting for policy-making as they rank drivers of multidimensional poverty along the usual responsibilities of line ministries (e.g. education, energy, health, infrastructure etc.) in a simple-to-visualize and easy-to-understand manner. For more details, we refer to [Alkire2019The2019]. Although the survey data used in this application provide information for all MPI indicators, the census data lacks of the ‘nutrition’ indicator from the ‘health’ dimension. Even though the situation when there is lack of information in the census can be addressed within SPREE as shown by [Isidro2016ExtendedPovertyb], we focus on the nine remaining indicators for the sake of simplicity. Therefore, the indicator ‘child mortality’ in the ‘health’ dimension receives a weight of 1/3 instead of 1/6 (see Table 1). The multidimensional headcount ratio () in 2013 is shown for subgroups by région in Table 2 and by commune in Figure 2.

Figure 2: MPI headcount ratio in 2013 by commune.
The unweighted commune-level MPI headcount ratio in Senegal in 2013 averages at 0.75 (white) with a minimum at 0.1 (dark brown) and a maximum at 1 (dark green).
Région Région
Dakar 51.8 50.3 Louga 69.9 57.4
Diourbel 69.7 58.5 Matam 80.6 70.7
Fatick 75.8 62.9 Saint-Louis 68.6 59.7
Kaffrine 81.9 65.5 Sedhiou 84.6 74.2
Kaolack 72.6 58.4 Tambacounda 85.6 71.8
Kedougou 81.8 72.7 Thies 64.2 55.8
Kolda 82.7 65.9 Ziguinchor 65.3 61.1
Table 2: Regional-level MPI headcount ratios by sub-group from 2013 census.
Headcount ratios are given for the overall population () and the population living in a female-headed household ().

3.3 Data sources

3.3.1 Recensement Général de la Population et de l’Habitat, de l’Agriculture et de l’Elevage (RGPHAE) 2013

Data collection for the latest census took place between November 19th, 2013 and December 14th, 2013. Unit-level data of the population and housing census with 10% of the households sampled within each of the 17,145 census districts is provided for scientific use via the microdata portal of the Agence Nationale de la Statistique and de la Démographie (ANSD) - the Senegalese national statistical office. The RGPHAE 2013 provides data for all MPI indicators but not for ‘nutrition’. Some indicators suffer of item nonreponse that apparently do not follow a missing completely at random (MCAR) scheme. Table 3

provides an overview over the missing data treatment in the RGPHAE 2013, where can be seen that the population part of RGPHAE 2013 exhibits large-scale item-nonresponse. The strong imputation effects in these indicators are somehow expected due their respective definition. A household is deprived in the indicator ‘child mortality’ if any child has died in the household. Thus, the higher the coverage of individuals within the household, the more likely it is to fulfill the criteria of deprivation. Same holds for the indicator ‘school attendance’, which defines a household as deprived if any school-aged child is not attending school up to class eight. The contrary holds for the indicator ‘years of schooling’, which defines a household as deprived if no household member has completed six years of schooling. The more household members are covered, the more likely it is for a household not to fulfill the criteria of deprivation. We address the item nonresponse by imputing missing values in a single imputation using chained equations (see e.g.

[White2011MultiplePractice]). Even though we acknowledge that imputation introduces additional uncertainty, we do not explicitly account for it as this goes beyond the scope of the paper. Consequently, we put it as further research.

Indicator % missing % deprived % deprived (imputed)
Nutrition - - -
Child mortality 65.4 27.9 53.6
Years of schooling 60.6 15.1 11.3
School attendance 60.6 40.8 57.4
Cooking fuel 0.0 65.8 65.8
Sanitation 0.0 30.8 30.8
Drinking water 0.0 15.7 15.7
Electricity 0.0 38.9 38.9
Housing 0.0 26.2 26.2
Assets 0.0 16.6 16.6
Table 3: Missing MPI data and imputation effects in the RGPHAE 2013.

3.3.2 Recensement Général de la Population et de l’Habitat (RGPH) 2002

Similar to RGPHAE 2013, the Recensement Général de la Population et de l’Habitat 2002 (RGPH 2002) data is available at the unit-level with 10% of the households sampled within each of the census districts for scientific use via the microdata portal of ANSD. As with RGPHAE 2013 data, missing values are treated by single imputation using chained equations.

3.3.3 Household survey data

In this case study, we use data from Senegalese Demographic and Health Surveys (DHS) collected for the years 2013 and beyond. The DHS program is mainly funded by the United States Agency for International Development (USAID) with the aim of provide accurate and reliable indicators of population, health, and nutrition at national, residence (urban-rural) and regional levels (NUTS1). Details on the respective survey rounds are summarized in Table 4.

Year Contract phase Data collection Sample size Clusters
2013 VI Jan-Oct 4 174 200
2014 VII Jan-Oct 8 406 400
2015 VII Jan-Oct 4 511 214
2016 VII Jan-Oct 4 437 214
2017 VII Apr-Dec 8 380 400
2018 VIII Jun-Dec 4 592 214
2019 VIII Apr-Dec 4 538 214
Table 4: Description of the DHS data set available.

Indicator definitions and/or response options slightly differ between DHS surveys and RGPHAE 2013 leading to differences in the results as shown in Figure 3 and Table 5. For example, while the DHS surveys provide data on the time travelled for safe drinking water, RGPHAE 2013 does not. Therefore, we do not account for the time travelled in the respective indicator even though it has a major impact on the deprivation score for drinking water in Senegal. Nevertheless, regional-level correlations between indicators of the DHS 2013 and RGPHAE 2013 are generally quite strong with an average of 0.68 ranging from 0.40 for ‘assets’ to 0.99 for ‘cooking fuel’.

Indicator RGPHAE DHS
2013 ’13 ’14 ’15 ’16 ’17 ’18 ’19
Child mortality 53.9 41.4 39.9 39.4 36.0 37.6 36.1 32.2
Years of schooling 10.0 36.2 36.1 40.0 38.6 33.0 34.8 37.0
School attendance 64.4 80.8 79.7 74.4 76.5 76.8 76.8 77.9
Cooking fuel 75.8 81.0 78.7 79.0 74.0 22.5 20.9 21.3
Sanitation 35.2 42.1 41.1 45.4 41.4 36.5 30.5 31.8
Drinking water 17.5 24.1 21.7 25.2 23.2 22.5 21.9 23.4
Electricity 43.3 47.0 43.3 41.7 37.0 38.3 33.8 29.4
Housing 27.9 26.5 26.9 22.4 22.5 21.8 14.8 15.1
Assets 12.4 11.5 11.0 12.5 12.4 11.5 10.0 10.8
MPI headcount 68.4 69.0 67.9 67.5 64.7 59.3 58.1 57.8
Table 5: Overview over indicator-specific headcount ratios by data source.
Figure 3: Distribution of household-level deprivation scores by data source.
Cutoffs are 0.2 (vulnerable to multidimensional poverty), 0.33 (MPI-poor) and 0.5 (severely MPI-poor), respectively.

Differences between the categories available to create the MPI indicators lead also to differences in the distribution of the deprivation scores when comparing RGPHAE 2013 and the DHS 2013. For instance, while the census asks whether public sanitation facilities are used, the DHS asks whether sanitation facilities are shared (which could also occur privately among a small number of households).

3.3.4 Demographic projections

Demographic projections inform about the possible future demographic composition of a country. Usually, demographic projections take the latest census as base population and use hypotheses on the future development of fertility, mortality and migration in a country. In Senegal, demographic projections are produced by ANSD disaggregated by 5-year age groups, gender, region, and urban status based on RGPHAE 2013 using the software SPECTRUM following the cohort component method ([united1956manual]):

Population projections are made on the regional level, which are then distributed via fixed population shares determined during the census year to lower administrative levels (proportion method ([united1956manual])). As reliable regional-level data on migration is not available, the migration pattern captured in the census is treated as static over time. For estimating current fertility and death trends, the ANSD uses regional-level survey data, usually from the latest DHS. For details on the methodology applied to create population projections in Senegal, we refer to [AgenceNationaledelaStatistiqueetdelaDemographie2013Rapport2013-20631].

3.3.5 Satellite Imagery

In this case study, we use satellite imagery provided by [] for estimating dynamic subnational population shares on a frequent basis. The idea behind this is to account for population shifts occurring within regions that are so far not captured via the proportion method. By using the Dakar région as example, Figure 4 shows the evolution of the population shares from 2002 to 2013. Figure 3(a) shows how the population shares have changed based on population counts from RGPH 2002 and RGPHAE 2013, and Figure 3(b) shows changes in the population shares based on RGPH 2002 and satellite imagery-derived population estimates from WorldPop. The proportion method (not depicted) assumes fixed population shares, i.e. horizontal lines, across years.

(a) RGPHAE 2013
(b) Satellite imagery-based population estimates
Figure 4: Development of the small-area population shares between 2002 and 2013 for the Dakar region.
The within-region population shares always add up to 100%.

WorldPop data are provided in tagged image file format (TIFF) with a pixel representing roughly a 100m 100m grid square in an open data repository under CC4.0 licence ([]). We aggregate the pixel values via their centroids to the respective administrative areas in Senegal. For discussions on the impact of allocation uncertainty across different area systems we refer to [Koebe2020BetterModelling] and [Gro2020SwitchingBerlin]

. While WorldPop already provides population estimates for Senegal in their open data repository, we opt to slightly modify their approach to align it with our case study. The reasons for this are two-fold: First, the original approach maps RGPHAE 2013 data dasymetrically via a pixel-level weighting layer derived from a hierarchical random forest model trained on départment-level data from the same census and satellite imagery-based covariates from 2013. As Figures

4(a) to 4(d) show, accuracy of the satellite imagery-derived population estimates steeply decline for small areas, hinting at little spatial robustness of the underlying estimation method. Second, temporal trends in the relationship between population numbers and satellite imagery-based covariates are not accounted for as model training took place at one point in time. For example, while there were on average 1.41 inhabitants per pixel identifying a built-settlement in 2002, this number reduced to 0.88 in 2013. While reasons for that could be manifold (e.g. growth in buildings not used for housing or in per capita housing space) it shows that this could lead to an under- or over-estimation of the population count in 2002 depending on the direction and size of the coefficient.

(a) Régions (14)
(b) Départments (45)
(c) Arrondissements (103)
(d) Communes (552)
(e) Régions (14)
(f) Départments (45)
(g) Arrondissements (103)
(h) Communes (552)
Figure 5: Accuracy of satellite imagery-based population estimates across geographies.
Satellite imagery-derived population estimates (predicted) are compared with RGPHAE 2013 population counts (actual) for original WorldPop estimates trained on département-level (a-d) and recalculated estimates trained on the commune-level (e-h).

In contrast, the validation study sets out to update RGPH 2002 tables to 2013, so training needs to take place on 2002 data to make the dynamic population shares comparable to the fixed approach. Moreover, as modelling exercises generally benefit from additional data points, we extend the training phase from originally 45 départements to 103 arrondissements and 552 communes in the validation study and the application, respectively. Comparing Figures 4(a) to 4(d) with Figures 4(e) to 4(h) illustrate nicely that this holds true in our case study as well. For further details on the covariates, the preprocessing of the satellite imagery-derived indicators and the hierarchical random forest model used in WorldPop, we refer to [Stevens2015DisaggregatingData] and [Fick2017WorldClimAreas]. Table 6 summarizes the different approaches.

Approach Training level Prediction level Distribution level
WorldPop Département, 2013 Pixel, 2013 Département, 2013
Validation Arrondissement, 2002 Arrondissement, 2013 Région, 2013
Application Commune, 2013 Commune, 2013-20 Région, 2013-20
Table 6: WorldPop population estimates and their modifications.

For the validation study, we modify the original approach to fit the time window 2002 to 2013 and to make the results comparable with the proportion method and the application setup. Specifically, we use arrondissement-level data from 2002 (RPGH 2002 and satellite imagery-based covariates from 2002) for fitting the hierarchical random forest model. This way, we make better use of more fine-granular training data and we align the time window for training of the dynamic and the fixed approach. In a next step, we predict the weighting layer for the arrondissement-level of the year 2013 using satellite imagery-based covariates from 2013 and use it for distributing regional-level RGPHAE 2013 population counts to the arrondissements. This allows us to account for both spatial and temporal uncertainty in the evaluation. In addition, initial tests indicate that creating the weighting layer for very small areas, i.e. pixels, introduces much instability also seen in large-area aggregates (see Figure 5). For example, population estimates for 2013 using the validation setup described above with an arrondissement-level weighting layer correlates highly at with the RGPHAE 2013 population counts. In contrast, population estimates derived the same way, but with a pixel-level weighting layer correlate at . Therefore, we strike a balance here between spatial disaggregation and robustness by opting for small areas for which we can do both training and testing on. In addition, we compared the accuracy of the population estimates from the hierarchical Bayes approach proposed by [leasure2020national] with the results from the hierarchical random forest, however, without improvements in accuracy.

For the application, we use the approach from the validation study, but apply it on the commune-level for the years 2013 to 2020. Specifically, region-level population projections are mapped to communes using a commune-level weighting layer derived from a hierarchical random forest model trained on commune-level RGPHAE 2013 counts. Not all covariates are available on an annual basis until 2020. In these cases, we consider the latest data available for weighting layer prediction.

4 Validation

The contribution of this paper is to allow subnational population shares to vary over time (in contrast to fixed shares via the proportion method). For example, 2% of the population of area A lives in small area A1 during the first census. The proportion method assumes that also in the following years 2% of the population of area A will live in small area A1, regardless of the actual situation on the ground. Thus, the subnational population shares are considered as fixed. In contrast, we propose to use auxiliary data to capture subnational population shifts during census years. Thus, the subnational population shares can vary over time; we consider them to be dynamic. Arrondissement-level population shares within regions have changed significantly between 2002 and 2013 with growth rates ranging from -91% to +48% with the absolute change averaging at 15.7%. Figure 3(a) illustrates the true changes in population shares for the ten arrondissements within the région Dakar. Fixed population shares would require the lines to be horizontal, however, this assumption clearly does not hold. The question then is if auxiliary information, in our case satellite imagery, are capable to capture that change more accurately and whether better accuracy translates into the expected improvements for census updating. Concerning the first point, Figure 3(b) gives initial indication for Dakar that this might be the case.

To examine the overall value added of the proposed methodology we use two census periods in which we update the earlier census (denoted without a time index) to the year of later census (which we denote with a time index ) and evaluate our contribution vis-à-vis existing approaches. Specifically, we update the arrondissement-level MPI headcount ratio disaggregated by sex of the head of household extracted from RGPH 2002 to 2013. To do so, we use the DHS 2013 for updating the column margins and RGPH 2002 and population estimates from satellite imagery for updating the row margins with fixed and dynamic population shares, respectively. The updated MPI headcount ratios are evaluated against results from RGPHAE 2013 in terms of relative bias and relative root MSE (RMSE) across arrondissements using simulation rounds. Therefore, we generate replicates of the census composition from RGPH 2002 assuming an independent multinomial distribution in each small area with parameters and . The population row margin is represented as and is obtained using a Poisson distribution for each small area independently. The second parameter is defined as . This same process is applied to the composition from RGPHAE 2013 () that we use for evaluating the updated MPI headcount ratios. Since the idea is to compare the performance of SPREE using the dynamic approach (Equation 3) in contrast to the fixed approach (Equation 2), row margins are generated in two ways: For the fixed approach, we calculate the arrondissement-level population count of 2013 using regional-level population counts from demographic projections and arrondissement-level population shares from each census composition replicate as described in Equation 2. For the dynamic approach, we generate replicates of the population estimates created using the hierarchical random forest approach described in Section 3.3.5. Following Equation 3, we use the estimated population shares to distribute regional-level population counts to get arrondissement-level population counts for each simulation round . Then, we obtain replicates of the column margin by simple resampling from each primary sampling unit of the DHS 2013. Finally, we use , and the respective to generate arrondissement-level point estimates of the MPI headcount ratio in 2013, which we then compare against from RGPHAE 2013 using the relative bias and empirical RMSE defined as:


Naturally, fixed population shares provide very good accuracy for areas exhibiting little change and poor accuracy for areas exhibiting large fluctuations in the inter-censal population. This pattern indicates that dynamic population margins may be able to improve estimates for areas with large changes in population shares. Table 7

supports both hypotheses by grouping the arrondissement-level accuracy measurements into quartiles along the true population change.

Quartile Relative Bias (in %)
dynamic fixed
Lowest 17.95 28.99
2nd 18.41 9.98
3rd 7.57 -5.44
Highest 0.68 -19.60
Table 7: Accuracy of estimated population shares in 2013.
Arrondissements ordered by changes in the true population share between 2002 and 2013. The accuracy is evaluated against the true population shares from RGPHAE 2013.

For example, for the 50% of the 103 arrondissements with the smallest changes in population shares between 2002 and 2013 (2nd and 3rd group), the fixed approach outperforms the dynamic approach. For the arrondissements exhibiting the strongest changes in population share, the dynamic approach provides better accuracy than the fixed approach. A potential reason why both approaches, but especially the dynamic approach, exhibit a tendency to overestimate the changes in the population shares is that the underlying population counts are naturally left-censored at zero. This may induce an upward bias in the population estimates, especially for sparsely populated areas and for population margins with comparatively high volatility.

In total, considering satellite imagery-based population estimates for updating the population margin in SPREE provides more accurate population shares for 42 of the 103 arrondissements in this validation study. Therefore, we see indication that satellite imagery-based population estimates can improve existing approaches used in census updating for deriving small area population margins.

One limitation here is that the knowledge on the question of which small areas show large changes of subnational population shares is usually not known ex-ante. Intuitively, one approach to overcome this could be to use dynamic population shares for areas for which the estimated changes are large. However, volatile satellite imagery-based population margins make this an imperfect selection criteria. Therefore, we opt for a more conservative approach in the application by selecting those small areas for which to use the dynamic population shares that are situated within those 25% of regions showing the strongest population changes following the regional-level estimates from the official population projections.

Figure 6 and Table 8 show the results of the simulation. The overall correlation of the estimated MPI headcount ratio for 2013 and the true headcount ratio of 2013 as depicted in Figure 6 is moderate for both proportion types. While the Pearson correlation coefficient reflects the overall results from Table 7, the differences between the dynamic and fixed approach are less striking. A potential reason is that SPREE builds on multiple inputs (represented by the allocation structure and the association structure), so changes in one of the inputs may not affect the SPREE output to the same extent.

Figure 6: Correlation of estimated and true arrondissement-level MPI headcount ratio in 2013.
Arrondissements are grouped along the quartiles of the changes in true population shares between 2002 and 2013.

Looking at the relative bias and the relative RMSE in Table 8, the dynamic approach performs better in most of the cases. The comparatively higher relative bias in the second group can potentially be explained with zero counts in the underlying census composition. As SPREE performs less well in the presence of zero counts and as half of the zero counts in the census composition are located in the second group a deterioration of model performance can be expected.

Relative bias (in %)
Quartile Type Q2.5 Q25 Median Mean Q75 Q97.5
Lowest Dynamic -29.55 -11.94 1.23 -1.96 7.60 22.13
Fixed -28.92 -10.39 3.94 -0.24 8.64 22.87
2nd Dynamic -14.00 4.69 15.79 15.32 27.94 43.22
Fixed -13.57 5.05 16.87 16.02 28.55 43.38
3rd Dynamic -24.44 -0.60 2.58 6.89 16.88 36.99
Fixed -24.15 0.19 4.40 7.74 17.18 38.17
Highest Dynamic -30.54 -6.72 1.74 3.82 18.93 34.55
Fixed -29.83 -4.28 5.50 5.58 19.37 34.83
Relative RMSE (in %)
Quartile Type Q2.5 Q25 Median Mean Q75 Q97.5
Lowest Dynamic 7.99 13.15 17.42 18.70 23.82 32.71
Fixed 8.44 12.76 17.42 18.23 21.96 32.12
2nd Dynamic 9.12 13.81 22.51 24.09 34.36 50.96
Fixed 8.76 14.23 22.55 24.25 34.60 51.02
3rd Dynamic 5.91 10.02 18.36 19.99 28.46 42.52
Fixed 5.82 9.93 18.40 20.04 28.55 43.25
Highest Dynamic 9.06 16.03 23.16 23.59 30.25 43.14
Fixed 8.90 16.73 22.61 23.64 32.06 43.54
Table 8: Performance of the 2013 census updates for the MPI headcount ratio in Senegal, across arrondissements.

A possible reason why the differences between the dynamic and the fixed approach are less striking in terms of relative RMSE is that the resampling procedure of the respective population margins differs. While the fixed population margin is solely derived from census data, which is resampled following a Poisson and a Multinomial distribution in our setup, the procedure to obtain the dynamic population margin based on satellite imagery considers a simple resampling in order to account for potential non-trivial uncertainty in the pre-processing. This may lead to efficiency losses in the dynamic approach.

In summary, the results of the validation study show that the dynamic approach using satellite imagery from WorldPop as auxiliary information can improve the accuracy of both the population margin (relative bias) and the census updates (in Pearson correlation coefficient), especially for areas with large changes in population shares. For those areas with minor changes in their population shares, the fixed approach is recommended.

5 Application

In this section we present the results of our case study where we produce updated MPI headcount ratios for all 552 communes in Senegal for the years 2013 to 2020. We apply the semiparametric bootstrap procedure with explained in Section 2.3 to obtain uncertainty measures for the estimated MPI headcount ratios.

Has the multidimensional headcount ratio of individuals living in female-headed households improved more or less than the one in male-headed households since 2013? As shown in Table 2, individuals living in female-headed households in Senegal are less likely to be multidimensional poor than individuals living in male-headed households. Consequently, it is expected that female-headed households may profit on average less from the social safety net program PNBSF. Figure 7 depicts the difference of growth rates in the MPI headcount ratio between individuals living in female-headed households and individuals living in male-headed households using the year 2013 as baseline. The figure and the national-level point estimates of the MPI headcount ratio in Tables 9 and 10 confirm that the MPI headcount ratio of female-headed households has improved on average less since 2013 than of male-headed households (represented by the shades of brown in Figure 7).

(a) 2013 (b) 2014 (c) 2015 (d) 2016 (e) 2017 (f) 2018 (g) 2019 (h) 2020
Figure 7: Evolution of the updated MPI headcount ratio of individuals living in female-headed households vs. male-headed households.

Table 9 shows that the estimated MPI headcount ratio for female-headed households has been constantly decreasing from 2013 to 2020, with a greater reduction between 2016 and 2018. As of 2018, the incidence of multidimensional poverty of this specific population in Senegal has remained stagnant (61%). The corresponding coefficients of variation (CV) indicate that the updated incidence of multidimensional poverty are acceptable considering traditional thresholds (below 20%).

Year H Coefficient of variation (in %)
(est.) Q2.5 Q25 Median Mean Q75 Q97.5
2013 0.70 0.13 5.26 9.67 12.68 14.22 29.18
2014 0.69 0.08 3.77 7.19 7.55 10.04 21.21
2015 0.69 0.08 4.46 8.08 9.25 12.01 20.94
2016 0.69 0.08 4.26 8.18 9.11 12.93 21.42
2017 0.65 0.06 3.80 6.69 6.93 9.49 16.12
2018 0.61 0.09 6.33 10.81 10.69 14.18 24.52
2019 0.61 0.13 6.52 10.62 11.08 13.88 28.72
2020 0.61 0.12 6.62 10.52 10.98 13.89 27.65
Table 9: Estimated headcount ratio and its coefficient of variation for female-headed households, across communes.

In comparison, the MPI headcount ratio for male-headed households (see Table 10) decreased by 13 percentage points over the same period of time showing considerable smaller CVs than the female-headed households. This result is expected since the the sample size of male-headed households is considerably larger (approximately 77.7% of Senegalese households are headed by men).

Year H Coefficient of variation (in %)
(est.) Q2.5 Q25 Median Mean Q75 Q97.5
2013 0.77 0.66 2.17 3.95 4.43 6.29 9.77
2014 0.77 0.51 1.61 2.85 3.38 4.96 7.67
2015 0.77 0.61 2.08 3.74 4.44 6.64 9.91
2016 0.74 0.58 2.44 4.40 4.96 7.38 10.90
2017 0.68 0.84 2.56 3.94 4.20 5.97 8.11
2018 0.66 0.86 3.32 5.48 6.03 8.93 12.14
2019 0.64 1.42 4.06 6.13 6.45 8.86 12.32
2020 0.64 1.45 4.06 6.11 6.50 8.99 12.48
Table 10: Estimated headcount ratio and its coefficient of variation for male-headed households, across communes.

Looking at the indicator-specific contributions to the MPI headcount ratio at a national-level, decreases in multidimensional poverty are mainly due to improved living conditions (see Figure 7(a)). Notably, deprivation in the indicator on ‘years of schooling’ has increased for individuals living in female-headed households with ‘school attendance’ remaining a main driver of multidimensional poverty.

In order to drill down to local realisations of national-level trends, we apply SPREE following the hybrid approach laid out in Section 4, i.e. apply the dynamic approach in those areas with bigger changes in their population shares and the fixed approach for those with less changes. Specifically, the regions Kaffrine, Matam and Tambacounda show the highest population change. Thus, we apply the dynamic margins for the 105 communes located in these regions. Figure 7(b) illustrates the development of indicator-specific contributions to the commune-level MPI headcount ratio of the commune ‘Yoff’ in the region Dakar from 2013 to 2020 as an example. While in Yoff living conditions as drivers of multidimensional poverty steadily declined over the years similar to the national trend, also school attendance has improved, however, still remaining the biggest driver of commune-level MPI poverty.

(a) Senegal
(b) Yoff, Dakar, Senegal
Figure 8: Indicator-specific contributions to the uncensored MPI headcount ratio for individuals living in female-headed households in Senegal and in the commune Yoff located in the region Dakar, specifically.

In this setting neither direct nor indirect evaluation of commune-level census updates is possible due to the lack of appropriate commune-level evaluation data after the census year 2013 on the one hand and due to the fact that SPREE results reproduce region-level survey results by design on the other.

6 Conclusion

This study has demonstrated in a case study how auxiliary information such as satellite imagery can be used to potentially improve the availability of relevant key statistics for small areas between census years by strengthening the population margins especially in regions exhibiting strong population growth. This is particularly important for local-level policy-making as censuses and surveys fail to provide reliable and up-to-date data on small areas. This study has laid out a path to fill this gap by proposing a unified framework to combine established census updating techniques with recent approaches that add spatial granularity to existing data using novel data sources. We discussed both power and limitations of the proposal by providing annual MPI headcount ratio updates including indicator-specific contributions for the years 2013 to 2020 in the 552 communes of Senegal.

First and foremost, the value added vis-à-vis existing census updating methodologies is largely determined by the quality of the auxiliary data. The simulation showed that conveniently preprocessed remote sensing products such as the population grids including the corresponding covariates currently still lack the spatial robustness needed to add value to existing approaches. While this challenge is likely to be solved in the mid-term by the rapidly evolving market of remote sensing products, it may require considerable effort in the short-term to overcome this quality gap.

Second, rigorously addressing the uncertainty involved in the estimation process demands further research as for many promising auxiliary data sources the data generating process is beyond the control of a national statistical office. In addition, census and survey inputs sometimes require imputation to address high levels of non-response. While we took a first step in this paper to capture parts of the uncertainty by resampling the inputs, it would be interesting to incorporate uncertainty in the explanatory variables and effects from (multiple) imputation more formally.

Third, as we showcased in the case study, slightly varying definitions of the indicator of interest across data sources may induce additional uncertainty into the updating process, so harmonisation along the lines of [Isidro2016ExtendedPovertyb] is clearly advised. This approach could also be used to obtain a complete MPI when some of the components are not present in the census, which is the case in the setup of this paper.

Fourth, under the SPREE approach, census, survey and auxiliary data must have the same structure in terms of the disaggregation dimensions. However, considering that a census is conducted in long intervals, accounting for re-structurings such as land reforms that alter the administrative delineation, takes relevance. High-resolution population grids allow for a re-allocation of the population across different geographies under certain assumptions, however, a more general approach is desirable.

Fifth, the applicability of the SPREE could also increase significantly by allowing for non-categorical indicators and non-linear relationships. Here, copulas as used in [Rocher2019EstimatingModels] may provide a powerful vehicle to generalize SPREE further.

Conflict of interest

The authors have declared that no competing interests exist.