In 2015, the United Nations set forth seventeen objectives to “end poverty, protect the planet and ensure prosperity for all” known as the Sustainable Development Goals (SGD) (UN, 2015)
. To monitor progress and ultimately achieve these objectives, accurate population statistics are essential. It is estimated that currently 300-350 million people worldwide are not included in their country’s official population document, which hurts the measurement of SGD progress(Carr-Hill, 2013). The ability to quickly and cost-effectively produce an accurate population map for a country has a multitude of benefits. Those missing populations are more likely to be marginalized communities which already do not receive sufficient resources from the government (UNICEF, 2016). An accurate population distribution is an essential basis for socioeconomic statistics, such as food, water, and energy demand in different regions of a country, which influence the policy-making and spending decisions of its government. Additionally, during natural disasters such as earthquakes and floods, an accurate population map can help organize rescue efforts more quickly and effectively. For regions with high infectious disease rates, a fine-grained population map also helps to prevent the spread of infectious diseases to locations with dense population (Tatem et al., 2012; Hay et al., 2005).
However, creating a population map with high accuracy and high resolution is a challenging problem. Traditionally, it is done by performing a high-cost national census. The USAID Demographic and Health Survey (DHS) program performs surveys for developing countries typically every 5 years, and each survey costs anywhere from 1.1 million to 9.7 million USD (Doupe et al., 2016). The census surveys are even more expensive in developed countries like Europe, with a median cost of USD 5.57 per capita in 2010 (UN, 2014). For some countries with financial difficulties or political instability, the census is carried out less frequently, as few as once every few decades (UN, 2016a, b). Reliance on out-of-date population statistics can lead to significant errors if used for policy making or resource allocation.
In this project, we aim to predict the population density of rural villages of India from high-resolution satellite imagery by utilizing Convolutional Neural Network (CNN) models. With the availability of high-frequency satellite images, we can predict population density every few days, saving the costs of on-site census surveys and avoiding the inaccuracies caused by the infrequency of census surveys. We demonstrate state-of-the-art prediction performance in villages of all states in India. By using satellite images with 10-30 meter resolution, our best models can predict aggregated village population in one Subdistrict (akin to a US county) with of 0.93, and individual village population density with of 0.44.
2. Related Work
2.1. Traditional Methods
Traditionally, population mapping is divided into two approaches, population projection and population disaggregation. Population projection predicts the future or current population of a region based on historical data. For most cases, simple linear regression is sufficient for the projection(Smith, 1987). In more complex models, projections take into consideration historic population data, birth rates, registered vehicles, etc (Schmitt and Crosetti, 1954). These models were used to project US county population in 5 years, which have very high accuracy with of 0.99. However, they don’t provide information about the population distribution within each county.
The more challenging task is population disaggregation, which involves estimating the population distribution of a region given the total population. The most basic method is areal weighting/ interpolation, which assumes a uniform distribution across the region with a single population value(Goodchild et al., 1993). The Gridded Population of the World (GPW) uses areal weighting with a resolution of 30 arc-seconds (approximately 1 km at the equator) (CIESIN, 2010). There are also many tools which implement a weighted surface for estimating a population’s distribution, a technique called dasymetric weighting (Robinson et al., 2017). The Global Rural Urban Mapping Project (GRUMP) uses nightlight imagery to add urban and rural boundaries to GPW (Schneider et al., 2009). LandScan estimates the weighted surface (with 30 arc-seconds resolution) for population distribution based on land cover, roads, slope, urban areas, village locations (ORNL, 2011). AfriPop, AsiaPop, and AmeriPop are similar but for region-specific population disaggregation calculations, and they are combined in the 2013 WorldPop project (Worldpop, 2013), which has a higher resolution of 100 meters. These disaggregation methods are compared visually in Figure 1.
2.2. Machine Learning Methods
In addition to the above traditional GIS approach, machine learning algorithms have been proposed in recent years to obtain better population disaggregation results. A random forest approach was used to estimate the population at 100m resolution for Vietnam, Cambodia, and Kenya, using features similar to LandScan(Stevens et al., 2015). The Facebook Connectivity Lab used a tailored CNN model to detect man-made structures from satellite imagery with 0.5m resolution, which achieved average precision of 0.95 and recall of 0.91. They then redistributed the population in GPW evenly to the areas covered by human-made structures, and create population maps with 30 meter resolution for 18 countries, not including India (Tiecke et al., 2017).
Instead of disaggregation based on population estimates from census surveys, some CNN models are trained to estimate population directly from satellite imagery inputs. Doupe et al. combined Landsat-7 satellite imagery with (DMSP/OLS) nighttime lights as CNN input, and predicted the log normalized population density for an area of 8km (called LL-raw), where the ground-truth label was average population density of Sublocation (akin to a US county). The outputs were then converted into weights and used to create a weighted population density surface across the country with a single known total population (LL-distributed). The model was trained with 2002 Tanzanian Enumeration Areas and tested with 2009 Kenya Sublocations. During the test phase, the estimated population densities were averaged at the Sublocation level, and then compared to other methods. The results show that LL-raw has better accuracy than GRUMP and GWP estimates, and that LL-distributed outperforms a random forest model by 177% on RMSE (Doupe et al., 2016).
, but changed the model output from regression to classification of the power level of population. The model only uses Landsat imagery as input, and predicts population in the US with US Census Summary Grids data as ground-truth labels. The study divided the country into 15 regions, and trained an individual model for each region. The raw output feature vectors of the CNN model were first converted into population values for each input image, and then summed at the county level to produceCONVRAW
. The outputs of the CNN were also fed into a second layer gradient boosting model to get an improved population estimate for each county calledCONVAUG, where the census county population was used as labels. The results of CNN models achieved more than 0.9 against the ground truth, but still cannot perform better than the US government estimate based on historical census data (Robinson et al., 2017).
3.1. Population Dataset
Our ground-truth Indian population dataset comes from a census survey Socio-Economic Caste Census in the year 2011 (SECC, 2011). It includes more than 500,000 rural villages, covering 32 states, 619 districts, and 5724 sub-districts. It also provides the area of each surveyed village. Figure 2 shows the distribution of areas follows a power law. These areas are used to calculate population density for each village. Similar to previous papers, we log normalize population density values with base 2, because most villages have small population density, and only a few have large density. The original density input may cause the model to have less ability to predict villages with higher population density. The distribution of villages density after normalization is shown in Figure 3.
Inspecting the population datasets we observed some outlier values, such as a 100 kmarea with just one person. We assume these are due to data collection and handling mistakes, therefore we remove 1% of village data that had extreme population density values to prevent the model training from being affected by those outliers. More specifically, villages with top 0.5% highest density and bottom 0.5% lowest density were removed from the datasets.
3.2. Satellite Imagery
For each village, we prepare one image from each satellite imagery source such that the village center is found at the image center. The village population depends on its area but our images have fixed size covering the same area, therefore we use population density as the output. We obtained 2 sets of satellite imagery from 2011, the same year the survey was conducted. The first set is from Landsat-8, an updated satellite from Landsat-7 whose images are also used by papers from Robinson et al. in (2017) and Doupe et al. in (2016) (USGS, 2013). In contrast with previous papers which use most bands of Landsat, we use only Red, Green, and Blue (RGB) bands. The resulting images show the target regions in the same colors that humans see, and have 30-meter resolution. The second set is from Sentinel-1, a radar imaging satellite that measures ground surface reflectance, thereby capturing roads and roofs more accurately (due to their higher reflectance than natural land) (ESA, 2014). This is a new dataset not used in the previously mentioned papers. Sentinel-1 images have 10-meter resolution, and the raw channel values are converted to visualized RGB images to match Landsat-8. Both sets of images are converted to JPEG format from raw GeoTIFF files, which enables easy visualization and compression. Figure 4 displays examples of these images.
Since Landsat-8 and Sentinel-1 have different spatial resolutions, they are cropped to cover the same area. Based on the village area distribution as shown in Figure 2, the covered area of images is determined to be 20.25 km (4.5km 4.5km), so that more than 95% of villages are contained within a single satellite image.
3.3. Dataset Partition
We split the dataset of ground-truth population (outputs) and images (inputs) pairs into 70% training and 30% validation partitions. We took additional measures to avoid the overlap of training images with validation images, which may affect the reliability of the validation split. Overlaps are very likely because the total area of all satellite images (over 10 million km) we obtained is much larger than the total area of India (about million km). To address this issue, we partition the data at the subdistrict level. We split all subdistricts into 4007 training and 1717 validation subdistricts. The training partition only has images of villages that belong to training subdistricts, and similarly for the validation partition. However, it is still possible that images overlap along the boundary of two adjacent subdistricts, contaminating the split. Thus, we remove additional images from the training partition. We say a pair of images overlap if the distance between their centers is closer than half of the height or width of the image (2.25km). Approximately an additional overlapping images are removed from the training partition.
We propose a deep learning approach that uses satellite images as inputs to a Convolutional Neural Network (CNN) to predict population density. The ground-truth population data is used as the label to train the model. We start with the VGG16 architecture (Simonyan and Zisserman, 2014)
, using as input either Landsat-8 or Sentinel-1 RGB images. We use the implementation of VGG 16 from the TensorFlow Slim Library. To adapt the model for a regression problem, the model output size is set to be 1 for a single
population value. Additionally, the loss function in the model is changed to Mean Squared Error (MSE). Image inputs are resized to(width height
channels), and we apply image augmentation including random cropping and flipping during training. The model weights are initialized with pre-trained weights from the ILSVRC-2010-CLS ImageNet classification dataset, omitting the last (classification) layer.
To fully utilize multiple satellite image sources in this study, we design two custom CNN architectures. The first custom architecture, called Shallow Combo, is shown in Figure 5
. It is a modified version of VGG16 where Landsat-8 and Sentinel-1 images are input into the model separately as two branches. In each branch, two convolution layers and one max pooling layer (CONV1 in VGG16) are applied to each image. The two branches are then concatenated along the channel dimension. 11 convolutions filters are applied to halve the number of channels after concatenation, which fits the input shape of next convolutional layer. The rest of the VGG16 layers are applied to the merged branch. After the last fully connected layer of VGG16, one more fully connected layer of size 100 is added to the network. The layer is lastly used to predict the output population density. During training, weights of all convolutional layers and the first two fully connected layers are initialized with pre-trained weights from the basic VGG16 architecture trained on ImageNet.
The second custom CNN called Deep Combo is an improved version of Shallow Combo, also shown in Figure 5. The branches of Landsat-8 and Sentinel-1 are fed through all convolutional layers of VGG16, and concatenated along the channel dimension before the first fully connected layer. The same convolutions filters are applied to obtain the same input shape before the fully connected layer. ImageNet weights are also used for initialization until the second fully connected layer.
5. Experimental results
5.1. Evaluation Methods
We evaluate each model on the validation set on two levels: raw village level and aggregated subdistrict level. The raw village level predictions are per-village population density estimates from the model, which are compared with the ground-truth census population density of each village. This raw evaluation represents the most fine-grained comparison possible in our dataset. At the aggregated subdistrict level, the raw per-village density predictions are converted to the total village population using the village area from the survey. The village populations within each subdistrict are aggregated to a total population for that subdistrict. Evaluations at the aggregate levels are performed in previous papers, which we are going to compare with. In short, the village level evaluation uses population density values, while subdistrict level evaluation uses real population numbers.
In terms of evaluation metrics, we compare different models mainly based onand Pearson Correlation.
measures how much true variance is captured by the model, and attains a perfect value at 1. If a model makes constant average predictions, it will havescore of 0. Pearson Correlation measures the linear correlation between the predicted and true values, implying a total positive correlation when +1 and a total negative correlation when -1. We also use other metrics to compare results with previous papers, such as MAPE (Mean Absolute Percentage Error), %RMSE (percent Root Mean Squared Error).
5.2. Baseline: LandScan
To establish a baseline comparison for this project, we compared our ground truth population densities with those of LandScan, which is considered a community standard for global population distribution (ORNL, 2011). LandScan is a grid containing a numerical population estimate for every cell of size 30x30 arc-second in earth coordinates, which is around 1km on the equator (smaller at different latitudes). This estimate represents an average (over 24 hours), or ambient population distribution (not just sleeping location). For a single village, we take a square of 30 arc-second cells which most closely matches the villages area, as well as the village latitude and longitude. The population density of the village is calculated by dividing the LandScan population estimates within the cells with the area covered by the cells.
We use this method to obtain the LandScan estimate of population density for all villages in our validation dataset. The evaluation results are shown as LandScan model in Table 2. It has good performance in estimating aggregated population on subdistrict level, but it doesn’t perform well on per-village level. It is reasonable because LandScan is a traditional population disaggregation method using only the general features on the maps, such as rivers, roads, land cover types. It does not have fine-grained information as our models have from satellite imagery.
5.3. Single State Training
Due to the large amount of data, we initially generate a smaller dataset with villages in a single state, Gujarat. The small dataset contains around 13,000 villages, which allows us to compare different models quickly, and to tune the hyper-parameters of the CNN models. We experiment with different input sources and the different CNN architectures described above. We train the basic VGG-16 separately with Landsat-8 (L8) and Sentinel-1(S1) inputs. The L8 and S1 inputs are also concatenated along the image channels and feed into VGG-16 as single input (S1L8-Concat). Furthermore, Shallow Combo and Deep Combo architectures are used to process two type of satellite images separately. We use NVIDIA Tesla P100 GPU with 16G memory from Google Compute Engine for training. It sets the limit of input batch size to 48 for the Deep Combo architecture, thus this batch size is also used for other models to ensure a fair comparison. With several iterations, the optimal hyper-parameters are found as follows: learning rate: , exponential learning decay rate: , weight decay: 5x10, dropout: 0.8.
|Model||Village level||Subdistrict level|
The evaluations of the above 5 models are shown in Table 1. In general, the models have lower accuracies when predicting per-village level population densities, but the accuracy significantly improves when the individual predictions are aggregated for the subdistrict level evaluation. The S1 model has better performance than the L8 model, possibly due to its high resolution as well as that measuring ground reflectance by radar is more effective. However, simply concatenating L8 and S1 inputs in L8S1-Concat does not lead to further improvements, resulting in performance slightly worse than using S1 alone. This is likely due to training difficulties when using a combination of L8 and S1 images. In contrast, our Shallow Combo and Deep Combo architectures have significantly better performance. These results indicate that when using inputs from different sources, the CNN model needs to process them separately to extract their semantic features, so that useful information from both sides can be utilized.
5.4. All States Training
After training on a single state, we move on to train on around 350,000 villages in all 32 states in India. L8, S1, Shallow Combo and Deep Combo are trained respectively with the same hyper-parameters found when training in the single state case. The evaluation results are summarized in Table 2. With more data provided, all the models show significant improvements. All models exceed the performance of the LandScan baseline. Shallow Combo and Deep Combo models still outperform the basic VGG-16 models with a single image source. By comparing between two custom architectures, Shallow Combo has better performance on predicting population on a single village, while Deep Combo captures more general information in the region and has better accuracy when predicting aggregated population at the subdistrict level.
|Model||Village level||Subdistrict level|
5.5. Comparison with Prior Works
In this section, we compare our models with others from prior studies, using aggregated evaluation metrics. Table 3 summarizes the reported accuracies from the papers from Doupe et al. in (2016) with study area in Kenya, and from Robinson et al. in 2017 with study area in the United States. Since we do not use subdistrict level census population for either training or prediction improvement, to ensure the fair comparison, we compare to the models in the papers also without the assistance of additional data from the aggregated level, namely LL-raw from Doupe’s paper and CONVRAW from Robinson’s paper. Comparing with Robinson’s paper shows that even though we use a smaller (average) aggregation area in India, we still achieve significantly better and MAPE. The aggregation area in Doupe’s study is even smaller, but their model also has much higher errors compared to ours.
|Paper||Ours||Robinson 2017||Doupe 2016|
|Aggr. area||424 km||2584 km||88 km|
However, it is difficult to compare models trained and tested across different countries. To address this problem, we use the same methods from Robinson et al., and apply them to our India study area. Specifically, we trained a classification model using a single Landsat input. The categorical outputs of the CNN are in the bins of population density values for each Indian village. Following its CONVRAW approach, we convert the predicted class to population density, and aggregate the population at the subdistrict level. The predictions are still not as good as most of our models, which are shown as CONVRAW in Table 2. Its performance is close to our L8 model, which differs from it by using regression instead of classification. The comparison shows the regression model is better at predicting fine-grained population while classification is better for aggregate predictions.
Finally, we visualize the population mapping of different models in districts of India, the higher administrative level than subdistrict, as shown in Figure 6. It includes 533 districts that contain evaluation results from 1717 validation subdistricts. The visualization includes our Deep Combo and Shallow Combo model predictions compared with ground-truth population density distribution across districts in India. It also compares the prediction errors of the two models with those of the baseline LandScan and CONVRAW model from Robinson et al. Overall, both Deep Combo and Shallow Combo models map the rural village population of India very close to the true distribution. The prediction errors map further reveals that Deep Combo has better performance on high level aggregation than other models, while LandScan tends to underestimate the population and CONVRAW tends to overestimate.
The results of our CNN models on the population density prediction directly from satellite imagery are very promising. We see a higher accuracy than previous methods on both per-image raw level and aggregated region level. We attribute the success to the following reasons:
Additional Sentinel-1 satellite imagery that provides more information on human settlements.
Custom Shallow Combo and Deep Combo CNN architectures that better utilize different imagery sources.
A larger and fine-grained dataset for CNN training.
Better computational resources that enable larger image input size and batch size.
Our study largely achieves the initial goal of producing accurate population estimates enabling population mapping directly from satellite imagery, especially for rural areas. The results show that population prediction at a relatively coarse scale (such as district or subdistrict level) is quite accurate, while prediction at the village level directly from satellite imagery remains challenging. However, our village level predictions already have significantly better performance than traditional methods such as LandScan, and may still have room for improvement if images with higher resolution are available. Our population mapping models are likely useful for assisting governments in better providing for their citizens, improving resource allocation in natural disasters, aiding in infectious disease tracking, and reducing bias in progress measurement for the United Nations SDGs.
Acknowledgements.This work was supported by Data for Development Initiative at the Stanford Center on Global Poverty and Development. All data in this work was processed and shared by Paul Novosad (Dartmouth College) and Sam Asher (World Bank).
- Carr-Hill (2013) Roy Carr-Hill. 2013. Missing millions and measuring development progress. World Development 46 (2013), 30–44.
- CIESIN (2010) CIESIN. 2010. Gridded Population of the World (GPW), v4: Population Count Grid. http://sedac.ciesin.columbia.edu/data/collection/gpw-v4/methods/method1
- Doupe et al. (2016) Patrick Doupe, Emilie Bruzelius, James Faghmous, and Samuel G Ruchman. 2016. Equitable development through deep learning: The case of sub-national population density estimation. In Proceedings of the 7th Annual Symposium on Computing for Development. ACM, 6.
- ESA (2014) ESA. 2014. Sentinel-1 - Missions. https://sentinel.esa.int/web/sentinel/missions/sentinel-1
- Goodchild et al. (1993) Michael F Goodchild, Luc Anselin, and Uwe Deichmann. 1993. A framework for the areal interpo- lation of socioeconomic data. Environment and planning (1993).
- Hay et al. (2005) Simon I. Hay, Abdisalan M. Noor, Andrew Nelson, and Andrew J. Tatem. 2005. The accuracy of human population maps for public health application. Tropical medicine & international health : TM & IH 10 10 (2005), 1073–86.
- ORNL (2011) ORNL. 2011. LandScan - Oak Ridge National Laboratory. http://web.ornl.gov/sci/landscan/landscan_documentation.shtml#01
- Robinson et al. (2017) Caleb Robinson, Fred Hohman, and Bistra Dilkina. 2017. A Deep Learning Approach for Population Estimation from Satellite Imagery. CoRR abs/1708.09086 (2017). arXiv:1708.09086 http://arxiv.org/abs/1708.09086
- Schmitt and Crosetti (1954) Robert C. Schmitt and Albert H. Crosetti. 1954. Accuracy of the Ratio-Correlation Method for Estimating Postcensal Population. Land Economics 30, 3 (1954), 279–281. http://www.jstor.org/stable/3144384
- Schneider et al. (2009) Annemarie Schneider, Mark A Friedl, and David Potere. 2009. A new map of global urban extent from modis satellite data. Environmental Research Letters (2009).
- SECC (2011) SECC. 2011. Socio Economic and Caste Census-2011. http://secc.gov.in/welcome
- Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
- Smith (1987) Stanley K Smith. 1987. Tests of forecast accuracy and bias for county population projections. J. Amer. Statist. Assoc. (1987).
- Stevens et al. (2015) Forrest R Stevens, Andrea E Gaughan, Catherine Linard, and Andrew J Tatem. 2015. Disaggregating census data for population mapping using random forests with remotely-sensed and ancillary data. PloS one 10, 2 (2015), e0107042.
- Tatem et al. (2012) Andrew J. Tatem, Susana Adamo, Nita Bharti, Clara R. Burgert, Marcia Castro, Audrey Dorelien, Gunter Fink, Catherine Linard, Mendelsohn John, Livia Montana, Mark R. Montgomery, Andrew Nelson, Abdisalan M. Noor, Deepa Pindolia, Greg Yetman, and Deborah Balk. 2012. Mapping populations at risk: improving spatial demographic data for infectious disease modeling and metric derivation. Population Health Metrics 10, 1 (16 May 2012), 8. https://doi.org/10.1186/1478-7954-10-8
- Tiecke et al. (2017) Tobias G. Tiecke, Xianming Liu, Amy Zhang, Andreas Gros, Nan Li, Gregory Yetman, Talip Kilic, Siobhan Murray, Brian Blankespoor, Espen B. Prydz, and Hai-Anh H. Dang. 2017. Mapping the world population one building at a time. CoRR abs/1712.05839 (2017). arXiv:1712.05839 http://arxiv.org/abs/1712.05839
- UN (2014) UN. 2014. Measuring population and housing: practices of the unece countries in the 2010 round of censuses. Technical Report. United Nations.
- UN (2015) UN. 2015. Sustainable Development Goals. http://www.undp.org/content/undp/en/home/sustainable-development-goals.html
- UN (2016a) UN. 2016a. Census dates for all countries. https://unstats.un.org/unsd/demographic/sources/census/censusdates.htm
- UN (2016b) UN. 2016b. Census reaches vulnerable women and girls in a remote area of myanmar for the very first time. https://www.unfpa.org/news/census-reaches-vulnerable-women-and-girls-remote-area-myanmar-very-first-time
- UNICEF (2016) UNICEF. 2016. The state of the world’s children 2012: A fair chance for every child. Technical Report. UNICEF.
- USGS (2013) USGS. 2013. Landsat Missions. https://landsat.usgs.gov
- Worldpop (2013) Worldpop. 2013. Worldpop - What is WorldPop? http://www.worldpop.org.uk/
- WorldPop (2018) WorldPop. 2018. WorldPop :: Methods. https://www.worldpop.org/methods