Mapping Missing Population in Rural India: A Deep Learning Approach with Satellite Imagery

Millions of people worldwide are absent from their country's census. Accurate, current, and granular population metrics are critical to improving government allocation of resources, to measuring disease control, to responding to natural disasters, and to studying any aspect of human life in these communities. Satellite imagery can provide sufficient information to build a population map without the cost and time of a government census. We present two Convolutional Neural Network (CNN) architectures which efficiently and effectively combine satellite imagery inputs from multiple sources to accurately predict the population density of a region. In this paper, we use satellite imagery from rural villages in India and population labels from the 2011 SECC census. Our best model achieves better performance than previous papers as well as LandScan, a community standard for global population distribution.



There are no comments yet.


page 2

page 4

page 7


A Deep Learning Approach for Population Estimation from Satellite Imagery

Knowing where people live is a fundamental component of many decision ma...

Intercensal updating using structure-preserving methods and satellite imagery

Censuses are fundamental building blocks of most modern-day societies, y...

Slum Segmentation and Change Detection : A Deep Learning Approach

More than one billion people live in slums around the world. In some dev...

Satellite Images and Deep Learning to Identify Discrepancy in Mailing Addresses with Applications to Census 2020 in Houston

The accuracy and completeness of population estimation would significant...

Measuring Human and Economic Activity from Satellite Imagery to Support City-Scale Decision-Making during COVID-19 Pandemic

The COVID-19 outbreak forced governments worldwide to impose lockdowns a...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

In 2015, the United Nations set forth seventeen objectives to “end poverty, protect the planet and ensure prosperity for all” known as the Sustainable Development Goals (SGD) (UN, 2015)

. To monitor progress and ultimately achieve these objectives, accurate population statistics are essential. It is estimated that currently 300-350 million people worldwide are not included in their country’s official population document, which hurts the measurement of SGD progress

(Carr-Hill, 2013). The ability to quickly and cost-effectively produce an accurate population map for a country has a multitude of benefits. Those missing populations are more likely to be marginalized communities which already do not receive sufficient resources from the government (UNICEF, 2016). An accurate population distribution is an essential basis for socioeconomic statistics, such as food, water, and energy demand in different regions of a country, which influence the policy-making and spending decisions of its government. Additionally, during natural disasters such as earthquakes and floods, an accurate population map can help organize rescue efforts more quickly and effectively. For regions with high infectious disease rates, a fine-grained population map also helps to prevent the spread of infectious diseases to locations with dense population (Tatem et al., 2012; Hay et al., 2005).

However, creating a population map with high accuracy and high resolution is a challenging problem. Traditionally, it is done by performing a high-cost national census. The USAID Demographic and Health Survey (DHS) program performs surveys for developing countries typically every 5 years, and each survey costs anywhere from 1.1 million to 9.7 million USD (Doupe et al., 2016). The census surveys are even more expensive in developed countries like Europe, with a median cost of USD 5.57 per capita in 2010 (UN, 2014). For some countries with financial difficulties or political instability, the census is carried out less frequently, as few as once every few decades (UN, 2016a, b). Reliance on out-of-date population statistics can lead to significant errors if used for policy making or resource allocation.

In this project, we aim to predict the population density of rural villages of India from high-resolution satellite imagery by utilizing Convolutional Neural Network (CNN) models. With the availability of high-frequency satellite images, we can predict population density every few days, saving the costs of on-site census surveys and avoiding the inaccuracies caused by the infrequency of census surveys. We demonstrate state-of-the-art prediction performance in villages of all states in India. By using satellite images with 10-30 meter resolution, our best models can predict aggregated village population in one Subdistrict (akin to a US county) with of 0.93, and individual village population density with of 0.44.

2. Related Work

2.1. Traditional Methods

Traditionally, population mapping is divided into two approaches, population projection and population disaggregation. Population projection predicts the future or current population of a region based on historical data. For most cases, simple linear regression is sufficient for the projection

(Smith, 1987). In more complex models, projections take into consideration historic population data, birth rates, registered vehicles, etc (Schmitt and Crosetti, 1954). These models were used to project US county population in 5 years, which have very high accuracy with of 0.99. However, they don’t provide information about the population distribution within each county.

Figure 1. Population disaggregation visualization for WorldPop(A,D), GRUMP (B,E), LandScan (C,F). The upper 3 figures show a northeast region of Guinea along the Niger River; the lower 3 figures show the region around the largest city of Tanzania, Dar es Salaam. Figure from WorldPop (, 2018).

The more challenging task is population disaggregation, which involves estimating the population distribution of a region given the total population. The most basic method is areal weighting/ interpolation, which assumes a uniform distribution across the region with a single population value

(Goodchild et al., 1993). The Gridded Population of the World (GPW) uses areal weighting with a resolution of 30 arc-seconds (approximately 1 km at the equator) (CIESIN, 2010). There are also many tools which implement a weighted surface for estimating a population’s distribution, a technique called dasymetric weighting (Robinson et al., 2017). The Global Rural Urban Mapping Project (GRUMP) uses nightlight imagery to add urban and rural boundaries to GPW (Schneider et al., 2009). LandScan estimates the weighted surface (with 30 arc-seconds resolution) for population distribution based on land cover, roads, slope, urban areas, village locations (ORNL, 2011). AfriPop, AsiaPop, and AmeriPop are similar but for region-specific population disaggregation calculations, and they are combined in the 2013 WorldPop project (Worldpop, 2013), which has a higher resolution of 100 meters. These disaggregation methods are compared visually in Figure 1.

2.2. Machine Learning Methods

In addition to the above traditional GIS approach, machine learning algorithms have been proposed in recent years to obtain better population disaggregation results. A random forest approach was used to estimate the population at 100m resolution for Vietnam, Cambodia, and Kenya, using features similar to LandScan

(Stevens et al., 2015). The Facebook Connectivity Lab used a tailored CNN model to detect man-made structures from satellite imagery with 0.5m resolution, which achieved average precision of 0.95 and recall of 0.91. They then redistributed the population in GPW evenly to the areas covered by human-made structures, and create population maps with 30 meter resolution for 18 countries, not including India (Tiecke et al., 2017).

Instead of disaggregation based on population estimates from census surveys, some CNN models are trained to estimate population directly from satellite imagery inputs. Doupe et al. combined Landsat-7 satellite imagery with (DMSP/OLS) nighttime lights as CNN input, and predicted the log normalized population density for an area of 8km (called LL-raw), where the ground-truth label was average population density of Sublocation (akin to a US county). The outputs were then converted into weights and used to create a weighted population density surface across the country with a single known total population (LL-distributed). The model was trained with 2002 Tanzanian Enumeration Areas and tested with 2009 Kenya Sublocations. During the test phase, the estimated population densities were averaged at the Sublocation level, and then compared to other methods. The results show that LL-raw has better accuracy than GRUMP and GWP estimates, and that LL-distributed outperforms a random forest model by 177%  on RMSE (Doupe et al., 2016).

Robinson et al. in (2017) adopted a similar CNN approach to Doupe et al.

, but changed the model output from regression to classification of the power level of population. The model only uses Landsat imagery as input, and predicts population in the US with US Census Summary Grids data as ground-truth labels. The study divided the country into 15 regions, and trained an individual model for each region. The raw output feature vectors of the CNN model were first converted into population values for each input image, and then summed at the county level to produce


. The outputs of the CNN were also fed into a second layer gradient boosting model to get an improved population estimate for each county called

CONVAUG, where the census county population was used as labels. The results of CNN models achieved more than 0.9 against the ground truth, but still cannot perform better than the US government estimate based on historical census data (Robinson et al., 2017).

3. Data

3.1. Population Dataset

Our ground-truth Indian population dataset comes from a census survey Socio-Economic Caste Census in the year 2011 (SECC, 2011). It includes more than 500,000 rural villages, covering 32 states, 619 districts, and 5724 sub-districts. It also provides the area of each surveyed village. Figure 2 shows the distribution of areas follows a power law. These areas are used to calculate population density for each village. Similar to previous papers, we log normalize population density values with base 2, because most villages have small population density, and only a few have large density. The original density input may cause the model to have less ability to predict villages with higher population density. The distribution of villages density after normalization is shown in Figure 3.

Figure 2. Village area distribution.
Figure 3. Village population density distribution after normalization.

Inspecting the population datasets we observed some outlier values, such as a 100 km

area with just one person. We assume these are due to data collection and handling mistakes, therefore we remove 1% of village data that had extreme population density values to prevent the model training from being affected by those outliers. More specifically, villages with top 0.5% highest density and bottom 0.5% lowest density were removed from the datasets.

3.2. Satellite Imagery

For each village, we prepare one image from each satellite imagery source such that the village center is found at the image center. The village population depends on its area but our images have fixed size covering the same area, therefore we use population density as the output. We obtained 2 sets of satellite imagery from 2011, the same year the survey was conducted. The first set is from Landsat-8, an updated satellite from Landsat-7 whose images are also used by papers from Robinson et al. in (2017) and Doupe et al. in (2016) (USGS, 2013). In contrast with previous papers which use most bands of Landsat, we use only Red, Green, and Blue (RGB) bands. The resulting images show the target regions in the same colors that humans see, and have 30-meter resolution. The second set is from Sentinel-1, a radar imaging satellite that measures ground surface reflectance, thereby capturing roads and roofs more accurately (due to their higher reflectance than natural land) (ESA, 2014). This is a new dataset not used in the previously mentioned papers. Sentinel-1 images have 10-meter resolution, and the raw channel values are converted to visualized RGB images to match Landsat-8. Both sets of images are converted to JPEG format from raw GeoTIFF files, which enables easy visualization and compression. Figure 4 displays examples of these images.

Figure 4. Landsat-8 and Sentinel-1 imagery examples, with corresponding ground truth population density from the survey.
Figure 5. Custom CNN architecture Shallow Combo (upper) and Deep Combo (lower)

Since Landsat-8 and Sentinel-1 have different spatial resolutions, they are cropped to cover the same area. Based on the village area distribution as shown in Figure 2, the covered area of images is determined to be 20.25 km (4.5km 4.5km), so that more than 95% of villages are contained within a single satellite image.

3.3. Dataset Partition

We split the dataset of ground-truth population (outputs) and images (inputs) pairs into 70% training and 30% validation partitions. We took additional measures to avoid the overlap of training images with validation images, which may affect the reliability of the validation split. Overlaps are very likely because the total area of all satellite images (over 10 million km) we obtained is much larger than the total area of India (about million km). To address this issue, we partition the data at the subdistrict level. We split all subdistricts into 4007 training and 1717 validation subdistricts. The training partition only has images of villages that belong to training subdistricts, and similarly for the validation partition. However, it is still possible that images overlap along the boundary of two adjacent subdistricts, contaminating the split. Thus, we remove additional images from the training partition. We say a pair of images overlap if the distance between their centers is closer than half of the height or width of the image (2.25km). Approximately an additional overlapping images are removed from the training partition.

4. Method

We propose a deep learning approach that uses satellite images as inputs to a Convolutional Neural Network (CNN) to predict population density. The ground-truth population data is used as the label to train the model. We start with the VGG16 architecture (Simonyan and Zisserman, 2014)

, using as input either Landsat-8 or Sentinel-1 RGB images. We use the implementation of VGG 16 from the TensorFlow Slim Library. To adapt the model for a regression problem, the model output size is set to be 1 for a single

population value. Additionally, the loss function in the model is changed to Mean Squared Error (MSE). Image inputs are resized to

(width height

channels), and we apply image augmentation including random cropping and flipping during training. The model weights are initialized with pre-trained weights from the ILSVRC-2010-CLS ImageNet classification dataset, omitting the last (classification) layer.

To fully utilize multiple satellite image sources in this study, we design two custom CNN architectures. The first custom architecture, called Shallow Combo, is shown in Figure 5

. It is a modified version of VGG16 where Landsat-8 and Sentinel-1 images are input into the model separately as two branches. In each branch, two convolution layers and one max pooling layer (CONV1 in VGG16) are applied to each image. The two branches are then concatenated along the channel dimension. 1

1 convolutions filters are applied to halve the number of channels after concatenation, which fits the input shape of next convolutional layer. The rest of the VGG16 layers are applied to the merged branch. After the last fully connected layer of VGG16, one more fully connected layer of size 100 is added to the network. The layer is lastly used to predict the output population density. During training, weights of all convolutional layers and the first two fully connected layers are initialized with pre-trained weights from the basic VGG16 architecture trained on ImageNet.

The second custom CNN called Deep Combo is an improved version of Shallow Combo, also shown in Figure 5. The branches of Landsat-8 and Sentinel-1 are fed through all convolutional layers of VGG16, and concatenated along the channel dimension before the first fully connected layer. The same convolutions filters are applied to obtain the same input shape before the fully connected layer. ImageNet weights are also used for initialization until the second fully connected layer.

5. Experimental results

5.1. Evaluation Methods

We evaluate each model on the validation set on two levels: raw village level and aggregated subdistrict level. The raw village level predictions are per-village population density estimates from the model, which are compared with the ground-truth census population density of each village. This raw evaluation represents the most fine-grained comparison possible in our dataset. At the aggregated subdistrict level, the raw per-village density predictions are converted to the total village population using the village area from the survey. The village populations within each subdistrict are aggregated to a total population for that subdistrict. Evaluations at the aggregate levels are performed in previous papers, which we are going to compare with. In short, the village level evaluation uses population density values, while subdistrict level evaluation uses real population numbers.

In terms of evaluation metrics, we compare different models mainly based on

and Pearson Correlation.

measures how much true variance is captured by the model, and attains a perfect value at 1. If a model makes constant average predictions, it will have

score of 0. Pearson Correlation measures the linear correlation between the predicted and true values, implying a total positive correlation when +1 and a total negative correlation when -1. We also use other metrics to compare results with previous papers, such as MAPE (Mean Absolute Percentage Error), %RMSE (percent Root Mean Squared Error).

5.2. Baseline: LandScan

To establish a baseline comparison for this project, we compared our ground truth population densities with those of LandScan, which is considered a community standard for global population distribution (ORNL, 2011). LandScan is a grid containing a numerical population estimate for every cell of size 30x30 arc-second in earth coordinates, which is around 1km on the equator (smaller at different latitudes). This estimate represents an average (over 24 hours), or ambient population distribution (not just sleeping location). For a single village, we take a square of 30 arc-second cells which most closely matches the villages area, as well as the village latitude and longitude. The population density of the village is calculated by dividing the LandScan population estimates within the cells with the area covered by the cells.

We use this method to obtain the LandScan estimate of population density for all villages in our validation dataset. The evaluation results are shown as LandScan model in Table 2. It has good performance in estimating aggregated population on subdistrict level, but it doesn’t perform well on per-village level. It is reasonable because LandScan is a traditional population disaggregation method using only the general features on the maps, such as rivers, roads, land cover types. It does not have fine-grained information as our models have from satellite imagery.

5.3. Single State Training

Due to the large amount of data, we initially generate a smaller dataset with villages in a single state, Gujarat. The small dataset contains around 13,000 villages, which allows us to compare different models quickly, and to tune the hyper-parameters of the CNN models. We experiment with different input sources and the different CNN architectures described above. We train the basic VGG-16 separately with Landsat-8 (L8) and Sentinel-1(S1) inputs. The L8 and S1 inputs are also concatenated along the image channels and feed into VGG-16 as single input (S1L8-Concat). Furthermore, Shallow Combo and Deep Combo architectures are used to process two type of satellite images separately. We use NVIDIA Tesla P100 GPU with 16G memory from Google Compute Engine for training. It sets the limit of input batch size to 48 for the Deep Combo architecture, thus this batch size is also used for other models to ensure a fair comparison. With several iterations, the optimal hyper-parameters are found as follows: learning rate: , exponential learning decay rate: , weight decay: 5x10, dropout: 0.8.

Model Village level Subdistrict level
Pearson Pearson
L8 0.111 0.443 0.597 0.775
S1 0.120 0.425 0.701 0.839
L8S1-Concat 0.111 0.472 0.305 0.764
Shallow Combo 0.200 0.488 0.739 0.889
Deep Combo 0.167 0.489 0.772 0.921
Table 1. Evaluation of Single State Training

The evaluations of the above 5 models are shown in Table 1. In general, the models have lower accuracies when predicting per-village level population densities, but the accuracy significantly improves when the individual predictions are aggregated for the subdistrict level evaluation. The S1 model has better performance than the L8 model, possibly due to its high resolution as well as that measuring ground reflectance by radar is more effective. However, simply concatenating L8 and S1 inputs in L8S1-Concat does not lead to further improvements, resulting in performance slightly worse than using S1 alone. This is likely due to training difficulties when using a combination of L8 and S1 images. In contrast, our Shallow Combo and Deep Combo architectures have significantly better performance. These results indicate that when using inputs from different sources, the CNN model needs to process them separately to extract their semantic features, so that useful information from both sides can be utilized.

5.4. All States Training

After training on a single state, we move on to train on around 350,000 villages in all 32 states in India. L8, S1, Shallow Combo and Deep Combo are trained respectively with the same hyper-parameters found when training in the single state case. The evaluation results are summarized in Table 2. With more data provided, all the models show significant improvements. All models exceed the performance of the LandScan baseline. Shallow Combo and Deep Combo models still outperform the basic VGG-16 models with a single image source. By comparing between two custom architectures, Shallow Combo has better performance on predicting population on a single village, while Deep Combo captures more general information in the region and has better accuracy when predicting aggregated population at the subdistrict level.

Model Village level Subdistrict level
Pearson Pearson
L8 0.346 0.596 0.838 0.919
S1 0.327 0.597 0.890 0.944
Shallow Combo 0.438 0.663 0.906 0.954
Deep Combo 0.389 0.645 0.931 0.965
LandScan -0.553 0.476 0.835 0.928
CONVRAW 0.322 0.592 0.850 0.937
Table 2. Evaluation of All States Training

5.5. Comparison with Prior Works

In this section, we compare our models with others from prior studies, using aggregated evaluation metrics. Table 3 summarizes the reported accuracies from the papers from Doupe et al. in (2016) with study area in Kenya, and from Robinson et al. in 2017 with study area in the United States. Since we do not use subdistrict level census population for either training or prediction improvement, to ensure the fair comparison, we compare to the models in the papers also without the assistance of additional data from the aggregated level, namely LL-raw from Doupe’s paper and CONVRAW from Robinson’s paper. Comparing with Robinson’s paper shows that even though we use a smaller (average) aggregation area in India, we still achieve significantly better and MAPE. The aggregation area in Doupe’s study is even smaller, but their model also has much higher errors compared to ours.

Paper Ours Robinson 2017 Doupe 2016
Study area India US Kenya
Aggr. area 424 km 2584 km 88 km
Model Deep Combo CONVRAW LL-Raw
0.931 0.910 -
MAPE 21.5 73.8 -
%RMSE 24.3 - 145.43
Table 3. Comparison with Prior Works

However, it is difficult to compare models trained and tested across different countries. To address this problem, we use the same methods from Robinson et al., and apply them to our India study area. Specifically, we trained a classification model using a single Landsat input. The categorical outputs of the CNN are in the bins of population density values for each Indian village. Following its CONVRAW approach, we convert the predicted class to population density, and aggregate the population at the subdistrict level. The predictions are still not as good as most of our models, which are shown as CONVRAW in Table 2. Its performance is close to our L8 model, which differs from it by using regression instead of classification. The comparison shows the regression model is better at predicting fine-grained population while classification is better for aggregate predictions.

Finally, we visualize the population mapping of different models in districts of India, the higher administrative level than subdistrict, as shown in Figure 6. It includes 533 districts that contain evaluation results from 1717 validation subdistricts. The visualization includes our Deep Combo and Shallow Combo model predictions compared with ground-truth population density distribution across districts in India. It also compares the prediction errors of the two models with those of the baseline LandScan and CONVRAW model from Robinson et al. Overall, both Deep Combo and Shallow Combo models map the rural village population of India very close to the true distribution. The prediction errors map further reveals that Deep Combo has better performance on high level aggregation than other models, while LandScan tends to underestimate the population and CONVRAW tends to overestimate.

Figure 6. Population mapping visualization on the validation dataset. The upper row shows the population predictions of Deep Combo and Shallow Combo model compared to the ground-truth values. The lower row shows the prediction errors of Deep Combo and Shallow Combo compared with the baseline LandScan, and CONVRAW model from a previous paper (Robinson et al., 2017). The grey areas are the regions with no validation data.

6. Conclusion

The results of our CNN models on the population density prediction directly from satellite imagery are very promising. We see a higher accuracy than previous methods on both per-image raw level and aggregated region level. We attribute the success to the following reasons:

  • Additional Sentinel-1 satellite imagery that provides more information on human settlements.

  • Custom Shallow Combo and Deep Combo CNN architectures that better utilize different imagery sources.

  • A larger and fine-grained dataset for CNN training.

  • Better computational resources that enable larger image input size and batch size.

Our study largely achieves the initial goal of producing accurate population estimates enabling population mapping directly from satellite imagery, especially for rural areas. The results show that population prediction at a relatively coarse scale (such as district or subdistrict level) is quite accurate, while prediction at the village level directly from satellite imagery remains challenging. However, our village level predictions already have significantly better performance than traditional methods such as LandScan, and may still have room for improvement if images with higher resolution are available. Our population mapping models are likely useful for assisting governments in better providing for their citizens, improving resource allocation in natural disasters, aiding in infectious disease tracking, and reducing bias in progress measurement for the United Nations SDGs.

This work was supported by Data for Development Initiative at the Stanford Center on Global Poverty and Development. All data in this work was processed and shared by Paul Novosad (Dartmouth College) and Sam Asher (World Bank).