Predicting Livelihood Indicators from Crowdsourced Street Level Images

06/15/2020 · Jihyeon Janel Lee, et al. · Stanford University

Major decisions from governments and other large organizations rely on measurements of the populace's well-being, but making such measurements at a broad scale is expensive and thus infrequent in much of the developing world. We propose an inexpensive, scalable, and interpretable approach to predict key livelihood indicators from public crowd-sourced street-level imagery. Such imagery can be cheaply collected and more frequently updated compared to traditional surveying methods, while containing plausibly relevant information for a range of livelihood indicators. We propose two approaches to learn from the street-level imagery. The first creates multi-household cluster representations by detecting informative objects; the second uses a graph-based approach that leverages the inherent structure between images. By visualizing what features are important to a model and how they are used, we can help end-user organizations understand the models and offer an alternate approach for index estimation that uses cheaply obtained roadway features. By comparing our results against ground data collected in nationally-representative household surveys, we show our approach can be used to accurately predict indicators of poverty, population, and health across India.


1 Introduction

In 2015, all member states of the United Nations adopted 17 Sustainable Development Goals, including eliminating poverty, achieving good health and well-being, and stimulating economic growth by 2030 United Nations (2015). To evaluate and monitor countries’ progress toward these goals, national governments and international organizations typically conduct nationally-representative household surveys that collect information on a range of livelihood indicators from households distributed throughout a given country. These surveys, such as the Demographic and Health Surveys (DHS) Program, provide critical insight into local economic and health conditions DHS Program (2019). However, they are costly and time-consuming to conduct, particularly if they involve surveying remote populations linked by poor infrastructure. As a result, many years or even decades pass between surveys in some countries, and the ones that do occur may only capture an extremely small proportion of households.

Recent research has explored the use of passively collected data sources as cheaper alternatives to door-to-door or paper forms of data collection, to augment or eventually replace these expensive household surveys. Proposed data sources include social media posts Signorini et al. (2011); Pulse (2014), mobile phone networks Blumenstock et al. (2015), and Wikipedia articles Sheehan et al. (2019). Other work has used expensive remote-sensing imagery from satellites to predict livelihood indicators, including road quality in Kenya Cadamuro et al. (2019) and assets and consumption expenditure in other African countries and India Jean et al. (2016); Yeh et al. (2020); Pandey et al. (2018). However, these approaches also face challenges, including large costs to scale Uzkent et al. (2019); Uzkent and Ermon (2020) and sometimes poor generalization to other locations and indicators Head et al. (2017); Bruederle and Hodler (2018). Models often also lack interpretability, inhibiting their adoption by policymakers and practitioners who understandably need to be convinced that deep learning models are doing something sensible.

Here we introduce a scalable, interpretable approach that uses street-level imagery for livelihood prediction at a local level across India. We utilize Mapillary, a global citizen-driven street-level imagery database Neuhold et al. (2017). Publicly available image data has been used before to predict outcomes like crime rate, house prices, and voting patterns Arietta et al. (2014); Andersson et al. (2017); Law et al. (2019); Gebru et al. (2017), but that research was conducted primarily in developed and/or limited regions and relied on private data providers, whose data is unavailable or incomplete for much of the developing world. Although Mapillary cannot match the consistent quality of commercially-driven imagery from large corporations like Google or Microsoft, its widespread and growing availability throughout the developing world makes it an appealing candidate as a new passively collected data source for predicting livelihood indicators. For example, in the span of eight months, the number of Mapillary images doubled from 500 million to 1 billion, with users capturing and verifying imagery from various types of mobile devices [30; 10]. The work most similar to ours is Lee et al. (2015), which used user-uploaded Flickr imagery to predict "geo-attributes" (indicators such as population density, infant mortality, and elevation) across the world. We establish Lee et al. (2015) as the initial benchmark and make significant improvements on multiple tasks.

We show how information from Mapillary imagery can be used to accurately predict a range of livelihood indicators across all of India, one of the most populous and economically diverse countries in the world. We present two complementary approaches: (1) The first creates representations for multi-household clusters by segmenting street-level images and aggregating informative objects, and then trains models to visualize what the most predictive features are and how they are used. (2) While the strength of the first approach is interpretability, we also propose a second approach that learns the relationships between images and leverages the inherent spatial structure of clusters by representing them as graphs, where each image is a feature-rich node connected by edges based on spatial distance. Our approaches perform classification and regression on three indicators — poverty, population, and women’s body mass index (BMI, a key nutritional indicator) — in villages and urban neighborhoods covering the entire country. Our experiments show that we achieve more than 80% accuracy on the task of binary classification for each indicator and $r^2$ scores of 0.54, 0.89, and 0.57 on the task of performing regression for wealth, population density, and BMI, respectively. We expect our method to be employed as a cheap, scalable, and effective alternative to traditional surveying methods for organizations to measure the well-being of developing regions. We will release our code and dataset upon publication.

2 Datasets

First, we define the general problem of making predictions on geospatially located clusters. Assume there are $N$ geographic clusters. A cluster $c$ represents a circular area with center $(\mathrm{lat}_c, \mathrm{lon}_c)$. There is a set of street-level images that fall within its spatial boundaries, $I_c = \{I_1, \dots, I_{m_c}\}$, where each $I_i$ is an image and $m_c$ is the total number of images. Each cluster also has surveyed variables of livelihood indicators, represented by $y_c$. We aim to learn a mapping $f: \mathcal{P}(I) \to Y$ to predict $y_c$ given $I_c$, where $\mathcal{P}(I)$ is the powerset of the image set $I$ and $Y$ is the label space. The regression task entails predicting the indicator directly, i.e. $\hat{y}_c \in \mathbb{R}$. The classification task involves predicting $\hat{y}_c \in \{0, 1\}$; an indicator value greater than or equal to the median results in a label of 1 and 0 otherwise. We perform these tasks over sets of images because clusters can have a variable number of images.

We specialize the general problem to a dataset of street-level imagery and cluster-level labels of indicators in India. Each cluster $c$ represents a group of households within a 5 km-radius area with label $y_c$, which contains the index value and class label for each of the three indicators, i.e. poverty, population, and women’s BMI. We discuss the sources of $I_c$ and $y_c$ in the following sections.

2.1 Mapillary for Street-Level Images

The Mapillary API provides access to a variety of resources including images, sequences, users who contribute geo-tagged images, and map features Mapillary (2019). The images are community-driven; anyone can create an account and upload their own images with EXIF-embedded GPS coordinates or identify the image location on a map. In comparison, only 5% of Flickr’s imagery is geotagged Hauff (2013). There were approximately 28.9 million images, most available in the highest resolution of 2048×1536 pixels, that fall within the spatial boundaries of India since 2014. The images were then matched to clusters, and only clusters with a sufficient number of images were kept, resulting in 1.1 million images in total across the retained clusters. 98% of the images in the Mapillary dataset share one of two standard resolutions; for the remaining 2%, we download images at the closest available resolution.
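As a concrete sketch of this matching step, the following is a minimal brute-force haversine join; the `(id, lat, lon)` tuple layout and the nested-loop strategy are our assumptions, not the authors' released pipeline:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def assign_images_to_clusters(images, clusters, radius_km=5.0):
    """images: [(image_id, lat, lon)]; clusters: [(cluster_id, lat, lon)].
    Returns {cluster_id: [image_id, ...]} for images inside each 5 km radius."""
    assignment = {cid: [] for cid, _, _ in clusters}
    for img_id, ilat, ilon in images:
        for cid, clat, clon in clusters:
            if haversine_km(ilat, ilon, clat, clon) <= radius_km:
                assignment[cid].append(img_id)
    return assignment
```

In practice a spatial index (e.g. a k-d tree on cluster centers) would replace the inner loop for 28.9 million images; the sketch only fixes the semantics of the join.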

2.2 Livelihood Indicators

To demonstrate the usefulness of street-level imagery, we perform experiments on three varied livelihood indicators: poverty, population, and a health-related measure. Each index is naturally continuous. After rescaling the values to be between -1 and 1, we use them directly for regression and discretize them as below or above the median for classification. Figure 1 shows ground-truth labels.
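A minimal sketch of this label construction, assuming simple min-max rescaling (the paper states only that values are rescaled to lie between -1 and 1):

```python
import numpy as np

def make_labels(index_values):
    """Rescale a continuous indicator to [-1, 1] for regression and
    binarize at the median for classification, as described above."""
    v = np.asarray(index_values, dtype=float)
    lo, hi = v.min(), v.max()
    regression_y = 2.0 * (v - lo) / (hi - lo) - 1.0      # min-max rescale to [-1, 1]
    classification_y = (v >= np.median(v)).astype(int)   # 1 if >= median, else 0
    return regression_y, classification_y
```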

Figure 1: Left to right: wealth, population, and BMI ground-truth cluster labels. To prevent class imbalance, we categorize clusters as above or below the median index value for classification. We also show a close-up view of a given cluster $c$, where the yellow dots represent its images.

Poverty We obtain wealth index values from the Demographic and Health Survey (DHS) program’s most recently completed survey in 2015-16. DHS data is clustered; each household contributes a single data point, but all households within a 5 km-radius cluster share the same geographic coordinates to ensure privacy. There are 31,915 clusters, or 688,919 data points total, in India. The wealth index is calculated from household assets and characteristics, such as vehicles, construction material of the home, source of water, etc. We consider poverty to be the inverse of wealth.

Population Facebook’s High Resolution Population Density Maps consist of population density labels for latitude-longitude points across the world. Their data is at a much finer spatial resolution than our clusters, so we average the values within a 5 km radius of a cluster’s coordinates to produce its label.

Women’s BMI Women’s body mass index (BMI) is an important indicator of human well-being. We compute BMI by dividing weight in kilograms by height in meters squared for the 697,486 samples in the DHS survey and average the values across all women in a cluster to produce its label.
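A minimal sketch of this label computation (the function name and list-based inputs are illustrative):

```python
def cluster_bmi_label(weights_kg, heights_m):
    """BMI = weight (kg) / height (m)^2, averaged over all women in a cluster."""
    bmis = [w / (h ** 2) for w, h in zip(weights_kg, heights_m)]
    return sum(bmis) / len(bmis)

# e.g. cluster_bmi_label([58.0, 49.5], [1.60, 1.52]) -> ~22.0
```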

3 Methods

Given our dataset constructed from geotagged Mapillary images and indexes, we propose methods to learn the mapping $f: \mathcal{P}(I) \to Y$. In particular, we focus on two learning paradigms: (1) image-wise learning, where we train a model on an image $I_i$ sampled from cluster $c$, and (2) cluster-wise learning, where we train a model on all the images $I_c$ in a cluster.

3.1 Image-wise Learning

For this method, we directly map each individual image $I_i$ in the cluster to the label space as $g: I_i \to Y$. The model we train to learn this mapping is referred to as ResNet-ImageWise. As in Figure 2, image-wise predictions $\hat{y}_i$ are combined at test time using an aggregation strategy to produce final predictions $\hat{y}_c$. Each prediction is considered a vote, and the majority class is the final prediction for cluster $c$: $\hat{y}_c = \mathrm{mode}(\{\hat{y}_i\}_{i=1}^{m_c})$.
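A one-function sketch of the vote aggregation described above:

```python
from collections import Counter

def aggregate_votes(image_predictions):
    """Majority vote over per-image class predictions for one cluster."""
    return Counter(image_predictions).most_common(1)[0][0]

# e.g. aggregate_votes([1, 0, 1, 1]) -> 1
```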

3.2 Cluster-wise Learning

3.2.1 Learning from Cluster Level Object Counts

In this section, we propose a method that utilizes object counts from street-level images in a cluster. Image-level object counts are aggregated across the cluster to create cluster-wise object counts. Finally, we train a classifier or regression model on the cluster-level object counts to predict indexes.

Figure 2: Overview of the proposed methods. Top: image-wise learning, which learns the mapping $g: I_i \to Y$ from imagery and infers on clusters by aggregating predictions. Middle: cluster-wise learning, which uses a panoptic segmentation model to create cluster-level representations. Various models are trained to learn the mapping $f: v_c \to Y$. Bottom: GCN that represents clusters as a fully connected set of images. Some connections are left out in the figure for the sake of clarity. Using Graph-Conv layers followed by a Graph-Pool layer, it learns the mapping $f: (V_c, A_c) \to Y$.
Panoptic Segmentation on Mapillary Images

With the street-level imagery in Mapillary and the panoptic segmentation dataset Mapillary Vistas Neuhold et al. (2017), we can train a network to segment street-level images. Mapillary Vistas has 28 stuff and 37 thing classes, where stuff refers to amorphous regions such as "nature" and things are enumerable objects with defined shapes, like "car." It contains 25,000 annotated images with an average resolution of 9 megapixels, captured under varying conditions, times, and viewpoints, e.g. from the windshield of a car or while walking down a road. These factors make Mapillary Vistas an ideal dataset for training a model to segment objects in our India dataset.

We use the seamless scene segmentation model proposed by Porzi et al. (2019). The model consists of two main modules, instance segmentation and semantic segmentation, and a third module that fuses predictions from both to generate panoptic segmentation masks. The instance segmentation module uses Mask-RCNN He et al. (2017), and the semantic segmentation module uses an encoder-decoder architecture similar to the Feature Pyramid Network Lin et al. (2017). Finally, a ResNet50 is used to parameterize the backbone. During training, the Mapillary images are scaled such that the shortest side is $s$ pixels, where $s$ is randomly sampled from a fixed interval. The authors report a competitive mean IoU (intersection over union) score on the Mapillary Vistas test set. To be consistent with the trained model, we scale our Mapillary images from India such that the shortest side is 1024 px.
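A small sketch of this preprocessing step (using PIL; the function name is illustrative):

```python
from PIL import Image

def resize_shortest_side(img: Image.Image, target: int = 1024) -> Image.Image:
    """Scale an image so its shortest side equals `target` pixels,
    preserving the aspect ratio, to match the segmentation model's training."""
    w, h = img.size
    scale = target / min(w, h)
    return img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
```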

Cluster Level Object Counts

Using the seamless scene segmentation model, we perform panoptic segmentation on every image in each cluster, with the hypothesis that the 65 different types of roadway features can provide useful information. Ayush et al. (2020b, a) indeed found a correlation between objects detected in satellite imagery and poverty in Uganda, so we expected patterns such as more instances of "pothole" in low-wealth areas and more "bike rack" in high-wealth areas. For every cluster, each image $I_i$ is mapped to a vector of per-class object counts $o_i$. We then aggregate the detections by summing the number of instances for each class, i.e. $\sum_i o_i$. To avoid bias towards clusters with a large number of images, we append a feature representing the total number of images in the cluster, $m_c$. In the end, each cluster is represented by a feature vector $v_c \in \mathbb{R}^{66}$. Finally, we map these interpretable embeddings to the label space as $f: v_c \to Y$ to predict the index or discretized class. We refer to the models that learn this mapping as Obj-ClusterWise.
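A minimal sketch of building the 66-dimensional cluster feature from per-image counts (variable names are ours):

```python
import numpy as np

NUM_CLASSES = 65  # Mapillary Vistas stuff + thing classes

def cluster_feature_vector(per_image_counts):
    """per_image_counts: list of length-65 count vectors, one per image.
    Sum counts across images and append the image count, yielding the
    66-dimensional interpretable cluster representation v_c."""
    counts = np.sum(np.asarray(per_image_counts, dtype=float), axis=0)
    return np.concatenate([counts, [len(per_image_counts)]])
```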

3.2.2 Graph Convolutional Networks

Our methods thus far process images in a cluster independently, without leveraging the inherent structure between the images. We propose the use of Graph Convolutional Networks (GCN) Such et al. (2017) to exploit the underlying relationships between images in a cluster. We represent clusters as graphs, where image-based features serve as nodes and they are connected by edges that encode their spatial distance. We represent a graph's node features with a matrix $V$ and the connections between nodes with an adjacency matrix $A$. Our task is to learn the mapping $f: (V, A) \to Y$. We refer to the models that learn this mapping as GCN.

Since we model image connections with scalars, this GCN uses a convex combination of the adjacency and identity matrices to create a filter $H = h_0 I + h_1 A$ that convolves $V$ before passing the output through a ReLU activation and Dropout (Graph-Conv). The corollary of MaxPooling in a GCN is Graph Embed Pooling, which treats the Graph-Conv output as an embedding matrix that can reduce $V$ and $A$ to a desired number of vertices, optimally representing the graph in reduced dimensions (Graph-Pool).
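A minimal PyTorch sketch of these two operations under our reading of Such et al. (2017); layer structure, initialization, and dropout rate are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphConv(nn.Module):
    """Graph filter H = h0*I + h1*A (learnable taps) applied to node features,
    followed by a learned feature projection, ReLU, and Dropout."""
    def __init__(self, in_dim, out_dim, p_drop=0.5):
        super().__init__()
        self.h0 = nn.Parameter(torch.tensor(1.0))
        self.h1 = nn.Parameter(torch.tensor(1.0))
        self.proj = nn.Linear(in_dim, out_dim)
        self.drop = nn.Dropout(p_drop)

    def forward(self, V, A):
        # V: (B, M, D) node features; A: (B, M, M) scalar edge weights
        I = torch.eye(A.size(-1), device=A.device).expand_as(A)
        H = self.h0 * I + self.h1 * A          # combination of identity and adjacency
        return self.drop(F.relu(self.proj(H @ V))), A

class GraphPool(nn.Module):
    """Graph Embed Pooling: learn a soft assignment of M input vertices to
    M' output vertices, reducing both V and A."""
    def __init__(self, in_dim, out_vertices):
        super().__init__()
        self.embed = nn.Linear(in_dim, out_vertices)

    def forward(self, V, A):
        S = torch.softmax(self.embed(V), dim=-1)   # (B, M, M') soft assignment
        V_out = S.transpose(1, 2) @ V              # (B, M', D)
        A_out = S.transpose(1, 2) @ A @ S          # (B, M', M')
        return V_out, A_out
```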

Node Representations Each image in a cluster is represented by a node, which is composed of CNN features from ResNet-ImageWise, detected object counts (the $o_i$ in Obj-ClusterWise), or the combination of the two. The node representation for any cluster is $V_c \in \mathbb{R}^{M \times D}$, where $M$ is the maximum number of images per cluster and $D$ is the size of the feature vector for each image.

Modeling Connections Between Nodes There is an inherent spatial structure to clusters, as Mapillary images uploaded by users are geo-tagged and captured while driving or walking on roads. We take advantage of this structure by connecting the image nodes. We initialize each edge as the normalized inverse distance between two images in a cluster. That is, let $d_{ij}$ be the spatial distance in kilometers between two images $i$ and $j$ in cluster $c$, and let $d_{\max}$ be the maximum distance between any two images in any one cluster. Then, for $i \neq j$, $A_{ij} = 1 - d_{ij} / d_{\max}$. This way, we construct the matrix using a scalar for each edge, and we get $A_c \in \mathbb{R}^{M \times M}$ for any cluster $c$.
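A small sketch of this edge construction; the exact normalization above is our reading of "normalized inverse distance," and zeroing the diagonal (self-connections are covered by the identity term in the filter) is a further assumption:

```python
import numpy as np

def build_adjacency(dist_km, d_max):
    """dist_km: (M, M) pairwise distances between a cluster's images;
    d_max: maximum distance between any two images in any one cluster.
    Nearby images get edge weights close to 1; the farthest possible pair gets 0."""
    A = 1.0 - np.asarray(dist_km, dtype=float) / d_max
    np.fill_diagonal(A, 0.0)  # self-loops handled by the identity term in H
    return A
```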

4 Experiments

We perform extensive experiments on our dataset consisting of Mapillary images and ground-truth indexes. As our work is the first to utilize Mapillary images for indicator prediction, we build baselines to realistically measure the contribution of our model. We measure performance using classification accuracy and the square of the Pearson correlation coefficient (Pearson's $r^2$) for regression.

Baselines

For each cluster, we predict a local (geographic) average of neighboring clusters as our baseline. This simulates the fact that we often have access to district-level statistics about livelihood indicators. We approximate the district-level values as the mode or mean (in the case of binary classification and regression, respectively) of the $k = 1{,}000$ clusters closest to $c$. In other words, the baseline predicts $\hat{y}_c$ from $\{c_1, \dots, c_k\}$, the subset of clusters sorted by distance to $c$ in increasing order.
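A minimal sketch of this baseline, using planar distance on coordinates as a simplification of great-circle distance (function and argument names are ours):

```python
import numpy as np

def neighbor_baseline(coords, labels, query_idx, k=1000, task="classification"):
    """Predict cluster `query_idx`'s label from its k nearest clusters:
    the mode for classification, the mean for regression."""
    d = np.linalg.norm(coords - coords[query_idx], axis=1)  # planar approximation
    nearest = np.argsort(d)[1:k + 1]   # skip index 0, the query cluster itself
    if task == "classification":
        return np.bincount(labels[nearest].astype(int)).argmax()  # mode
    return labels[nearest].mean()
```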

We also establish Lee et al. (2015) as the previous benchmark. It does not address the exact same task, but it is the most similar to ours as it predicts "geo-attributes," or indicators, from crowd-sourced imagery. The authors source their imagery from Flickr, not Mapillary, and report results at the global level, while we work on India specifically. The work predicts population density, but because it does not use wealth and BMI, we report their results on GDP and infant mortality as the most similar respective tasks.

Implementation Details For ResNet-ImageWise, all images are resized to a fixed input resolution. We train a ResNet34 He et al. (2015) model initialized with weights pretrained on ImageNet Russakovsky et al. (2015) to learn a mapping from an image to $Y$. There is one classification and one regression label for each of the three indicators. We train with a batch size of 128 and a learning rate of 0.001 for 10 epochs.
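A minimal training sketch for ResNet-ImageWise (torchvision ResNet34 with the final layer replaced; the training-loop structure is an assumption consistent with the hyperparameters above):

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet34(pretrained=True)               # ImageNet initialization
model.fc = nn.Linear(model.fc.in_features, 2)          # 2-way head; use 1 for regression
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()                      # MSELoss for regression

def train_epoch(loader):
    """One pass over the image-wise training set (batch size 128 per the paper)."""
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```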

We train different models as part of Obj-ClusterWise, which map the object-count features to the label space. For the classification task, we use a 3-layer Multi-Layer Perceptron (layer size 256, ReLU activation, learning rate of 0.001), a Random Forest (300 trees), Gradient Boosted Decision Trees (300 boosting stages), and k-Nearest Neighbors (k = 3). The same models are used for regression.
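A sketch of this model zoo in scikit-learn; reading "3-layer MLP, layer size 256" as three hidden layers of width 256 is our interpretation:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Regression uses the corresponding *Regressor classes.
models = {
    "mlp": MLPClassifier(hidden_layer_sizes=(256, 256, 256),
                         activation="relu", learning_rate_init=0.001),
    "random_forest": RandomForestClassifier(n_estimators=300),
    "gbdt": GradientBoostingClassifier(n_estimators=300),
    "knn": KNeighborsClassifier(n_neighbors=3),
}

# for name, m in models.items():
#     m.fit(X_train, y_train)   # X: (num_clusters, 66) object-count features
#     print(name, m.score(X_test, y_test))
```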

For the GCN, we feed the image node representations $V_c$ and node connections $A_c$ into two Graph-Conv layers of size 64 (each followed by a ReLU activation), followed by a Graph-Pool of size 32. These operations are followed by another pair of Graph-Conv layers of size 32 and a Graph-Pool of size 8. The resulting output is then fed into a fully connected dense layer of size 256 and a final output layer of size 2 for binary classification or size 1 for regression. We train with a batch size of 256 and a learning rate of 0.0001. We use the Adam optimizer Kingma and Ba (2014) for all experiments in this study.
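A sketch assembling this architecture from the GraphConv and GraphPool sketches in Section 3.2.2 (again our reading, not the released code):

```python
import torch.nn as nn

class ClusterGCN(nn.Module):
    """Two Graph-Conv(64) -> Graph-Pool(32) -> two Graph-Conv(32) ->
    Graph-Pool(8) -> dense 256 -> output head, as described above."""
    def __init__(self, in_dim, num_outputs=2):  # num_outputs=1 for regression
        super().__init__()
        self.conv1 = GraphConv(in_dim, 64)
        self.conv2 = GraphConv(64, 64)
        self.pool1 = GraphPool(64, 32)
        self.conv3 = GraphConv(64, 32)
        self.conv4 = GraphConv(32, 32)
        self.pool2 = GraphPool(32, 8)
        self.head = nn.Sequential(
            nn.Flatten(),                 # 8 vertices x 32 features = 256
            nn.Linear(8 * 32, 256), nn.ReLU(),
            nn.Linear(256, num_outputs))

    def forward(self, V, A):
        V, A = self.conv1(V, A); V, A = self.conv2(V, A)
        V, A = self.pool1(V, A)
        V, A = self.conv3(V, A); V, A = self.conv4(V, A)
        V, A = self.pool2(V, A)
        return self.head(V)   # trained with batch size 256, lr 1e-4, Adam
```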

4.1 Predictions on Livelihood Indicators

Poverty We make improvements on Lee et al. (2015) with higher accuracy on classification. First, we observe that ResNet-ImageWise, which learns the mapping from imagery directly to $Y$, achieves 74.34% accuracy. Obj-ClusterWise performs comparably to ResNet-ImageWise, with the Random Forest obtaining 75.77%. The GCN model with image features further improves performance to 81.06%, potentially because the GCN learns from both visual features and the spatial relationships between images. Wealth does not persist over large areas and can shift between clusters, so the baseline does especially poorly on regression. On the other hand, our approach still achieves $r^2$ scores of 0.52 to 0.54, suggesting the semantic information encoded in object counts and edge connections is helpful.

| Source | Feature | Method | Wealth Acc. | Population Acc. | BMI Acc. |
|---|---|---|---|---|---|
| Flickr | Image | Lee et al. (2015) | 71.33* | 74.43* | 55.88* |
| Mapillary | Image | CNN + Majority Vote | 74.34 | 93.50 | 85.28 |
| Mapillary | Object Counts | MLP (3-layer) | 74.98 | 91.71 | 82.60 |
| Mapillary | Object Counts | Random Forest | 75.77 | 88.99 | 83.49 |
| Mapillary | Object Counts | GBDT | 74.91 | 89.35 | 83.63 |
| Mapillary | Object Counts | kNN (k=3) | 68.69 | 85.78 | 77.98 |
| Mapillary | Object Counts | GCN | 72.05 | 86.63 | 80.13 |
| Mapillary | CNN Representations | GCN | 81.06 | 94.71 | 89.42 |
| Mapillary | CNN Representations + Object Counts | GCN | 80.91 | 94.42 | 89.56 |
| Training Data | None (Baseline) | Random | 50.70 | 50.11 | 51.47 |
| Training Data | Lat/Lon Coords (Baseline) | Avg of Neighbors | 63.57 | 69.06 | 66.17 |
Table 1: Classification results on wealth, population, and BMI prediction in India.
* We consider Lee et al. (2015) to be the most similar work and refer to it as the past benchmark. As noted in Section 4, Lee et al. (2015) made global predictions, not predictions on India only, and we show their results on GDP and infant mortality, the indicators most similar to wealth and BMI, respectively. We establish a new benchmark for predicting indicators from crowd-sourced street-level imagery.
| Source | Feature | Method | Wealth r² | Population r² | BMI r² |
|---|---|---|---|---|---|
| Mapillary | Image | CNN | 0.51 | 0.85 | 0.52 |
| Mapillary | Object Counts | MLP (3-layer) | 0.52 | 0.81 | 0.53 |
| Mapillary | Object Counts | Random Forest | 0.52 | 0.79 | 0.54 |
| Mapillary | Object Counts | GBDT | 0.51 | 0.78 | 0.52 |
| Mapillary | Object Counts | kNN (k=3) | 0.34 | 0.75 | 0.36 |
| Mapillary | Object Counts | GCN | 0.39 | 0.86 | 0.38 |
| Mapillary | CNN Representations | GCN | 0.54 | 0.82 | 0.57 |
| Mapillary | CNN Representations + Object Counts | GCN | 0.53 | 0.89 | 0.56 |
| Training Data | Lat/Lon Coords (Baseline) | Avg of Neighbors | 0.16 | 0.66 | 0.25 |
Table 2: Regression results (Pearson's $r^2$) on wealth, population, and BMI prediction in India. There have been efforts to predict poverty in India or geospatially smaller countries in Africa (see the related work in Section 1). However, to the best of our knowledge, we are the first to present a scalable, interpretable pipeline that makes predictions from only crowdsourced street-level imagery across India.

Population Population density is more geospatially consistent than poverty, so the baselines performed slightly better but still poorly compared to our models. We showed significant improvement over Lee et al. (2015); the CNN achieved 93.50% classification accuracy, the MLP 91.71%, and the GCN 94.71%. We hypothesized that there would be clear visual indicators of population captured by the imagery and detected objects (e.g. the "human-person" class and instances of infrastructure and transportation), and we explore these in the next section. Our models also showed strong performance on regression. The GCN obtained the highest $r^2$ score of 0.89, most likely benefiting from the combination of visual features, object counts, and spatial information.

Women’s BMI BMI is an example of a health-related indicator that may not be obvious or explicit from imagery. Yet, our models were still able to make significant improvements on the baseline and Lee et al. (2015), from which we use infant mortality as the stand-in indicator. Our GCN achieved 89.56% classification accuracy and an $r^2$ of 0.57. While our models performed the non-intuitive task of predicting BMI from imagery, we also note that neither high nor low BMI is desirable. In future experiments, we will consider changing the target from BMI to the difference from a healthy BMI, which the DHS has defined to be between 18.5 and 24.9 DHS Program (2019).

Figure 3: Accuracy vs. # of images per cluster.

4.2 Analysis: Effect of Number of Images

We analyzed how many street-level images are necessary for a cluster to be sufficiently representative. We sampled sets of images, ranging from 50 to 200 (the maximum size) per cluster, to serve as the training set. We then trained an MLP for 100 epochs with a learning rate of 1e-3 and evaluated it on the classification task for all three indicators. Our results are shown in Figure 3. Adding more images increases accuracy, as expected. Surprisingly, a relatively small number of images (roughly 75) per cluster is sufficient to achieve high accuracy on all three tasks.
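A minimal sketch of the subsampling step in this ablation (dictionary layout and step size are illustrative):

```python
import random

def subsample_clusters(cluster_images, n):
    """Keep at most n randomly chosen images per cluster, to measure how
    accuracy varies with the number of images (Figure 3)."""
    return {cid: random.sample(imgs, min(n, len(imgs)))
            for cid, imgs in cluster_images.items()}

# for n in range(50, 201, 25):
#     train_set = subsample_clusters(train_cluster_images, n)
#     ...train the MLP for 100 epochs at lr 1e-3 and record test accuracy
```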

5 Interpretability

Figure 4: Top: Most important features for each indicator when using a random forest model. Bottom: Example images from each class. Left to right: poverty, population, BMI. We used permutation importance as the metric of feature importance because it was shown in Strobl et al. (2007) to be more reliable than mean-decrease-in-impurity importance, which can be highly biased when predictor variables vary in their scale of measurement or number of categories, as in our case.

Interpretability of predictions is an important aspect to consider if estimates from machine learning models are to be adopted by policy and decision makers. Currently, the DHS constructs the wealth index by conducting principal component analysis (PCA) on hand-collected characteristics of each household, including assets (e.g. televisions, bicycles), materials for housing construction, and types of water access DHS (2016). We show it is possible to predict indicators from features visible from the road, offering a much cheaper but still effective method of index estimation. Furthermore, many existing models for predicting livelihood indicators from passively collected data sources Jean et al. (2016); Cadamuro et al. (2019); Sheehan et al. (2019) are accurate but inherently not interpretable. We show which classes of objects are important for a given index and visualize decision trees to help end-user organizations understand the model.

Feature Importance

As shown in Figure 4, objects that signal development and growth, such as cars, traffic lights, street lights, and construction (barrier walls), were salient to poverty prediction. Instances of terrain (e.g. dirt or exposed ground alongside a road) were informative as well, most likely because they indicate a lack of urbanization. For population, the model relied on infrastructure, such as rail tracks and bridges, as well as modes of transportation, e.g. "bicyclist," "motorcycle," and "truck." For women’s BMI, the most important features were streetlights, manholes, and billboards, which include storefronts and advertisements that indicate the existence of services and development.
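A self-contained sketch of computing permutation importance with scikit-learn; the random-count matrix and synthetic label stand in for the real cluster features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Toy stand-in for the real (clusters x 66) object-count matrix.
rng = np.random.default_rng(0)
X = rng.poisson(3.0, size=(500, 66)).astype(float)
y = (X[:, 0] + X[:, 5] > 6).astype(int)   # synthetic "wealth" label

forest = RandomForestClassifier(n_estimators=300).fit(X, y)
# Shuffle one feature column at a time and measure the drop in score.
result = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
top = result.importances_mean.argsort()[::-1][:10]
print("Most important feature indices:", top)
```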

Figure 5: Decision tree visualizations generated using Parr and Grover (2020). Each node displays the object class name and the threshold that determines how to split the node (left child means less than the threshold, right child greater than or equal to the threshold). Poverty classification on the left, regression on the right. See full tree visualizations for all three indicators in the Appendix.
Visualizing Decision Trees

In addition to the important features for each indicator, we visualize decision trees to help end-user organizations understand how the model makes decisions. In Figure 5, we show a single decision tree classifier and regressor with a max depth of 3, visualized using Parr and Grover (2020). At each node, moving to the left means the cluster has fewer than the threshold number of that object (moving to the right means greater than or equal to the threshold). Final predictions are at the leaves, where $n$ is the number of clusters assigned to that leaf. To demonstrate how to use the trees, we trace three paths (a sketch of the visualization code follows the traced paths below).
Path 1: As expected, a cluster with many traffic lights and little terrain (indicating urbanization) will be predicted as low wealth when there are few cars and high wealth when there are many cars.
Path 2: Construction-type barrier walls are salient for clusters in this path, so they may be sites of development and growth for organizations to monitor.
Path 3: The regression tree provides nuance as the model predicts the actual index value. Billboards are informative, which is intuitive as storefronts and advertisements signal economic activity.
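A sketch of producing a Figure 5-style tree, assuming the pre-2.0 dtreeviz API and toy stand-in data for the cluster features:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from dtreeviz.trees import dtreeviz  # pre-2.0 dtreeviz API (assumed version)

# Toy stand-ins: X is the (clusters x 66) object-count matrix, y the labels.
rng = np.random.default_rng(0)
X = rng.poisson(3.0, size=(500, 66)).astype(float)
y = (X[:, 0] > 3).astype(int)
feature_names = [f"class_{i}" for i in range(65)] + ["num_images"]

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
viz = dtreeviz(clf, X, y, feature_names=feature_names,
               target_name="wealth", class_names=["low", "high"])
viz.save("wealth_tree.svg")
```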

6 Conclusion

In this work, we present a novel approach to make predictions on poverty, population, and women's body mass index from street-level imagery. In spite of the inconsistent quality of such a large-scale, crowd-sourced dataset, we achieve strong performance in predicting indicators that have not previously been the focus of machine learning methods, such as women's BMI, a key nutritional indicator. We demonstrate how our method is scalable, making predictions across India, one of the most populous and diverse countries in the world. We present two approaches: (1) cluster-wise learning, which produces representations of clusters from detected object counts and enables interpretability, and (2) a graph-based approach that aims to capture the inherent spatial structure of a cluster by representing the images as feature-rich nodes connected by edges that incorporate spatial distance. We hope that our method can be employed as a cheap but effective alternative to traditional surveying methods for organizations to measure the well-being of developing regions.

References

  • [1] V. O. Andersson, M. A. F. Birck, R. M. Araujo, and C. Cechinel (2017) Towards crime rate prediction through street-level images and siamese convolutional neural networks. In ENIAC - Encontro Nacional de Inteligência Artificial e Computacional.
  • [2] S. M. Arietta, A. A. Efros, R. Ramamoorthi, and M. Agrawala (2014) City forensics: using visual elements to predict non-visual city attributes. IEEE Transactions on Visualization and Computer Graphics 20 (12), pp. 2624–2633.
  • [3] K. Ayush, B. Uzkent, M. Burke, D. Lobell, and S. Ermon (2020) Efficient poverty mapping using deep reinforcement learning. arXiv preprint arXiv:2006.04224.
  • [4] K. Ayush, B. Uzkent, M. Burke, D. Lobell, and S. Ermon (2020) Generating interpretable poverty maps using object detection in satellite images. arXiv preprint arXiv:2002.01612.
  • [5] J. Blumenstock, G. Cadamuro, and R. On (2015) Predicting poverty and wealth from mobile phone metadata. Science 350 (6264), pp. 1073–1076.
  • [6] A. Bruederle and R. Hodler (2018) Nighttime lights as a proxy for human development at the local level. PLoS ONE 13 (9), e0202231.
  • [7] G. Cadamuro, A. Muhebwa, and J. Taneja (2019) Street smarts: measuring intercity road quality using deep learning on satellite imagery. In Proceedings of the 2nd ACM SIGCAS Conference on Computing and Sustainable Societies, pp. 145–154.
  • [8] DHS Program (2019) Demographic and Health Survey (DHS) Program. https://dhsprogram.com/What-We-Do/Survey-Types/DHS.cfm
  • [9] DHS (2016) Wealth index construction. ICF.
  • [10] Mapillary (2019) Equipment for capturing and example imagery.
  • [11] T. Gebru, J. Krause, Y. Wang, D. Chen, J. Deng, E. L. Aiden, and L. Fei-Fei (2017) Using deep learning and Google Street View to estimate the demographic makeup of neighborhoods across the United States. Proceedings of the National Academy of Sciences 114 (50), pp. 13108–13113.
  • [12] C. Hauff (2013) A study on the accuracy of Flickr’s geotag data. pp. 1037–1040.
  • [13] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
  • [15] A. Head, M. Manguin, N. Tran, and J. E. Blumenstock (2017) Can human development be measured with satellite imagery? In ICTD.
  • [16] N. Jean, M. Burke, M. Xie, W. M. Davis, D. B. Lobell, and S. Ermon (2016) Combining satellite imagery and machine learning to predict poverty. Science 353 (6301), pp. 790–794.
  • [17] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [18] S. Law, B. Paige, and C. Russell (2019) Take a look around: using street view and satellite images to estimate house prices. ACM Transactions on Intelligent Systems and Technology 10 (5).
  • [19] S. Lee, H. Zhang, and D. J. Crandall (2015) Predicting geo-informative attributes in large-scale image collections using convolutional neural networks. In 2015 IEEE Winter Conference on Applications of Computer Vision, pp. 550–557.
  • [20] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125.
  • [21] Mapillary (2019) Mapillary developer API.
  • [22] G. Neuhold, T. Ollmann, S. Rota Bulò, and P. Kontschieder (2017) The Mapillary Vistas dataset for semantic understanding of street scenes. In International Conference on Computer Vision (ICCV).
  • [23] S. Pandey, T. Agarwal, and N. C. Krishnan (2018) Multi-task deep learning for predicting poverty from satellite images. In AAAI Conference on Artificial Intelligence.
  • [24] T. Parr and P. Grover (2020) dtreeviz: decision tree visualization. GitHub. https://github.com/parrt/dtreeviz
  • [25] L. Porzi, S. R. Bulò, A. Colovic, and P. Kontschieder (2019) Seamless scene segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [26] U. G. Pulse (2014) Mining Indonesian tweets to understand food price crises. Jakarta: UN Global Pulse.
  • [27] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
  • [28] E. Sheehan, C. Meng, M. Tan, B. Uzkent, N. Jean, D. Lobell, M. Burke, and S. Ermon (2019) Predicting economic development using geolocated Wikipedia articles. arXiv preprint arXiv:1905.01627.
  • [29] A. Signorini, A. M. Segre, and P. M. Polgreen (2011) The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic. PLoS ONE 6 (5), e19467.
  • [30] J. E. Solem (2019) A look back at 2019 and the road to one billion images. Mapillary.
  • [31] C. Strobl, A. Boulesteix, A. Zeileis, and T. Hothorn (2007) Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics 8 (1), pp. 25.
  • [32] F. P. Such, S. Sah, M. Domínguez, S. Pillai, C. Zhang, A. Michael, N. D. Cahill, and R. W. Ptucha (2017) Robust spatial filtering with graph convolutional neural networks. arXiv preprint arXiv:1703.00792.
  • [33] United Nations (2015) Sustainable Development Goals: Sustainable Development Knowledge Platform. https://sustainabledevelopment.un.org/?menu=1300
  • [34] B. Uzkent and S. Ermon (2020) Learning when and where to zoom with deep reinforcement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12345–12354.
  • [35] B. Uzkent, E. Sheehan, C. Meng, Z. Tang, M. Burke, D. Lobell, and S. Ermon (2019) Learning to interpret satellite images using Wikipedia. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 3620–3626.
  • [36] C. Yeh, A. Perez, A. Driscoll, G. Azzari, Z. Tang, D. Lobell, S. Ermon, and M. Burke (2020) Using publicly available satellite imagery and deep learning to understand economic well-being in Africa. Nature Communications 11 (1), pp. 1–11.

Appendix

6.1 Visualizations of Decision Trees

Figure 6: Decision tree visualizations generated using [24]. Each node displays the object class name and the threshold that determines how to split the node (left child means less than the threshold, right child greater than or equal to the threshold). Leaves represent predictions, where $n$ is the number of clusters assigned to that leaf.
Top: Wealth classification tree. The histograms show the feature-space distribution for a single feature, with the colors indicating the relationship between feature space and target class. For example, in the "car" histogram, the yellow bars cluster at the lower end, which is intuitive for low wealth. The histograms get proportionally shorter as the number of clusters reaching the node decreases, and the leaves become smaller as well. [24] motivates the use of pie charts as a quick indication of the strong majority category through the color and size of the slices.
Bottom: Wealth regression tree. For the regressor, the feature-target space is shown with a scatterplot of feature vs. target. Regressor leaves use a strip plot to show the distribution explicitly (the number of dots is the number of clusters assigned to the leaf), and the leaf prediction is the distribution's center of mass, or mean, denoted with a dashed line.
Figure 7: Decision tree visualizations generated using [24]. Population classification on top, regression on bottom.
Figure 8: Decision tree visualizations generated using [24]. Women’s BMI classification on top, regression on bottom.

6.2 Visualizations of Predictions

Here we map our classification predictions across the country using the Obj-ClusterWise model. Correct predictions are green. False positives (predicted "high" for the indicator but actually "low") are marked in red, while false negatives (predicted "low" but actually "high") are colored purple.

Figure 9: Map of predictions for poverty classification.
Figure 10: Map of predictions for population density classification.
Figure 11: Map of predictions for women’s BMI classification.