Efficient Poverty Mapping using Deep Reinforcement Learning

06/07/2020 ∙ by Kumar Ayush, et al. ∙ Stanford University

The combination of high-resolution satellite imagery and machine learning has proven useful in many sustainability-related tasks, including poverty prediction, infrastructure measurement, and forest monitoring. However, the accuracy afforded by high-resolution imagery comes at a cost, as such imagery is extremely expensive to purchase at scale. This creates a substantial hurdle to the efficient scaling and widespread adoption of high-resolution-based approaches. To reduce acquisition costs while maintaining accuracy, we propose a reinforcement learning approach in which free low-resolution imagery is used to dynamically identify where to acquire costly high-resolution images, prior to performing a deep learning task on the high-resolution images. We apply this approach to the task of poverty prediction in Uganda, building on an earlier approach that used object detection to count objects and use these counts to predict poverty. Our approach exceeds previous performance benchmarks on this task while using 80% fewer high-resolution images, enabling application in many sustainability domains that require high-resolution imagery.




1 Introduction

When combined with machine learning, satellite imagery has proven broadly useful for a range of computer vision tasks, including object detection Lam et al. (2018), object tracking Uzkent et al. (2017, 2018), cloud removal Sarukkai et al. (2020), and sustainability-related tasks, from poverty prediction Jean et al. (2016); Ayush et al. (2020); Sheehan et al. (2019); Blumenstock et al. (2015); Yeh et al. (2020) to infrastructure measurement Cadamuro et al. (2018) to forest and water quality monitoring Fisher et al. (2018) to the mapping of informal settlements Mahabir et al. (2018). Compared to coarser (10-30m) publicly-available imagery Drusch et al. (2012), high-resolution (<1m) imagery has proven particularly useful for these tasks because it is often able to resolve specific objects or features that are critical for downstream tasks but undetectable in coarser imagery.

For example, recent work demonstrated an approach for predicting local-level consumption expenditure using object detection on high-resolution daytime satellite imagery Ayush et al. (2020), showing how this approach can yield interpretable predictions and also outperform previous benchmarks that rely on lower-resolution, publicly-available satellite imagery Drusch et al. (2012). This additional information, however, typically comes at a cost, as high-resolution satellite imagery must be purchased from private providers. Additionally, processing high-resolution images is computationally more expensive than processing coarser-resolution ones Uzkent et al. (2019); Zhu et al. (2016); Meng et al. (2017); Lampert et al. (2008); Wojek et al. (2008); Redmon and Farhadi (2017); Gao et al. (2018). Given these costs, deploying these models at scale using high-resolution imagery quickly becomes cost-prohibitive for most organizations and research teams, inhibiting the broader development and deployment of machine-learning-based tools and insights based on these data.

To address this problem, we propose a reinforcement learning approach that uses coarse, freely-available public imagery to dynamically identify where to acquire costly high-resolution images, prior to conducting an object detection task. This concept leverages publicly available Sentinel-2 Drusch et al. (2012) images (10-30m) to sample a smaller number of high-resolution images (<1m). Our framework is inspired by recent studies in the computer vision literature that perform conditional inference to reduce the computational complexity of convolutional networks at test time Uzkent and Ermon (2020); Wu et al. (2018).

We apply our approach to the domain of poverty prediction, and show how it can substantially reduce the cost of previous methods that used deep learning on high-resolution images to predict poverty Ayush et al. (2020), while maintaining or even improving their accuracy. In our study country of Uganda, we show how our approach can reduce the number of high-resolution images needed by 80%, in turn reducing the cost of making a country-wide poverty map by an estimated $2.9 million. We leave the exploration of our cost-aware adaptive framework for other computer vision tasks using high-resolution satellite images to future work.

2 Poverty Mapping from Remote Sensing Imagery

Poverty is typically measured using consumption expenditure, the value of all the goods and services consumed by a household in a given period. A household or individual is said to be poverty-stricken if their measured consumption expenditure falls below a defined threshold (currently $1.90 per capita per day). We focus on this consumption expenditure as our outcome of interest, using "poverty" as shorthand for "consumption expenditure" throughout the paper. While typical household surveys measure consumption expenditure at the household level, publicly available data typically only release geo-coordinate information at the "cluster" level, which is a village in rural areas and a neighborhood in urban areas. Efforts to predict poverty have thus focused on predicting at the cluster level (or more aggregated levels) Ayush et al. (2020).

Earlier work Ayush et al. (2020) demonstrated state-of-the-art results for predicting village-level poverty using high-resolution satellite imagery, and showed how such predictions could be made with an interpretable model. In particular, this work trained an object detector to obtain classwise object counts (buildings, trucks, passenger vehicles, railway vehicles, etc.) in high-resolution images, and then used these counts in a regression model to predict poverty. Not only were these categorical features predictive of poverty, but their counts had clear and intuitive relationships with the outcome of interest. The cost of this accuracy and interpretability was the high-resolution imagery, which typically must be purchased for $10-20 per km² from private providers.

We build on these earlier approaches here. Let {(l^j, y^j)}_{j=1}^{N} be a set of villages surveyed, where l^j is the latitude and longitude coordinates for cluster j, and y^j is the corresponding average poverty index for a particular year. For each cluster j, we can acquire both high-resolution and low-resolution satellite imagery corresponding to the survey year: x_h^j, a W×H image with B channels, and x_l^j, a (W/D)×(H/D) image with B channels. Here D represents a scalar capturing the resolution difference between low-resolution and high-resolution images. Following Ayush et al. (2020), our goal is to learn a regressor to predict the poverty index y^j using x_l^j and only limited informative regions of x_h^j.


Figure 1: Schematic overview of the proposed approach. The Policy Network uses a cheaply available Sentinel-2 low-resolution image representing a cluster to output a set of actions representing unique 1000×1000 px high-resolution tiles in the 34×34 grid. Then object detection is performed on the sampled HR tiles (black regions represent dropped tiles) to obtain the corresponding classwise object counts (N_c-dimensional vectors). Finally, the classwise object-count vectors corresponding to the acquired HR tiles are added element-wise to get the final feature vector representing the cluster. Our reinforcement learning approach dynamically identifies where to acquire high-resolution images, conditioned on cheap, low-resolution data, before performing object detection, whereas the previous work Ayush et al. (2020) exhaustively uses all the HR tiles representing a cluster for poverty mapping, making their method expensive and less practical.

3 Dataset

Socio-economic Data. Our ground truth dataset consists of data on consumption expenditure (poverty) from the Living Standards Measurement Study (LSMS) survey conducted in Uganda by the Uganda Bureau of Statistics between 2011 and 2012 UBOS (2012). The survey consists of data from 2,716 households in Uganda, which are grouped into unique locations called clusters. The latitude and longitude location, l^j, of a cluster is given, with noise of up to 5 km added in each direction by the surveyors to protect respondent privacy. Individual household locations in each cluster are also withheld to preserve anonymity. We have N=320 clusters in the survey, which we use to test the performance of our method in predicting the average poverty index, y^j, of each cluster j. For each cluster, the survey measures the poverty level by the per capita daily consumption in dollars, which we refer to as the "LSMS poverty score" for simplicity, following Ayush et al. (2020). Fig. 1 (bottom left corner) visualizes the surveyed locations on the map along with their corresponding LSMS poverty scores, revealing that a high percentage of surveyed locations have relatively low consumption expenditure values.

Satellite Imagery. We acquire both high-resolution and low-resolution satellite imagery for Uganda. The high-resolution satellite imagery, x_h^j, corresponding to cluster j (roughly, a village or neighborhood) is represented by T=34×34=1156 images of 1000×1000 pixels each, with 3 channels, arranged in a 34×34 square grid. This corresponds to a 10 km × 10 km spatial neighborhood centered at l^j. A large neighborhood is considered to deal with the up to 5 km of random noise in the cluster coordinates added by the survey organization to protect respondent privacy. These high-resolution images come from DigitalGlobe satellites with 3 bands (RGB) and 30 cm pixel resolution. Formally, we represent all the high-resolution images corresponding to cluster j as a sequence of tiles, x_h^j = (x_h^{j,1}, …, x_h^{j,T}).

We also acquire low-resolution satellite imagery: x_l^j, corresponding to cluster j, is represented by a single 1000×1000 px image with 3 channels. These images come from Sentinel-2, with 3 bands (RGB) and 10 m pixel resolution, and are freely available to the public. Each image corresponds to the same 10 km × 10 km spatial neighborhood centered at l^j; however, the resolution is much lower: each Sentinel-2 pixel covers roughly the same ground area as 1000 pixels from the high-resolution imagery. Because of this low resolution, it is not possible to perform fine-grained object detection using these images alone. Fig. 1 illustrates an example cluster from Uganda.
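As a sanity check, the grid geometry implied by these numbers can be verified in a few lines (all values are taken from the text above; the ~10.2 km side slightly exceeds the quoted 10 km × 10 km neighborhood due to rounding):

```python
# Grid geometry for one cluster, using the figures quoted in the text.
HR_GSD_M = 0.3   # DigitalGlobe ground sample distance (meters/pixel)
LR_GSD_M = 10.0  # Sentinel-2 ground sample distance (meters/pixel)
TILE_PX = 1000   # pixels per side of one HR tile
GRID = 34        # HR tiles per side of the cluster grid

hr_tiles = GRID * GRID                              # T = 1156 tiles
cluster_side_km = GRID * TILE_PX * HR_GSD_M / 1000  # ~10.2 km per side
hr_px_per_lr_px = (LR_GSD_M / HR_GSD_M) ** 2        # ~1111 HR px per LR px

print(hr_tiles, round(cluster_side_km, 1), round(hr_px_per_lr_px))
# 1156 10.2 1111
```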

4 Fine-grained Object Detection on High-Resolution Satellite Imagery

Similar to Ayush et al. (2020), we use an intermediate object detection phase to obtain categorical features (classwise object counts) from the high-resolution tiles of a cluster. Due to the lack of object annotations for satellite images from Uganda, we use the same transfer learning strategy as in Ayush et al. (2020), training an object detector (YOLOv3 Redmon and Farhadi (2018)) on xView Lam et al. (2018), one of the largest and most diverse publicly available overhead imagery datasets for object detection, with parent-level and child-level classes. Earlier work Ayush et al. (2020) studied both parent-level and child-level detectors and empirically found that parent-level object detection features are not only better for poverty regression but also more suited for interpretability due to household-level descriptions. Thus, we train the YOLOv3 detector using parent-level classes (see x-axis labels of Fig. 2).

As described in Section 3, each x_h^j representing a cluster is a set of high-resolution images, x_h^j = (x_h^{j,1}, …, x_h^{j,T}). To obtain a baseline model that uses all the high-resolution imagery available, we follow the protocol in Ayush et al. (2020) and run the trained YOLOv3 object detector on each 1000×1000 px tile x_h^{j,i} to get the corresponding set of object detections, where each detection is denoted by a tuple (b_x, b_y, b_w, b_h, o, s): b_x and b_y represent the center coordinates of the bounding box, b_w and b_h represent the width and height of the bounding box, and o and s represent the object class label and class confidence score. Similar to Ayush et al. (2020), we use the detections for each tile i of x_h^j to generate an N_c-dimensional vector c^{j,i} (where N_c=10 is the number of object labels/classes) by counting the number of detected objects in each class. This process results in T N_c-dimensional vectors, which can be aggregated into a single N_c-dimensional categorical feature vector by summing over the tiles: c^j = Σ_{i=1}^{T} c^{j,i}. These classwise object counts can be used in a regression model for poverty estimation Ayush et al. (2020). Ayush et al. (2020) exhaustively uses all T=1156 HR tiles of a cluster for poverty estimation. In contrast, we propose a method that adaptively selects informative regions for high-resolution acquisition, conditioned on the publicly available low-resolution data. We describe our solution in the next section.
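The per-tile counting and cluster-level aggregation described above can be sketched as follows (the detection tuple layout mirrors the text; the helper names and toy values are illustrative):

```python
import numpy as np

N_CLASSES = 10  # parent-level classes, as in the text

def classwise_counts(detections, n_classes=N_CLASSES):
    """Turn one tile's detections, a list of tuples
    (b_x, b_y, b_w, b_h, label, score), into an N_c-dim count vector."""
    counts = np.zeros(n_classes, dtype=np.int64)
    for _, _, _, _, label, _ in detections:
        counts[label] += 1
    return counts

def cluster_feature(per_tile_detections):
    """Element-wise sum of per-tile count vectors over all T tiles."""
    return sum(classwise_counts(d) for d in per_tile_detections)

# Two toy tiles: one with two detections of class 0, one with one of class 3.
tiles = [
    [(0.1, 0.2, 0.05, 0.05, 0, 0.9), (0.5, 0.5, 0.04, 0.06, 0, 0.8)],
    [(0.3, 0.7, 0.02, 0.02, 3, 0.7)],
]
print(cluster_feature(tiles))  # 2 in class 0, 1 in class 3, zeros elsewhere
```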

5 Adaptive Tile Selection

Due to the large acquisition cost of HR images, it is non-trivial and expensive to deploy models based on HR imagery at scale. For this reason, we propose an efficient tile selection framework to capture relevant fine-level information, such as classwise object counts, for downstream tasks. We represent the HR image covering a spatial cluster centered at l^j as x_h ∈ R^{W×H×B}, where W, H and B represent width, height and number of bands. Additionally, we represent the LR image of the same spatial cluster as x_l ∈ R^{(W/D)×(H/D)×B}, where D represents a scalar reduction in the number of pixels in width and height. For example, in the case of Sentinel-2 (10 m GSD), we have roughly 33 times fewer pixels in each dimension than in the high-resolution DigitalGlobe images (0.3 m GSD). With an adaptive approach, our task is to acquire only a small subset of x_h, conditioned on x_l, while not hurting performance in our downstream tasks that use object counts from the cluster. This adaptive method is formulated as a two-step episodic Markov Decision Process (MDP), similar to Uzkent et al. (2020). In the first step, we adaptively sample HR tiles, and in the second step, we run them through a pre-trained detector.

Adaptive Selection. The first module of our framework finds tiles to sample/acquire, conditioned on the low-spatial-resolution image covering a cluster. In this direction, a cluster-level HR image x_h is divided into T equal-size, non-overlapping tiles x_h^1, …, x_h^T. In this setup, we model x_h as a latent variable, as it is not directly observed and must be inferred from the random variable x_l. We associate each tile x_h^i of x_h with an N_c-dimensional classwise object-counts feature c^i. We then model the policy network to choose only tiles with a desirable number of object counts, i.e., acquire x_h^i only if its expected object counts are high enough, where the acquisition decision is made by a policy network trained with a reward function characterized by the user. Similar to x_h, we decompose the random variable x_l as x_l = (x_l^1, …, x_l^T), where x_l^i represents the lower-spatial-resolution version (from Sentinel-2) of x_h^i.

Modeling the Policy Network's Input and Output. In a simple scenario, we could take a single binary action for each tile x_h^i, deciding whether to acquire it conditioned on x_l^i. However, we believe that choosing multiple actions representing different disjoint subtiles of tile x_h^i can help us avoid sampling areas of the tile where there are no objects of interest. In another setup, we could use a large T to get smaller tiles and take a single action per tile; however, this introduces run-time complexity, since we would need to run the policy network more times to cover a cluster. For these reasons, we divide tile x_h^i into S disjoint subtiles, x_h^i = (x_h^{i,1}, …, x_h^{i,S}). In the first step of the MDP, the agent observes x_l^i and outputs a binary action array a^i ∈ {0, 1}^S, where a^{i,s} = 1 represents acquisition of the HR version of the s-th subtile of x_h^i, i.e. x_h^{i,s}. The subtile sampling policy, parameterized by θ_p, is formulated as

    π(a^i | x_l^i; θ_p) = p(a^i | x_l^i; θ_p),

where π is a function mapping the observed LR image to a probability distribution over subtile sampling actions a^i. The joint probability distribution over the random variables x_h, x_l, c, and action a can then be factored as

    p(x_h, x_l, c, a) = p(x_h) p(x_l | x_h) π(a | x_l; θ_p) p(c | x_h, a).
Object Detection. In the second step of the MDP, the agent runs object detection on the selected HR subtiles. Conditioned on a^i, it observes the HR subtiles where necessary and produces ĉ^i, an N_c-dimensional classwise object-counts vector. We find the object counts with our adaptive framework using a pre-trained object detector f_d (parameterized by θ_d) as:

    ĉ^{i,s} = f_d(x_h^{i,s}; θ_d) if a^{i,s} = 1, and ĉ^{i,s} = 0 otherwise.

Then, we compute the tile-level object counts as ĉ^i = Σ_{s=1}^{S} ĉ^{i,s}. Finally, we define our overall objective as:

    max_{θ_p} J(θ_p) = E[R(a, ĉ, c)],

where the reward R depends on a, ĉ, and c. Our goal is to learn the parameters θ_p, given a pre-trained object detector, to maximize this objective as a function of the reward. We detail the reward function in Section 6.

6 Modeling and Optimization of the Policy Network

Modeling the Policy Network. In the previous section, we formulated the task of efficient HR subtile selection, at a high level, as a two-step episodic MDP. In this section, we model how to learn the policy distribution for subtile sampling. In this study we have T=1156 tiles, as we have a 34×34 grid of images, and each tile consists of 1000×1000 pixels. As mentioned in the previous section, we divide each tile into S=4 subtiles of 250×250 pixels each. Similar to Uzkent et al. (2020), we model the action likelihood function of the policy network, π(a | x_l; θ_p), as a product of Bernoulli distributions:

    π(a | x_l; θ_p) = ∏_s p_s^{a_s} (1 − p_s)^{1 − a_s},

where we use a sigmoid function to transform the policy network's logits into probabilities, p = σ(f_p(x_l; θ_p)).
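This factorized Bernoulli sampling can be sketched in numpy (the fixed probability vector stands in for the sigmoid outputs of the policy network and is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

S = 4  # disjoint subtiles per tile, as in the text

def sample_actions(probs, rng):
    """Draw a ~ prod_s Bernoulli(p_s): one bit per subtile
    (1 = acquire the HR subtile, 0 = drop it)."""
    return (rng.random(probs.shape) < probs).astype(np.int64)

def log_prob(probs, actions):
    """log pi(a | x_l) under the factorized Bernoulli policy."""
    eps = 1e-8
    return float(np.sum(actions * np.log(probs + eps)
                        + (1 - actions) * np.log(1.0 - probs + eps)))

# Stand-in for the policy network's output on one tile's LR input.
probs = np.array([0.9, 0.1, 0.8, 0.2])
a = sample_actions(probs, rng)
print(a, log_prob(probs, a))
```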
Optimization of the Policy Network. Next, we detail the optimization procedure for the policy network. The objective function defined above (Eq. 4) is not differentiable w.r.t. the policy network parameters θ_p, because we discretize the continuous action probabilities from the policy network into binary acquire/drop actions. To overcome this, we use a model-free reinforcement learning algorithm, Policy Gradient Sutton and Barto (2018). Our final objective, shown below, includes both the reward function and the action likelihood distribution, and can be differentiated w.r.t. θ_p:

    ∇_{θ_p} J = E[R(a) ∇_{θ_p} log π(a | x_l; θ_p)].

Our objective function relies on mini-batch Monte Carlo sampling to approximate the expectation. Especially in scenarios where we cannot afford large mini-batches, the estimate can oscillate heavily, resulting in large variance. As this can destabilize the optimization, we use the self-critical baseline Rennie et al. (2017), R(â), to reduce the variance:

    ∇_{θ_p} J = E[(R(a) − R(â)) ∇_{θ_p} log π(a | x_l; θ_p)],

where â represents the baseline action vector. To get it, we use the most likely action vector proposed by the policy network: i.e., â_s = 1 if p_s > 0.5 and â_s = 0 otherwise. Finally, in this study we use temperature scaling Sutton and Barto (2018) to adjust the exploration/exploitation trade-off during optimization:

    p̃ = α p + (1 − α)(1 − p).

Setting α to a large value results in sampling from the learned policy, whereas small values lead to sampling from a nearly random policy.
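One self-critical policy-gradient step with temperature scaling can be sketched as follows; this is a minimal numpy stand-in in which the Bernoulli parameters are updated directly rather than through network weights, and the cost-only reward is a placeholder (the paper's reward also scores count-approximation accuracy):

```python
import numpy as np

rng = np.random.default_rng(0)

def temperature_scale(p, alpha):
    """p~ = alpha*p + (1-alpha)*(1-p); alpha near 1 samples the learned
    policy, alpha near 0.5 approaches a uniform random policy."""
    return alpha * p + (1 - alpha) * (1 - p)

def grad_log_bernoulli(p, a):
    """d/dp of log pi(a | p) for a factorized Bernoulli policy."""
    eps = 1e-8
    return a / (p + eps) - (1 - a) / (1 - p + eps)

def reinforce_step(p, reward_fn, alpha=0.8, lr=0.05):
    """One self-critical policy-gradient update on Bernoulli params p."""
    p_t = temperature_scale(p, alpha)
    a = (rng.random(p_t.shape) < p_t).astype(float)  # sampled action
    a_base = (p_t > 0.5).astype(float)               # most likely action
    advantage = reward_fn(a) - reward_fn(a_base)     # self-critical baseline
    # Chain rule through the temperature scaling: d p_t / d p = 2*alpha - 1.
    p = p + lr * advantage * grad_log_bernoulli(p_t, a) * (2 * alpha - 1)
    return np.clip(p, 0.01, 0.99)

# Placeholder reward that only penalizes acquisitions.
reward = lambda a: 1.0 - a.mean()

p = np.full(4, 0.5)
for _ in range(200):
    p = reinforce_step(p, reward)
print(p)  # acquisition probabilities shrink under this cost-only reward
```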

Modeling the Reward Function. The proposed framework uses the policy gradient reinforcement learning algorithm to learn the parameters of the policy network, θ_p, adjusting weights to increase the expected reward. Thus, it is crucial to design a reward function reflecting the desired characteristics of an efficient subtile selection method over a cluster representing a large area. The desired outcome of our adaptive strategy is to reduce the image acquisition cost drastically by sampling a small subset of tiles. Taking this into account, we design a dual reward function that encourages dropping as many subtiles as possible while successfully approximating the classwise object counts. We define R as:

    R = R_acc + σ R_cost,

where R_acc is the object-counts approximation accuracy and R_cost represents the image acquisition cost, with σ as its coefficient. The R_acc term encourages acquiring a subtile when the difference between the object counts from the fixed (acquire-everything) HR subtile sampling policy and the adaptive policy is positive. For the cost component, the reward increases linearly as fewer subtiles are acquired. See appendix for the pseudocode and other implementation details.
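A minimal version of such a dual reward might look like the following; the exact functional form of the accuracy term is not pinned down in this section, so the absolute-error choice below is an illustrative assumption:

```python
import numpy as np

def reward(counts_adaptive, counts_full, actions, sigma=1.0):
    """Dual reward: R = R_acc + sigma * R_cost. R_acc favors matching
    the classwise counts of the acquire-everything policy; R_cost grows
    linearly as fewer subtiles are acquired."""
    # Accuracy term: negative absolute count-approximation error
    # (illustrative choice, not the paper's exact form).
    r_acc = -np.abs(counts_full - counts_adaptive).sum()
    # Cost term: fraction of subtiles dropped.
    r_cost = 1.0 - actions.mean()
    return r_acc + sigma * r_cost

full = np.array([5, 2, 0, 1])    # counts when every subtile is sampled
approx = np.array([5, 1, 0, 1])  # counts under the adaptive policy
acts = np.array([1, 0, 0, 1])    # 2 of 4 subtiles acquired
print(reward(approx, full, acts))  # -1 + 1.0 * 0.5 = -0.5
```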

7 Experiments

Poverty Estimation. Previous work Ayush et al. (2020) exhaustively performed object detection on all T HR tiles representing a cluster to obtain N_c-dimensional vectors c^{j,i}, which are then aggregated into a single N_c-dimensional categorical feature vector c^j by summing over the tiles, i.e. c^j = Σ_i c^{j,i}. This was subsequently used in a regression model to predict the poverty score y^j for cluster j. Using our adaptive method, we instead obtain ĉ^j, an approximate classwise counts vector for cluster j. Following Ayush et al. (2020), we use Gradient Boosting Decision Trees (GBDT) as the regression model to estimate the poverty index y^j given the cluster-level categorical feature vector (classwise object counts), c^j or ĉ^j. We use Pearson's r² to quantify model performance. Invariance under separate changes in scale between the two variables allows Pearson's r² to provide insight into the model's ability to distinguish poverty levels.

Training and Evaluation. We have N=320 clusters in the survey. We divide the dataset into an 80-20 train-test split. We train a GBDT model using object counts features (c^j) based on all HR tiles of the clusters in the train set. We use the clusters in the train set to train the policy network for adaptive tile selection. The trained policy network is then used to acquire informative HR tiles for each test cluster, i.e., for a test cluster j, the policy network selects HR tiles (subsequently used to obtain ĉ^j) conditioned on the low-resolution input representing the cluster. The obtained ĉ^j is then passed through the trained GBDT model to get the poverty score prediction ŷ^j. See appendix for more implementation details.
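The evaluation pipeline described above can be sketched with scikit-learn and scipy; the 320×10 feature matrix below is a random stand-in for the real classwise count vectors, and the synthetic relationship between counts and poverty score is purely illustrative:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins: 320 clusters, each with a 10-dim classwise
# object-count vector, and a poverty score loosely driven by two counts.
X = rng.poisson(5.0, size=(320, 10)).astype(float)
y = 0.3 * X[:, 3] + 0.1 * X[:, 0] + rng.normal(0.0, 0.5, size=320)

# 80-20 train-test split, as in the text.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)

gbdt = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
r, _ = pearsonr(y_te, gbdt.predict(X_te))
print(f"Pearson r^2 = {r ** 2:.2f}")
```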

Baselines and State-of-the-Art Models. We compare our method with the following: (a) No Patch Dropping, where we simply use all the HR tiles to get the classwise object counts features (same as Ayush et al. (2020)); (b) Fixed Policy-X, which samples HR tiles from the center of a cluster; (c) Random Policy-X, which samples HR tiles randomly from a cluster; (d) Stochastic Policy-X, which samples HR tiles where the survival likelihood of a tile decays w.r.t. the Euclidean distance from the cluster center; and (e) Nightlights, where we use nightlight images representing the clusters in Uganda and sample only those HR tiles which have non-zero nighttime light intensities.

Table 1: LSMS poverty score prediction results (Pearson's r²) for various methods. HR Acquisition represents the fraction of HR tiles acquired.

Method                            r²     HR Acquisition
No Dropping Ayush et al. (2020)   0.53   1.00
Fixed-18                          0.43   0.18
Random-25                         0.34   0.25
Stochastic-25                     0.26   0.25
Nightlights                       0.45   0.12
Ours (Dry sea.)                   0.51   0.19
Ours (Wet sea.)                   0.62   0.19

Additionally, since Sentinel-2 imagery is freely available, we perform a comparative analysis of the effect of season on the policy network's ability to approximate classwise object counts. We thus acquired two sets of low-resolution imagery, one from the dry season (Dec-Feb) in Uganda and the other from the wet season (March-May, Sept-Nov), corresponding to the survey year. Seasonality is likely highly relevant in our agrarian setting, where crops are grown during the wet season and much related market activity is highly seasonal. We hypothesize that greenery in low-resolution imagery during the wet season will better indicate which patches might contain useful economic information.


Figure 2: Number of objects missed on an average across clusters for each parent-level class. The colored bars in each subplot from left-right are: Ours (wet season), Ours (dry season), Nightlight, Fixed-18, Random-25, Stochastic-25.


(a) No Dropping


(b) Nightlights


(c) Ours (Dry Season)


(d) Ours (Wet Season)
Figure 3: LSMS poverty score regression results of GBDT.

Quantitative Analysis. Fig. 2 compares the ability of various methods to approximate the classwise object counts. It shows the number of objects missed on average across clusters for each parent class; our method (using wet season imagery) better approximates the "true object counts" (we use object detector predictions on all the HR tiles as a proxy for the true values) compared to the baseline methods and our method using dry season imagery. Table 1 presents the corresponding HR acquisition fractions, revealing that our method identifies informative tiles, leading to a lower HR requirement than the various baselines. Table 1 also shows the results of poverty prediction in Uganda for our proposed method against these baselines and previous benchmarks. Our model (wet season) achieves an r² of 0.62 and substantially outperforms the published state-of-the-art results Ayush et al. (2020) (0.53 r²) while using around 80% fewer satellite images. We similarly outperform the other baselines. A scatter plot of GBDT LSMS poverty score predictions vs. ground truth is shown in Fig. 3: the GBDT model explains a large fraction of the variance using object counts from only the HR tiles sampled by our method, compared to Ayush et al. (2020), which exhaustively uses all HR tiles.

The superior performance of our approach relative to other baselines, and to previous work that uses all tiles, suggests that our model is learning to sample the correct regions in a large image. The previous work Ayush et al. (2020) showed that Trucks had a higher impact on LSMS poverty score prediction, and explained that regions with good transport connectivity tend to have a higher #Trucks. Fig. 4 (d) and (e) present an example highlighting the ability of the policy network (conditioned on wet season imagery) to identify such regions, leading to a more accurate approximation of #Trucks (see Fig. 2) and thus to improved performance.










Figure 4: (a) High-Resolution Satellite Imagery representing a cluster. (b) Sentinel-2 Imagery of the cluster from dry season. (c) Corresponding HR acquisitions when dry-season imagery is input to the Policy Network. (d) Sentinel-2 Imagery of the cluster from wet season. (e) Corresponding HR acquisitions when wet-season imagery is input to the Policy Network.

Analysis based on Season. We observe that the presence of greenery during the wet season allows the policy network to better identify the informative regions containing objects, compared to when it is trained with dry season Sentinel-2 imagery as input. Figure 4 presents an example cluster, where it can be seen that training the policy network using wet season imagery better assists the network in sampling informative tiles, whereas the one trained using dry season imagery misses some important tiles, hindering performance on the downstream task. See Appendix for more visuals.


Figure 5: Trade-off between Pearson's r² and the coefficient of the image acquisition cost (σ). Text accompanying the points represents the HR acquisition fraction.

Performance-Sampling Trade-off. Next, we analyze the trade-off between accuracy (GBDT regression performance) and the HR sampling rate, controlled by the hyperparameter σ in Eq. 13. We intentionally increase/decrease σ to quantify its effect on the policy network. As seen in Fig. 5, the policy network samples fewer HR tiles (0.09) when we increase σ to 2.0, and r² goes down to 0.48. On the other hand, when we set σ to 1.0, we get optimal results in r² at a 0.18 HR acquisition fraction.


(a) No Dropping


(b) Ours (Dry Season)


(c) Ours (Wet Season)
Figure 6: Summary of the effects of all features using SHAP, showing the distribution of the impacts each feature has on the model output. Color represents the feature value (red high, blue low).

Impact on Interpretability. An important contribution of Ayush et al. (2020) was to introduce model interpretability, allowing successful application of such methods in many policy domains. They use Tree SHAP (Tree SHapley Additive exPlanations) Lundberg and Lee (2017), a game-theoretic approach to explaining the output of tree-based models, to explain the effect of individual features on poverty predictions, and show that the presence of trucks appeared to be particularly useful for measuring local-scale poverty. Here, we show that in addition to closely approximating the classwise object counts, our method retains the same findings in terms of interpretability as Ayush et al. (2020). Fig. 6 plots the SHAP values of every feature for every cluster for three different methods. The features are sorted by the sum of SHAP value magnitudes over all samples. It can be seen that under our method, #Trucks still tends to have a higher impact on the model's output. We also observe that the ordering of features in terms of SHAP values is fairly similar between the No Dropping approach Ayush et al. (2020) and our method (using wet season imagery), giving strong evidence that wet season imagery is better suited for such adaptive solutions.

Cost saving. Current pricing for high-resolution (30 cm) three-band (RGB) imagery is $10-20 per km². Given that Uganda is about 240k km² in land area, creating a country-wide poverty map using our method would save roughly $2.9 million at an imagery cost of $15 per km², given that we would only need 20% of the country to be tiled. This represents a potentially large cost saving if our approach is scaled to the country or continent level.
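The savings arithmetic is straightforward to verify from the figures quoted above:

```python
# Back-of-the-envelope cost saving, using the figures quoted in the text.
AREA_KM2 = 240_000        # approximate land area of Uganda, km^2
PRICE_PER_KM2 = 15.0      # mid-range HR imagery price, $/km^2
ACQUIRED_FRACTION = 0.20  # our method acquires ~20% of the tiles

full_cost = AREA_KM2 * PRICE_PER_KM2          # $3.6M to tile everything
saving = full_cost * (1 - ACQUIRED_FRACTION)  # 80% of tiles skipped
print(f"${saving / 1e6:.1f}M saved")  # $2.9M saved
```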

8 Conclusion

In this study, we increase the efficiency of recent methods that predict consumption expenditure using object counts from high-resolution satellite images. To achieve this, we propose a novel reinforcement learning setup to conditionally acquire high-resolution tiles. We design a cost-aware reward function to reflect real-world constraints, i.e. budget and GPU availability, and then train a policy network to approximate object counts in a given location as closely as possible under these constraints. We show that our approach reduces the number of high-resolution images needed by 80% while improving downstream poverty estimation performance relative to multiple other approaches, including a method that exhaustively uses all high-resolution images from a location. Future work includes applying our adaptive method to other sustainability-related computer vision tasks using high-resolution images at large scale.


  • [1] K. Ayush, B. Uzkent, M. Burke, D. Lobell, and S. Ermon (2020) Generating interpretable poverty maps using object detection in satellite images. arXiv preprint arXiv:2002.01612. Cited by: Appendix B, §1, §1, §1, Figure 1, §2, §2, §2, §3, §4, §4, Table 1, §7, §7, §7, §7, §7.
  • [2] J. Blumenstock, G. Cadamuro, and R. On (2015) Predicting poverty and wealth from mobile phone metadata. Science 350 (6264), pp. 1073–1076. Cited by: §1.
  • [3] G. Cadamuro, A. Muhebwa, and J. Taneja (2018) Assigning a grade: accurate measurement of road quality using satellite imagery. arXiv preprint arXiv:1812.01699. Cited by: §1.
  • [4] M. Drusch, U. D. Bello, S. Carlier, O. Colin, V. Fernandez, F. Gascon, B. Hoersch, C. Isola, P. Laberinti, P. Martimort, A. Meygret, F. Spoto, O. Sy, F. Marchese, and P. Bargellini (2012) Sentinel-2: esa’s optical high-resolution mission for gmes operational services. Remote Sensing of Environment 120, pp. 25 – 36. External Links: Document, ISSN 0034-4257, Link Cited by: §1, §1, §1.
  • [5] J. R. Fisher, E. A. Acosta, P. J. Dennedy-Frank, T. Kroeger, and T. M. Boucher (2018) Impact of satellite imagery spatial resolution on land use classification accuracy and modeled water quality. Remote Sensing in Ecology and Conservation 4 (2), pp. 137–149. Cited by: §1.
  • [6] M. Gao, R. Yu, A. Li, V. I. Morariu, and L. S. Davis (2018-06) Dynamic zoom-in network for fast object detection in large images. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, pp. 6926–6935. External Links: Document, ISBN 978-1-5386-6420-9, Link Cited by: §1.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Appendix B.
  • [8] N. Jean, M. Burke, M. Xie, W. M. Davis, D. B. Lobell, and S. Ermon (2016) Combining satellite imagery and machine learning to predict poverty. Science 353 (6301), pp. 790–794. Cited by: §1.
  • [9] D. Lam, R. Kuzma, K. McGee, S. Dooley, M. Laielli, M. Klaric, Y. Bulatov, and B. McCord (2018) Xview: objects in context in overhead imagery. arXiv preprint arXiv:1802.07856. Cited by: §1, §4.
  • [10] C. H. Lampert, M. B. Blaschko, and T. Hofmann (2008) Beyond sliding windows: object localization by efficient subwindow search. In 2008 IEEE conference on computer vision and pattern recognition, pp. 1–8. Cited by: §1.
  • [11] S. M. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 4765–4774. External Links: Link Cited by: §7.
  • [12] R. Mahabir, A. Croitoru, A. T. Crooks, P. Agouris, and A. Stefanidis (2018) A critical review of high and very high-resolution remote sensing approaches for detecting and mapping slums: trends, challenges and emerging opportunities. Urban Science 2 (1), pp. 8. Cited by: §1.
  • [13] Z. Meng, X. Fan, X. Chen, M. Chen, and Y. Tong (2017) Detecting small signs from large images. In 2017 IEEE International Conference on Information Reuse and Integration (IRI), pp. 217–224. Cited by: §1.
  • [14] J. Redmon and A. Farhadi (2017-07) YOLO9000: Better, faster, stronger. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, pp. 6517–6525. External Links: Document, ISBN 9781538604571, Link Cited by: §1.
  • [15] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: Appendix B, §4.
  • [16] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel (2017)

    Self-critical sequence training for image captioning

    In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7008–7024. Cited by: §6.
  • [17] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015-12) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115 (3), pp. 211–252. External Links: Document, ISSN 1573-1405, Link Cited by: Appendix B.
  • [18] V. Sarukkai, A. Jain, B. Uzkent, and S. Ermon (2020) Cloud removal from satellite images using spatiotemporal generator networks. In The IEEE Winter Conference on Applications of Computer Vision, pp. 1796–1805. Cited by: §1.
  • [19] E. Sheehan, C. Meng, M. Tan, B. Uzkent, N. Jean, M. Burke, D. Lobell, and S. Ermon (2019) Predicting economic development using geolocated wikipedia articles. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2698–2706. Cited by: §1.
  • [20] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §6, §6.
  • [21] U. B. o. S. UBOS (2012) Uganda national panel survey 2011/2012. Uganda. Cited by: §3.
  • [22] B. Uzkent and S. Ermon (2020) Learning when and where to zoom with deep reinforcement learning. arXiv preprint arXiv:2003.00425. Cited by: §1.
  • [23] B. Uzkent, A. Rangnekar, and M. J. Hoffman (2018) Tracking in aerial hyperspectral videos using deep kernelized correlation filters. IEEE Transactions on Geoscience and Remote Sensing 57 (1), pp. 449–461. Cited by: §1.
  • [24] B. Uzkent, A. Rangnekar, and M. Hoffman (2017) Aerial vehicle tracking by adaptive fusion of hyperspectral likelihood maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 39–48. Cited by: §1.
  • [25] B. Uzkent, E. Sheehan, C. Meng, Z. Tang, M. Burke, D. Lobell, and S. Ermon (2019) Learning to interpret satellite images in global scale using wikipedia. arXiv preprint arXiv:1905.02506. Cited by: §1.
  • [26] B. Uzkent, C. Yeh, and S. Ermon (2020) Efficient object detection in large images using deep reinforcement learning. In The IEEE Winter Conference on Applications of Computer Vision, pp. 1824–1833. Cited by: §5, §6.
  • [27] C. Wojek, G. Dorkó, A. Schulz, and B. Schiele (2008) Sliding-windows for rapid object class localization: a parallel technique. In Joint Pattern Recognition Symposium, pp. 71–81. Cited by: §1.
  • [28] Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman, and R. Feris (2018) Blockdrop: dynamic inference paths in residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8817–8826. Cited by: §1.
  • [29] C. Yeh, A. Perez, A. Driscoll, G. Azzari, Z. Tang, D. Lobell, S. Ermon, and M. Burke (2020) Using publicly available satellite imagery and deep learning to understand economic well-being in africa. Nature Communications 11 (1), pp. 1–11. Cited by: §1.
  • [30] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, and S. Hu (2016) Traffic-sign detection and classification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2110–2118. Cited by: §1.

Appendix A Pseudocode

Input: low-resolution tiles (x_1, …, x_N), each divided into P subtiles
for i = 1, …, N do
       for j = 1, …, P do
              Sample acquisition action a_{i,j} from the policy network
              if a_{i,j} = 1 then acquire HR subtile j and run the object detector on it
       end for
       Evaluate Reward
end for
Algorithm 1 Pseudo-code for the Proposed Adaptive Algorithm. N and P represent the number of tiles and subtiles.
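The nested loop above can be sketched in plain Python. This is an illustration of the control flow only: `policy`, `detector`, and the stub implementations below are hypothetical stand-ins for the trained policy network and YOLOv3 detector.

```python
def adaptive_acquisition(lr_tiles, policy, detector, n_subtiles):
    """For each low-resolution tile, the policy emits one binary
    acquisition decision per subtile; object counts are computed only
    from the high-resolution subtiles actually purchased."""
    counts, acquired = [], 0
    for tile in lr_tiles:
        actions = policy(tile, n_subtiles)   # binary vector, one per subtile
        tile_count = 0
        for j, a in enumerate(actions):
            if a == 1:                       # acquire this HR subtile
                acquired += 1
                tile_count += detector(tile, j)  # objects found in it
        counts.append(tile_count)
    return counts, acquired

# Stubs to exercise the loop: acquire every other subtile, and pretend
# each acquired subtile contains exactly one object.
policy = lambda tile, n: [j % 2 for j in range(n)]
detector = lambda tile, j: 1
counts, acquired = adaptive_acquisition(["tile_a", "tile_b"], policy, detector, 4)
# counts == [2, 2]; acquired == 4 (half of the 8 subtiles)
```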

Appendix B Implementation Details

Policy Networks.

To parameterize the policy network, we use a 32-layer ResNet He et al. (2016) pretrained on the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) dataset Russakovsky et al. (2015). We train the policy network on 2 NVIDIA 1080Ti GPUs.
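The policy-gradient training behind this network can be illustrated with a toy REINFORCE-style update on per-subtile Bernoulli logits. This is a pure-Python sketch, not the paper's training procedure (which uses the ResNet policy and a self-critical baseline); the reward, the 4-subtile setting, and all names are assumptions.

```python
import math
import random

random.seed(0)

def reinforce_step(logits, reward_fn, lr=0.1):
    """One REINFORCE update on per-subtile Bernoulli acquisition logits.
    For a Bernoulli policy, d log pi(a) / d logit = a - p."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    actions = [1.0 if random.random() < p else 0.0 for p in probs]
    r = reward_fn(actions)
    new_logits = [z + lr * r * (a - p) for z, a, p in zip(logits, actions, probs)]
    return new_logits, r

# Toy cost-aware reward: only subtile 0 contains objects worth acquiring,
# and every acquisition costs 0.1.
reward = lambda a: a[0] - 0.1 * sum(a)

logits = [0.0] * 4
for _ in range(500):
    logits, _ = reinforce_step(logits, reward)
# The policy learns to favor acquiring subtile 0 over the others.
```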

Object Detectors.

Our object detector uses the YOLOv3 architecture Redmon and Farhadi (2018), chosen for its reasonable trade-off between accuracy on small objects and run-time performance. The backbone network, DarkNet-53, is pretrained on ImageNet. Following Ayush et al. (2020), we perform transfer learning by training the detector on the xView dataset and running it on the Uganda HR patches.
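Following Ayush et al. (2020), the detector's outputs feed the downstream poverty model as per-class object counts. A minimal sketch of that aggregation step; the function name, the `(class, confidence)` detection format, and the 0.5 confidence threshold are illustrative assumptions.

```python
from collections import Counter

def count_features(detections, classes, conf_threshold=0.5):
    """Build a per-location object-count feature vector: one count per
    object class, keeping only detections above a confidence threshold.

    detections: list of (class_name, confidence) pairs from the detector.
    classes: fixed ordering of object classes for the feature vector.
    """
    counts = Counter(cls for cls, conf in detections if conf >= conf_threshold)
    return [counts.get(c, 0) for c in classes]

# Example: two confident detections survive the threshold, one is dropped.
feats = count_features(
    [("building", 0.9), ("truck", 0.6), ("building", 0.4)],
    ["building", "truck", "car"],
)
# feats == [1, 1, 0]
```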

Appendix C Visualizations


(a) High-Resolution Satellite Imagery (downsampled by 34 for visualization).


(b) Sentinel-2 imagery for a cluster from the dry season.


(c) Corresponding HR acquisitions when dry-season imagery is input to the Policy Network.


(d) Sentinel-2 imagery for a cluster from the wet season.


(e) Corresponding HR acquisitions when wet-season imagery is input to the Policy Network.
Figure 7: Example 1


(a) High-Resolution Satellite Imagery (downsampled for visualization).


(b) Sentinel-2 imagery for a cluster from the dry season.


(c) Corresponding HR acquisitions when dry-season imagery is input to the Policy Network.


(d) Sentinel-2 imagery for a cluster from the wet season.


(e) Corresponding HR acquisitions when wet-season imagery is input to the Policy Network.
Figure 8: Example 2