Learning When and Where to Zoom with Deep Reinforcement Learning

03/01/2020 ∙ by Burak Uzkent, et al. ∙ Stanford University

While high resolution images contain semantically more useful information than their lower resolution counterparts, processing them is computationally more expensive, and in some applications, e.g. remote sensing, they can be much more expensive to acquire. For these reasons, it is desirable to develop an automatic method to selectively use high resolution data when necessary while maintaining accuracy and reducing acquisition/run-time cost. In this direction, we propose PatchDrop, a reinforcement learning approach to dynamically identify when and where to use/acquire high resolution data conditioned on the paired, cheap, low resolution images. We conduct experiments on the CIFAR10, CIFAR100, ImageNet and fMoW datasets, where we use significantly less high resolution data while maintaining similar accuracy to models which use full high resolution images.




1 Introduction

Deep Neural Networks now achieve state-of-the-art performance in many computer vision tasks, including image recognition [4], object detection [23], and object tracking [18, 44, 47]. However, one drawback is that they require high quality input data to perform well, and their performance drops significantly on degraded inputs, e.g., lower resolution images [59], lower frame rate videos [28], or under distortions [41]. For example, [55] studied the effect of image resolution, and reported a performance drop on CIFAR10 after downsampling images by a factor of 4.

Nevertheless, downsampling is often performed for computational and statistical reasons [61]. Reducing the resolution of the inputs decreases the number of parameters, resulting in reduced computational and memory cost and mitigating overfitting [2]. Therefore, downsampling is often applied to trade off computational and memory gains with accuracy loss [25]. However, the same downsampling level is applied to all the inputs. This strategy can be suboptimal because the amount of information loss (e.g., about a label) depends on the input [7]. Therefore, it would be desirable to build an adaptive system to utilize a minimal amount of high resolution data while preserving accuracy.

Figure 1: Left: shows the performance of the ResNet34 model trained on the fMoW original images and tested on images with dropped patches. The accuracy of the model goes down with the increased number of dropped patches. Right: shows the proposed framework which dynamically drops image patches conditioned on the low resolution images.

In addition to computational and memory savings, an adaptive framework can also benefit application domains where acquiring high resolution data is particularly expensive. A prime example is remote sensing, where acquiring a high resolution (HR) satellite image is significantly more expensive than acquiring its low resolution (LR) counterpart [26, 32, 9]. For example, LR images with 10m-30m spatial resolution captured by Sentinel-1 satellites [8, 36] are publicly and freely available, whereas an HR image with 0.3m spatial resolution captured by DigitalGlobe satellites can cost on the order of 1,000 dollars [6]. This way, we can reduce the cost of deep learning models trained on satellite images for a variety of tasks, e.g., poverty prediction [37], image recognition [49, 38], and object tracking [46, 45, 48]. Similar examples arise in medical and scientific imaging, where acquiring higher quality images can be more expensive or even more harmful to patients [16, 15].

In all these settings, it would be desirable to adaptively acquire only specific parts of the HR input. The challenge, however, is how to perform this selection automatically and efficiently, i.e., minimizing the number of acquired HR patches while retaining accuracy. As expected, naive strategies can be highly suboptimal. For example, randomly dropping patches of HR satellite images from the functional Map of the World (fMoW) [3] dataset will significantly reduce the accuracy of a trained network, as seen in Fig. 1 (left). As such, an adaptive strategy must learn to identify and acquire useful patches [27] to preserve the accuracy of the network.

To address these challenges, we propose PatchDrop, an adaptive data sampling-acquisition scheme which only samples patches of the full HR image that are required for inferring correct decisions, as shown in Fig. 1 (right). PatchDrop uses LR versions of input images to train an agent in a reinforcement learning setting to sample HR patches only if necessary. This way, the agent learns when and where to zoom into parts of the image to sample HR patches. PatchDrop is extremely effective on the functional Map of the World (fMoW) [3] dataset. Surprisingly, we show that we can use only about 5.2 of the 16 patches of each HR image on average (Table 1) without any significant loss of accuracy. Considering this number, we can save on the order of 100,000 dollars when performing a computer vision task using expensive HR satellite images at global scale. We also show that PatchDrop performs well on traditional computer vision benchmarks. On ImageNet, it samples about 7.9 of the 16 patches of each HR image on average (Table 2) with a minimal loss in accuracy. On a different task, we then speed up patch-based CNNs, BagNets [1], by about 2× by using PatchDrop to reduce the number of patches that need to be processed. Finally, leveraging the learned patch sampling policies, we generate hard positive training examples to boost the accuracy of CNNs on ImageNet and fMoW by 2-3%.

2 Related Work

Dynamic Inference with Residual Networks Similarly to DropOut [39], [13] proposed a stochastic layer dropping method when training Residual Networks [11]. The probability of survival linearly decays in the deeper layers, following the hypothesis that low-level features play key roles in correct inference. Similarly, we can decay the likelihood of survival for a patch w.r.t. its distance from the image center, based on the assumption that objects will be dominantly located in the central part of the image. Stochastic layer dropping provides only training time compression. On the other hand, [53, 57] propose reinforcement learning settings to drop the blocks of ResNet at both training and test time conditionally on the input image. Similarly, by replacing layers with patches, we can drop more patches from easy samples while keeping more from ambiguous ones.

Attention Networks Attention methods have been explored to localize semantically important parts of images [52, 31, 51, 42]. [52] proposes a Residual Attention network that replaces the residual identity connections from [11] with residual attention connections. By residually learning feature guiding, they can improve recognition accuracy on different benchmarks. Similarly, [31] proposes a differentiable saliency-based distortion layer to spatially sample input data given a task. They use LR images in the saliency network, which generates a grid highlighting semantically important parts of the image space. The grid is then applied to HR images to magnify the important parts of the image. [22] proposes a perspective-aware scene parsing network that locates small and distant objects. With a two branch (coarse and fovea) network, they produce coarse and fine level segmentation maps and fuse them to generate the final map. [60] adaptively resizes the convolutional patches to improve segmentation of large and small size objects. [24] improves object detectors that use pre-determined fixed anchors by introducing adaptive ones. They divide a region into a fixed number of sub-regions recursively whenever the zoom indicator given by the network is high. Finally, [30] proposes a sequential region proposal network (RPN) to learn object-centric and less scattered proposal boxes for the second stage of the Faster R-CNN [33]. These methods are tailored for certain tasks and condition the attention modules on HR images. On the other hand, we present a general framework and condition it on LR images.

Analyzing Degraded Quality Input Signal There has been a relatively small volume of work on improving CNNs’ performance using degraded quality input signal [20]. [40] uses knowledge distillation to train a student network using degraded input signal and the predictions of a teacher network trained on the paired higher quality signal. Another set of studies [55, 58] proposes a novel method to perform domain adaptation from the HR network to an LR network. [29] pre-trains the LR network using the HR data and finetunes it using the LR data. Other domain adaptation methods focus on person re-identification with LR images [14, 21, 54]. All these methods boost the accuracy of the networks on LR input data; however, they make the assumption that the quality of the input signal is fixed.

3 Problem statement

Figure 2: Our Bayesian Decision influence diagram. The LR images are observed by the agent to sample HR patches. The classifier then observes the agent-sampled HR image together with the LR image to perform prediction. The ultimate goal is to choose an action to sample a masked HR image to maximize the expected utilities considering the accuracy and the cost of using/sampling HR image.

We formulate the PatchDrop framework as a two step episodic Markov Decision Process (MDP), as shown in the influence diagram in Fig. 2. In the diagram, we represent the random variables with a circle, actions with a square, and utilities with a diamond. A high spatial resolution image, $x_h$, is formed by equal size patches with zero overlap, $x_h = (x_h^1, x_h^2, \dots, x_h^P)$, where $P$ represents the number of patches. In contrast with traditional computer vision settings, $x_h$ is latent, i.e., it is not observed by the agent. $y \in \{1, \dots, N\}$ is a categorical random variable representing the (unobserved) label associated with $x_h$, where $N$ is the number of classes. The random variable low spatial resolution image, $x_l$, is the lower resolution version of $x_h$. $x_l$ is initially observed by the agent in order to choose the binary action array, $a_1 \in \{0, 1\}^P$, where $a_1^p = 1$ means that the agent would like to sample the $p$-th HR patch $x_h^p$. We define the patch sampling policy model, parameterized by $\theta_p$, as

$$\pi_1(a_1 \mid x_l; \theta_p) = p(a_1 \mid x_l; \theta_p), \tag{1}$$

where $\pi_1(x_l; \theta_p)$ is a function mapping the observed LR image to a probability distribution over the patch sampling action $a_1$. Next, the random variable masked HR image, $x_h^m$, is formed using $a_1$ and $x_h$, with the masking operation formulated as $x_h^m = x_h \odot a_1$. The first step of the MDP can be modeled with a joint probability distribution over the random variables, $x_h$, $y$, $x_h^m$, and $x_l$, and action $a_1$, as

$$p(x_h, x_h^m, x_l, y, a_1) = p(x_h)\, p(y \mid x_h)\, p(x_l \mid x_h)\, p(a_1 \mid x_l; \theta_p)\, p(x_h^m \mid x_h, a_1). \tag{2}$$

In the second step of the MDP, the agent observes the random variables, $x_h^m$ and $x_l$, and chooses an action $a_2$. We then define the class prediction policy as follows:

$$\pi_2(a_2 \mid x_h^m, x_l; \theta_{cl}) = p(a_2 \mid x_h^m, x_l; \theta_{cl}), \tag{3}$$

where $\pi_2$ represents a classifier network parameterized by $\theta_{cl}$. The overall objective, $J$, is then defined as maximizing the expected utility, $R$, represented by

$$\max_{\theta_p, \theta_{cl}} J(\theta_p, \theta_{cl}) = \mathbb{E}_p\left[R(a_1, a_2, y)\right], \tag{4}$$

where the utility depends on $a_1$, $a_2$, and $y$. The reward penalizes the agent for selecting a large number of high-resolution patches (e.g., based on the norm of $a_1$) and includes a classification loss evaluating the accuracy of $a_2$ given the true label $y$ (e.g., cross-entropy or 0-1 loss).
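The masking step that turns a full HR image and a binary patch action array into a masked HR image can be sketched as follows. This is only an illustration under stated assumptions: the image is a square nested list, the patches form a 4×4 grid in row-major order, and the function name `mask_patches` is hypothetical rather than taken from the paper.

```python
def mask_patches(image, actions, grid=4):
    """Zero out every patch whose action bit is 0.

    image   : 2D list (H x W) of pixel values, H and W divisible by `grid`
    actions : list of grid*grid bits in row-major patch order (assumed layout);
              1 keeps the HR patch, 0 drops it
    """
    height, width = len(image), len(image[0])
    patch_h, patch_w = height // grid, width // grid
    masked = []
    for r in range(height):
        row = []
        for c in range(width):
            # Index of the patch that pixel (r, c) belongs to.
            patch_id = (r // patch_h) * grid + (c // patch_w)
            row.append(image[r][c] if actions[patch_id] else 0)
        masked.append(row)
    return masked


image = [[1] * 8 for _ in range(8)]   # toy 8x8 "HR image" of ones
actions = [0] * 16
actions[5] = 1                        # keep only patch 5 (second row, second column)
masked = mask_patches(image, actions)
kept_pixels = sum(sum(row) for row in masked)
```

With an 8×8 toy image each patch covers 2×2 pixels, so keeping a single patch leaves four non-zero pixels.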

4 Proposed Solution

4.1 Modeling the Policy Network and Classifier

In the previous section, we formulated the task of PatchDrop as a two step episodic MDP similarly to [50]. Here, we detail the action space and how the policy distributions for $a_1$ and $a_2$ are modelled. To represent our discrete action space for $a_1$, we divide the image space into equal size patches with no overlaps, resulting in $P$ patches, as shown in Fig. 3. In this study, we use $P = 16$ regardless of the size of the input image and leave the task of choosing variable size bounding boxes as future work. In the first step of the two step MDP, the policy network, $f_p$, outputs the probabilities for all the actions at once after observing $x_l$. An alternative approach could be in the form of a Bayesian framework where $a_1^p$ is conditioned on $a_1^{1:p-1}$ [7, 30]. However, the proposed concept of outputting all the actions at once provides a more efficient decision making process for patch sampling.

Figure 3: The workflow of the PatchDrop formulated as a two step episodic MDP. The agent chooses actions conditioned on the LR image, and only agent sampled HR patches together with LR images are jointly used by the two-stream classifier network. We note that the LR network can be disconnected from the pipeline to only rely on selected HR patches to perform classification. When disconnecting LR network, the policy network samples more patches to maintain accuracy.

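In this study, we model the action likelihood function of the policy network, $f_p$, by multiplying the probabilities of the individual high-resolution patch selections, represented by patch-specific Bernoulli distributions as follows:

$$\pi_1(a_1 \mid x_l, \theta_p) = \prod_{p=1}^{P} s_p^{a_1^p} (1 - s_p)^{(1 - a_1^p)}, \tag{5}$$

where $s_p$ is the $p$-th entry of the prediction vector $s \in [0, 1]^P$, formulated as

$$s = f_p(x_l; \theta_p). \tag{6}$$

To get probabilistic values, $s_p \in [0, 1]$, we use a sigmoid function on the final layer of the policy network.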
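A minimal sketch of this factorized Bernoulli policy, assuming the per-patch probabilities have already been produced by the policy network. The helper names `sample_actions` and `log_likelihood` are hypothetical, and the probabilities are assumed to lie strictly between 0 and 1 so the logarithms are finite.

```python
import math
import random


def sample_actions(probs, rng):
    """Draw one keep/drop bit per patch from independent Bernoulli distributions."""
    return [1 if rng.random() < p else 0 for p in probs]


def log_likelihood(actions, probs):
    """Log of the product of per-patch Bernoulli probabilities.

    Assumes 0 < p < 1 for every entry of `probs`.
    """
    return sum(
        math.log(p) if a == 1 else math.log(1.0 - p)
        for a, p in zip(actions, probs)
    )


rng = random.Random(0)
probs = [0.9, 0.1, 0.5, 0.5]            # hypothetical sigmoid outputs for 4 patches
actions = sample_actions(probs, rng)    # one sampled binary action array
ll = log_likelihood([1, 0, 1, 0], probs)
```

Because the patches are treated as independent given the LR image, all actions can be sampled in a single pass instead of one at a time.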

The next set of actions, $a_2$, is chosen by the classifier, $f_{cl}$, using the sampled HR image $x_h^m$ and the LR input $x_l$. The upper stream of the classifier, $f_{cl}^h$, uses the sampled HR images, $x_h^m$, whereas the bottom stream, $f_{cl}^l$, uses the LR images, $x_l$, as shown in Fig. 3. Each one outputs probability distributions, $s_{cl}^h$ and $s_{cl}^l$, for class labels using a softmax layer. We then compute the weighted sum of predictions via

$$s_{cl} = \frac{S}{P}\, s_{cl}^h + \left(1 - \frac{S}{P}\right) s_{cl}^l, \tag{7}$$

where $S = |a_1|_1$ represents the number of sampled patches. To form $a_2$, we use the maximally probable class label: i.e., $a_2^j = 1$ if $s_{cl}^j = \max(s_{cl})$ and $a_2^j = 0$ otherwise, where $j$ represents the class index. In this set up, if the policy network samples no HR patch, we completely rely on the LR classifier, and the impact of the HR classifier increases linearly with the number of sampled patches.
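The fusion of the two classifier streams can be illustrated with a short sketch. It assumes a weight equal to the fraction of sampled patches, which matches the statement that the HR classifier's impact grows linearly with the number of sampled patches; the function names and the toy probabilities are hypothetical.

```python
def fuse_predictions(hr_probs, lr_probs, num_sampled, num_patches=16):
    """Blend HR and LR class probabilities by the fraction of sampled patches."""
    weight = num_sampled / num_patches
    return [weight * h + (1.0 - weight) * l for h, l in zip(hr_probs, lr_probs)]


def predict(probs):
    """Index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)


hr = [0.7, 0.2, 0.1]    # toy softmax output of the HR stream
lr = [0.2, 0.5, 0.3]    # toy softmax output of the LR stream
no_patches = fuse_predictions(hr, lr, num_sampled=0)     # relies on LR only
all_patches = fuse_predictions(hr, lr, num_sampled=16)   # relies on HR only
half = fuse_predictions(hr, lr, num_sampled=8)           # equal mix
```

In this toy case the LR stream alone would predict class 1, while sampling half of the patches already shifts the fused prediction to class 0.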

4.2 Training the PatchDrop Network

After defining the two step MDP and modeling the policy and classifier networks, we detail the training procedure of PatchDrop. The goal of training is to learn the optimal parameters $\theta_p$ and $\theta_{cl}$. Because the actions are discrete, we cannot use the reparameterization trick to optimize the objective w.r.t. $\theta_p$. To optimize the parameters $\theta_p$, we need to use model-free reinforcement learning algorithms such as Q-learning [56] and policy gradient [43]. Policy gradient is more suitable in our scenario since the number of unique actions the policy network can choose is $2^P$ and increases exponentially with $P$. For this reason, we use the policy gradient method [43] to optimize the objective w.r.t. $\theta_p$ using

$$\nabla_{\theta_p} J = \mathbb{E}\left[R(a_1, a_2, y)\, \nabla_{\theta_p} \log \pi_{\theta_p}(a_1 \mid x_l)\right]. \tag{8}$$

Averaging across a mini-batch via Monte-Carlo sampling produces an unbiased estimate of the expected value, but with potentially large variance. Since this can lead to an unstable training process [43], we replace $R(a_1, a_2, y)$ in Eq. 8 with the advantage function to reduce the variance:

$$\nabla_{\theta_p} J = \mathbb{E}\left[A \sum_{p=1}^{P} \nabla_{\theta_p} \log\left(s_p a_1^p + (1 - s_p)(1 - a_1^p)\right)\right], \tag{9}$$

$$A(a_1, \hat{a}_1, a_2, \hat{a}_2) = R(a_1, a_2, y) - R(\hat{a}_1, \hat{a}_2, y), \tag{10}$$

where $\hat{a}_1$ and $\hat{a}_2$ represent the baseline action vectors. To get $\hat{a}_1$, we use the most likely action vector proposed by the policy network: i.e., $\hat{a}_1^p = 1$ if $s_p > 0.5$ and $\hat{a}_1^p = 0$ otherwise. The classifier, $f_{cl}$, then observes $x_l$ and $\hat{x}_h^m$, sampled using $\hat{a}_1$, on two branches and outputs the predictions, $\hat{s}_{cl}$, from which we get $\hat{a}_2$: i.e., $\hat{a}_2^j = 1$ if $\hat{s}_{cl}^j = \max(\hat{s}_{cl})$ and $\hat{a}_2^j = 0$ otherwise, where $j$ represents the class index. The advantage function assigns the policy network a positive value only when the action vector sampled from Eq. 5 produces higher reward than the action vector with maximum likelihood, which is known as a self-critical baseline [34].
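The self-critical baseline can be sketched in a few lines. The snippet below assumes a 0.5 threshold for the most likely (greedy) action vector and expresses the update as a surrogate loss whose gradient with respect to the policy parameters would give the advantage-weighted estimate; all names and the toy reward values are hypothetical, and no automatic differentiation is performed here.

```python
import math


def greedy_actions(probs, threshold=0.5):
    """Most likely action vector under independent Bernoulli patch probabilities."""
    return [1 if p > threshold else 0 for p in probs]


def advantage(reward_sampled, reward_baseline):
    """Reward of the sampled actions minus reward of the greedy baseline actions."""
    return reward_sampled - reward_baseline


def surrogate_loss(actions, probs, adv):
    """Negative advantage-weighted log-likelihood of the sampled actions.

    Minimizing this with a gradient-based optimizer follows the
    advantage-weighted policy gradient direction.
    """
    log_prob = sum(
        math.log(p * a + (1.0 - p) * (1.0 - a)) for a, p in zip(actions, probs)
    )
    return -adv * log_prob


probs = [0.8, 0.3, 0.6]                 # toy per-patch probabilities
baseline = greedy_actions(probs)        # greedy baseline action vector
sampled = [1, 0, 0]                     # a sampled action vector that drops one more patch
adv = advantage(reward_sampled=0.9, reward_baseline=0.5)   # toy rewards
loss = surrogate_loss(sampled, probs, adv)
```

A positive advantage makes the sampled actions more likely after an update, while a negative one pushes the policy back toward the greedy choice.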

Finally, in this study we use the temperature scaling method [43] to encourage exploration during training time by bounding the probabilities of the policy network as

$$s_p = \alpha s_p + (1 - \alpha)(1 - s_p), \tag{11}$$

where $\alpha \in [0, 1]$.
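As a hedged illustration of bounding the policy probabilities for exploration, the sketch below assumes the bounding is a convex combination of each probability and its complement, controlled by an exploration parameter, and uses a linear schedule from 0.7 to 0.95 as described in the implementation details. The function names and the number of steps are hypothetical.

```python
def bound_probs(probs, alpha):
    """Pull each probability away from 0 and 1 so both actions stay possible.

    With this assumed form, every output lies in [1 - alpha, alpha] for alpha >= 0.5.
    """
    return [alpha * p + (1.0 - alpha) * (1.0 - p) for p in probs]


def linear_alpha(step, total_steps, start=0.7, end=0.95):
    """Linearly increase the exploration parameter from `start` to `end`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


bounded = bound_probs([0.0, 0.5, 1.0], alpha=0.7)
alpha_first = linear_alpha(0, 100)
alpha_last = linear_alpha(100, 100)
```

Early in training the bounded probabilities stay between 0.3 and 0.7, so the agent keeps trying both keeping and dropping each patch; later the range widens and the policy becomes more decisive.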

Pre-training the Classifier After formulating our reinforcement learning setting for training the policy network, we first pre-train the two branches of $f_{cl}$, $f_{cl}^h$ and $f_{cl}^l$, on $x_h$ and $x_l$. We assume that $x_h$ is observable at training time. The network trained on $x_h$ can perform reasonably (Fig. 1 (left)) when the patches are dropped at test time with a fixed policy, forming $x_h^m$. We then use this observation to pre-train the policy network, $f_p$, to dynamically learn to drop patches while keeping the parameters of $f_{cl}^h$ and $f_{cl}^l$ fixed.

Pre-training the Policy Network (Pt) After training the two streams of the classifier, $f_{cl}$, we pre-train the policy network, $f_p$, using the proposed reinforcement learning setting while fixing the parameters of $f_{cl}$. In this step, we only use $f_{cl}^h$ to estimate the expected reward when learning $\theta_p$. This is because we want to train the policy network to understand which patches contribute most to correct decisions made by the HR image classifier, as shown in Fig. 1 (left).

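 Input: Input($x_l$, $x_h$, $y$)
for each pre-training (Pt) iteration do
       Sample $a_1$ with $f_p$, form $x_h^m$, and predict $a_2$ with $f_{cl}^h$
       Evaluate Reward $R(a_1, a_2, y)$ and update $\theta_p$
end for
for each Ft-1 iteration do
       Jointly Finetune $\theta_p$ and $\theta_{cl}^h$ using $f_{cl}^h$
end for
for each Ft-2 iteration do
       Jointly Finetune $\theta_p$ and $\theta_{cl}^h$ using $f_{cl}^h$ and $f_{cl}^l$
end for
Algorithm 1 PatchDrop Pseudocode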

Finetuning the Agent and HR Classifier (Ft-1) To further boost the accuracy of the policy network, $f_p$, we jointly finetune the policy network and HR classifier, $f_{cl}^h$. This way, the HR classifier can adapt to the sampled images, $x_h^m$, while the policy network learns new policies in line with it. The LR classifier, $f_{cl}^l$, is not included in this step.

Finetuning the Agent and HR Classifier (Ft-2) In the final step of the training stage, we jointly finetune the policy network, $f_p$, and $f_{cl}^h$ with the addition of $f_{cl}^l$ into the classifier $f_{cl}$. This way, the policy network can learn policies to drop further patches in the presence of the LR classifier. We combine the HR and LR classifiers using Eq. 7. Since the input to $f_{cl}^l$ does not change, we keep $\theta_{cl}^l$ fixed and only update $\theta_{cl}^h$ and $\theta_p$. The algorithm for the PatchDrop training stage is shown in Alg. 1. Upon publication, we will release the code to train and test PatchDrop.

5 Experiments

5.1 Experimental Setup

Datasets and Metrics We evaluate PatchDrop on the following datasets: (1) CIFAR10, (2) CIFAR100, (3) ImageNet [4] and (4) functional map of the world (fMoW) [3]. To measure its performance, we use image recognition accuracy and the number of dropped patches (cost).

Implementation Details In CIFAR10/CIFAR100 experiments, we use a ResNet8 for the policy and ResNet32 for the classifier networks. The policy and classifier networks use 8×8px and 32×32px images, respectively. In ImageNet/fMoW, we use a ResNet10 for the policy network and ResNet50 for the classifier. The policy network uses 56×56px images whereas the classifier uses 224×224px images. We initialize the weights of the LR classifier with those of the HR classifier [29] and use the Adam optimizer in all our experiments [17]. Finally, we initially set the exploration/exploitation parameter, $\alpha$, to 0.7 and increase it to 0.95 linearly over time.

Reward Function We choose $R = 1 - \left(\frac{|a_1|_1}{P}\right)^2$ if $y = \hat{y}(a_2)$ and $-\sigma$ otherwise as a reward. Here, $\hat{y}$ and $y$ represent the class predicted by the classifier after the observation of $x_h^m$ and $x_l$, and the true class, respectively. The proposed reward function quadratically increases the reward w.r.t. the number of dropped patches. To adjust the trade-off between accuracy and the number of sampled patches, we introduce $\sigma$; setting it to a large value encourages the agent to sample more patches to preserve accuracy.
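A small sketch of a reward with the properties described above: it grows quadratically as fewer patches are sampled when the prediction is correct, and it is a fixed penalty when the prediction is wrong. The exact functional form (one minus the squared fraction of sampled patches) is an assumption consistent with that description, and the function name `reward` is hypothetical.

```python
def reward(actions, predicted_label, true_label, sigma):
    """Reward for one image given the patch actions and the classifier's prediction.

    actions : list of 0/1 patch decisions (1 = HR patch sampled)
    sigma   : penalty magnitude applied when the prediction is wrong
    """
    sampled_fraction = sum(actions) / len(actions)
    if predicted_label == true_label:
        return 1.0 - sampled_fraction ** 2
    return -sigma


few = [1] * 4 + [0] * 12      # 4 of 16 patches sampled
full = [1] * 16               # all 16 patches sampled
r_few = reward(few, predicted_label=3, true_label=3, sigma=0.5)
r_full = reward(full, predicted_label=3, true_label=3, sigma=0.5)
r_wrong = reward(few, predicted_label=2, true_label=3, sigma=0.5)
```

A larger penalty makes a wrong prediction more costly than the saving from dropping patches, which is why it pushes the agent toward sampling more patches.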

5.2 Baseline and State-of-The-Art Models

No Patch Sampling/No Patch Dropping In this case, we simply train a CNN on LR or HR images with cross-entropy loss without any domain adaptation and test it on LR or HR images. We call them LR-CNN and HR-CNN.

Fixed and Stochastic Patch Dropping We propose two baselines that sample central patches along the horizontal and vertical axes of the image space and call them Fixed-H and Fixed-V. We list the sampling priorities for the patches in this order 5,6,9,10,13,14,1,2,0,3,4,7,8,11,15 for Fixed-H, and 4,5,6,7,8,9,10,11,12,13,14,15,0,1,2,3 for Fixed-V. The patch IDs are shown in Fig. 3. Using a similar hypothesis, we then design a stochastic method that decays the survival likelihood of a patch w.r.t the euclidean distance from the center of the patch to the image center.
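The stochastic baseline can be sketched as follows. The decay shape is not specified in the text, so this example assumes a linear decay of the survival probability with the Euclidean distance between the patch center and the image center, on a 4×4 grid with row-major patch IDs; `survival_probs` and `min_prob` are hypothetical names.

```python
import math
import random


def survival_probs(grid=4, min_prob=0.5):
    """Per-patch survival probability, decaying linearly with distance to the center.

    Distances are measured in patch units; corner patches get `min_prob`.
    """
    center = grid / 2.0
    max_dist = math.hypot(center - 0.5, center - 0.5)   # distance of a corner patch
    probs = []
    for patch_id in range(grid * grid):
        row, col = divmod(patch_id, grid)
        dist = math.hypot(row + 0.5 - center, col + 0.5 - center)
        probs.append(1.0 - (1.0 - min_prob) * dist / max_dist)
    return probs


rng = random.Random(0)
probs = survival_probs()
keep = [1 if rng.random() < p else 0 for p in probs]    # one stochastic drop mask
```

Unlike the learned policy, this mask does not depend on the image content, which is what makes it a useful reference point.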

Figure 4: Policies learned on the fMoW dataset. In columns 5 and 8, Ft-2 model does not sample any HR patches and the LR classifier is used. Ft-1 model samples more patches as it does not utilize LR classifier.

Super-resolution We use SRGAN [19] to learn to upsample LR images and use the SR images in the downstream tasks. This method can only improve accuracy at the price of increased computational complexity, since SR images have the same number of pixels as HR images.

Attention-based Patch Dropping In terms of the state-of-the-art models, we first compare our method to the Spatial Transformer Network (STN) by [31]. We treat their saliency network as the policy network and sample the top activated patches to form masked images for the classifier.

Domain Adaptation Finally, we use two of the state-of-the-art domain adaptation methods by [55, 40] to improve recognition accuracy on LR images. These methods are based on Partially Coupled Networks (PCN), and Knowledge Distillation (KD) [12].

The LR-CNN, HR-CNN, PCN, KD, and SRGAN are standalone models and always use the full LR or HR image. For this reason, they have the same values in the Pt, Ft-1, and Ft-2 steps, and we show them in the upper part of the tables.

5.3 Experiments on fMoW

One application domain of PatchDrop is remote sensing, where LR images are significantly cheaper than HR images. In this direction, we test PatchDrop on functional Map of the World [3], consisting of HR satellite images. We use 350,000, 50,000 and 50,000 images as training, validation and test images. After training the classifiers, we pre-train the policy network for 63 epochs with a learning rate of 1e-4 and batch size of 1024. Next, we finetune (Ft-1 and Ft-2) the policy network and HR classifiers with a learning rate of 1e-4 and batch size of 128. Finally, we set $\sigma$ to 0.5, 20, and 20 in the pre-training and the two fine-tuning steps, respectively.

Model         Pt Acc. (%)   Pt S   Ft-1 Acc. (%)   Ft-1 S   Ft-2 Acc. (%)   Ft-2 S
LR-CNN        61.4          0      61.4            0        61.4            0
SRGAN [19]    62.3          0      62.3            0        62.3            0
KD [40]       63.1          0      63.1            0        63.1            0
PCN [55]      63.5          0      63.5            0        63.5            0
HR-CNN        67.3          16     67.3            16       67.3            16
Fixed-H       47.7          7      63.3            6        64.9            6
Fixed-V       48.3          7      63.2            6        64.7            6
Stochastic    29.1          7      57.1            6        63.6            6
STN [31]      46.5          7      61.8            6        64.8            6
PatchDrop     53.4          7      67.1            5.9      68.3            5.2
Table 1: The performance of the proposed PatchDrop and baseline models on the fMoW dataset. $S$ represents the average number of sampled patches. Pt represents the pre-training step; Ft-1 and Ft-2 represent the finetuning steps with single and two stream classifiers.
Figure 5: Left: The accuracy and number of sampled patches by the policy network w.r.t. the downsampling ratio used to get LR images for the policy network and classifier. Right: The accuracy and number of sampled patches w.r.t. the $\sigma$ parameter in the reward function in the joint finetuning steps (downsampling ratio = 4). $\sigma$ is set to 0.5 in the pre-training step.

As seen in Table 1, PatchDrop samples only about 5.2 of the 16 patches of each HR image on average (Ft-2) while increasing the accuracy of the network using the full HR images from 67.3% to 68.3%. Fig. 4 shows some examples of how the policy network chooses actions conditioned on the LR images. When the image contains a field with uniform texture, the agent samples a small number of patches, as seen in columns 5, 8, 9 and 10. On the other hand, it samples patches from the buildings when the ground truth class represents a building, as seen in columns 1, 6, 12, and 13.

Figure 6: Policies learned on ImageNet. In columns 3 and 8, Ft-2 model does not sample any HR patches and the LR classifier is used. Ft-1 model samples more patches as it does not use the LR classifier.

Also, we perform experiments with different downsampling ratios and $\sigma$ values in the reward function. This way, we can observe the trade-off between the number of sampled patches and accuracy. As seen in Fig. 5, as we increase the downsampling ratio, we zoom into more patches to maintain accuracy. Similarly, with increasing $\sigma$, we zoom into more patches, as a larger $\sigma$ value penalizes the policies resulting in unsuccessful classification more heavily.

              CIFAR10 Acc. (%)                    CIFAR100 Acc. (%)                   ImageNet Acc. (%)
Model         Pt     Ft-1   Ft-2   S              Pt     Ft-1   Ft-2   S              Pt     Ft-1   Ft-2   S
LR-CNN        75.8   75.8   75.8   0,0,0          55.1   55.1   55.1   0,0,0          58.1   58.1   58.1   0,0,0
SRGAN [19]    78.8   78.8   78.8   0,0,0          56.1   56.1   56.1   0,0,0          63.1   63.1   63.1   0,0,0
KD [40]       81.8   81.8   81.8   0,0,0          61.1   61.1   61.1   0,0,0          62.4   62.4   62.4   0,0,0
PCN [55]      83.3   83.3   83.3   0,0,0          62.6   62.6   62.6   0,0,0          63.9   63.9   63.9   0,0,0
HR-CNN        92.3   92.3   92.3   16,16,16       69.3   69.3   69.3   16,16,16       76.5   76.5   76.5   16,16,16
Fixed-H       71.2   83.8   85.2   9,8,7          48.5   65.8   67.0   9,10,10        48.8   68.6   70.4   10,9,8
Fixed-V       64.7   83.4   85.1   9,8,7          46.2   65.5   67.2   9,10,10        48.4   68.4   70.8   10,9,8
Stochastic    40.6   82.1   83.7   9,8,7          27.6   63.2   64.8   9,10,10        38.6   66.2   68.4   10,9,8
STN [31]      66.9   85.2   87.1   9,8,7          41.1   64.3   66.4   9,10,10        58.6   69.4   71.4   10,9,8
PatchDrop     80.6   91.9   91.5   8.5,7.9,6.9    57.3   69.3   70.4   9,9.9,9.1      60.2   74.9   76.0   10.1,9.1,7.9
Table 2: The results on CIFAR10, CIFAR100 and ImageNet datasets. $S$ represents the average number of sampled patches per image, listed for the Pt, Ft-1 and Ft-2 steps in that order. Pt, Ft-1 and Ft-2 represent the pre-training step and the finetuning steps with single and two stream classifiers.

Experiments on CIFAR10/CIFAR100 Although the CIFAR datasets already consist of LR images, we believe that conducting experiments on standard benchmarks is useful to characterize the model. For CIFAR10, after training the classifiers, we pre-train the policy network with a batch size of 1024 and learning rate of 1e-4 for 3400 epochs. In the joint finetuning stages, we keep the learning rate, reduce the batch size to 256, and train the policy and HR classifier networks for 1680 and 990 epochs, respectively. $\sigma$ is set to -0.5 in the pre-training stage and -5 in the joint finetuning stages, whereas $\alpha$ is tuned to 0.8. Our CIFAR100 methods are similar to the CIFAR10 ones, including hyper-parameters.

As seen in Table 2, PatchDrop drops about 9 of the 16 patches in the original image space in CIFAR10 (6.9 patches are sampled on average with the Ft-2 model), with minimal loss in the overall accuracy. In the case of CIFAR100, we observe that it samples 2.2 patches more than in the CIFAR10 experiment, on average, which might be due to the higher complexity of the CIFAR100 dataset.

Experiments on ImageNet Next, we test PatchDrop on the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) [35]. It contains 1.2 million, 50,000, and 150,000 training, validation and test images. For augmentation, we randomly crop a 224×224px area from the 256×256px images and perform horizontal flip augmentation. After training the classifiers, we pre-train the policy network for 95 epochs with a learning rate of 1e-4 and batch size of 1024. We then perform the first fine-tuning stage and jointly finetune the HR classifier and policy network for 51 epochs with a learning rate of 1e-4 and batch size of 128. Finally, we add the LR classifier and jointly finetune the policy network and HR classifier for 10 epochs with the same learning rate and batch size. We set $\sigma$ to 0.1, 10, and 10 for the pre-training and the two fine-tuning steps, respectively.

As seen in Table 2, we can approximately maintain the accuracy of the HR classifier while sampling only 9.1 and 7.9 of the 16 patches on average with the Ft-1 and Ft-2 models, respectively. Also, we show the learned policies on ImageNet in Fig. 6. The policy network decides to sample no patch when the input is relatively easy, as in columns 3 and 8.

Analyzing Policy Network’s Actions To better understand the sampling actions of the policy network, we visualize the accuracy of the classifier w.r.t. the number of sampled patches, as shown in Fig. 7 (left). Interestingly, the accuracy of the classifier decreases as the number of sampled patches increases. We believe that this occurs because the policy network samples more patches from the challenging and ambiguous cases to ensure that the classifier successfully predicts the label. On the other hand, it successfully learns when to sample no patches. However, it samples no patch ($S=0$) less often than it samples $4 \le S \le 7$ patches. Increasing the ratio for $S=0$ is left as future work. Finally, Fig. 7 (right) displays the probability of sampling a patch given its position. We see that the policy network learns to sample the central patches more than the peripheral patches, as expected.

Figure 7: Left: The accuracy w.r.t the average number of sampled patches by the policy network. Right: Sampling probability of the patch IDs (See Fig. 3 for IDs).

6 Improving Run-time Complexity of BagNets

Figure 8: Dynamic BagNet. The policy network processes the LR image and samples HR patches to be processed independently by a CNN. More details on BagNet can be found in [1].

Model                        Pt Acc. (%)   Pt S   Ft-1 Acc. (%)   Ft-1 S   Run-time. (%)
BagNet (No Patch Drop) [1]   85.6          16     85.6            16       192
CNN (No Patch Drop)          92.3          16     92.3            16       77
Fixed-H                      67.7          10     86.3            9        98
Fixed-V                      68.3          10     86.2            9        98
Stochastic                   49.1          10     83.1            9        98
STN [31]                     67.5          10     86.8            9        112
BagNet (PatchDrop)           77.4          9.5    92.7            8.5      98
Table 3: The performance of PatchDrop and other models on improving BagNet on the CIFAR10 dataset. We use a similar set up to our previous CIFAR10 experiments.

Previously, we tested PatchDrop on the fMoW satellite image recognition task to reduce the financial cost of analyzing satellite images by reducing dependency on HR images while preserving accuracy. Next, we propose the use of PatchDrop to decrease the run-time complexity of local CNNs, such as BagNets. They have recently been proposed as a novel image recognition architecture [1]. They run a CNN on image patches independently and sum up class-specific spatial probabilities. Surprisingly, BagNets perform similarly to CNNs that process the full image in one shot. This concept fits PatchDrop well, as PatchDrop learns to select semantically useful local patches which can be fed to a BagNet. This way, the BagNet is not trained on all the patches from the image but only on useful patches. By dropping redundant patches, we can then speed it up and improve its accuracy. In this case, we first train the BagNet on all the patches and pre-train the policy network on LR images (4× downsampled) to learn which patches are important for BagNet. By using LR images and a shallow network (ResNet8), we keep the run-time overhead introduced by the agent low relative to the CNN (ResNet32) using HR images. Finally, we jointly finetune (Ft-1) the policy network and BagNet. We illustrate the proposed conditional BagNet in Fig. 8.
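The way a patch-based classifier can skip dropped patches is sketched below. It assumes that per-patch class scores are summed only over the patches the policy keeps; in practice the dropped patches would never be passed through the CNN at all, which is where the run-time saving comes from. The function name and the toy scores are hypothetical.

```python
def bagnet_predict(patch_scores, actions):
    """Sum class scores over kept patches and return (totals, predicted class).

    patch_scores : list of per-patch class score lists (one list per patch)
    actions      : list of 0/1 decisions, 1 = patch is processed
    """
    num_classes = len(patch_scores[0])
    totals = [0.0] * num_classes
    for scores, keep in zip(patch_scores, actions):
        if keep:    # dropped patches contribute nothing and cost nothing
            for j, score in enumerate(scores):
                totals[j] += score
    predicted = max(range(num_classes), key=totals.__getitem__)
    return totals, predicted


patch_scores = [[0.6, 0.4], [0.1, 0.9], [0.2, 0.8]]   # toy scores: 3 patches, 2 classes
totals, predicted = bagnet_predict(patch_scores, actions=[1, 0, 1])
num_processed = sum([1, 0, 1])
```

Here only two of the three patches are processed, and the summed evidence still favors class 1.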

We perform experiments on CIFAR10 and show the results in Table 3. The proposed Conditional BagNet using PatchDrop improves the accuracy of BagNet from 85.6% to 92.7%, closing the gap between global CNNs and local CNNs. Additionally, it roughly halves the run-time (from 192 to 98 in Table 3), significantly reducing the gap between local CNNs and global CNNs in terms of run-time complexity. (The run-times are measured on an Intel i7-7700K CPU @ 4.20GHz.) The speed can be further improved by running different GPUs on the selected patches in parallel at test time.

Finally, utilizing learned masks to avoid convolutional operations in the layers of global CNN is another promising direction of our work. [10] drops spatial blocks of the feature maps of CNNs in training time to perform stronger regularization than DropOut [39]. Our method, on the other hand, can drop blocks of the feature maps dynamically in both training and test time.

Method        CIFAR10 (%)   CIFAR100 (%)   ImageNet (%)   fMoW (%)
No Augment.   92.3          69.3           76.5           67.3
CutOut [5]    93.5          70.4           76.5           67.6
PatchDrop     93.9          71.0           78.1           69.6
Table 4: Results with different augmentation methods.

7 Conditional Hard Positive Sampling

PatchDrop can also be used to generate hard positives for data augmentation. In this direction, we utilize the masked images, $x_h^m$, learned by the policy network (Ft-1) to generate hard positive examples to better train classifiers. To generate conditional hard positive examples, we choose the number of patches to be masked, $M$, from a uniform distribution with minimum and maximum values of 1 and 4. Next, given the patch probabilities $s_p$ output by the policy network, we choose the $M$ patches with the highest probabilities, mask them, and use the masked images to train the classifier. Finally, we compare our approach to CutOut [5], which randomly cuts/masks image patches for data augmentation. As shown in Table 4, our approach leads to higher accuracy on all the datasets when using original images, $x_h$, at test time. This shows that the policy network learns to select informative patches.
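The hard positive generation step can be sketched as follows, assuming the policy network's per-patch probabilities are available as a list. The number of masked patches is drawn uniformly between 1 and 4 as described above; the function name and the toy probabilities are hypothetical.

```python
import random


def hard_positive_actions(probs, rng, min_masked=1, max_masked=4):
    """Mask the patches the policy considers most informative.

    Returns a 0/1 list where 0 marks a masked patch, so the classifier must
    learn from the remaining, less informative patches.
    """
    num_masked = rng.randint(min_masked, max_masked)   # inclusive on both ends
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    to_mask = set(ranked[:num_masked])
    return [0 if i in to_mask else 1 for i in range(len(probs))]


rng = random.Random(0)
probs = [0.10, 0.20, 0.95, 0.30, 0.40, 0.50, 0.60, 0.70]   # toy patch probabilities
keep = hard_positive_actions(probs, rng)
num_masked = keep.count(0)
```

In contrast to random masking, the masked patches here depend on the input image through the policy probabilities.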

8 Conclusion

In this study, we proposed a novel reinforcement learning setting to train a policy network to learn when and where to sample high resolution patches conditioned on the low resolution images. Our method can be highly beneficial in domains such as remote sensing, where high quality data is significantly more expensive than its low resolution counterpart. In our experiments, on average, we drop 40-60% of each high resolution image while preserving similar accuracy to networks which use full high resolution images on ImageNet and fMoW. Also, our method significantly improves the run-time efficiency and accuracy of BagNet, a patch-based CNN. Finally, we used the learned policies to generate hard positives to boost classifiers' accuracy on the CIFAR, ImageNet and fMoW datasets.


References

  • [1] W. Brendel and M. Bethge (2019) Approximating cnns with bag-of-local-features models works surprisingly well on imagenet. arXiv preprint arXiv:1904.00760. Cited by: §1, Figure 8, Table 3, §6.
  • [2] P. Chrabaszcz, I. Loshchilov, and F. Hutter (2017) A downsampled variant of imagenet as an alternative to the cifar datasets. arXiv preprint arXiv:1707.08819. Cited by: §1.
  • [3] G. Christie, N. Fendley, J. Wilson, and R. Mukherjee (2018) Functional map of the world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180. Cited by: §1, §1, §5.1, §5.3.
  • [4] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §1, §5.1.
  • [5] T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552. Cited by: Table 4, §7.
  • [6] J. R. Fisher, E. A. Acosta, P. J. Dennedy-Frank, T. Kroeger, and T. M. Boucher (2018) Impact of satellite imagery spatial resolution on land use classification accuracy and modeled water quality. Remote Sensing in Ecology and Conservation 4 (2), pp. 137–149. Cited by: §1.
  • [7] M. Gao, R. Yu, A. Li, V. I. Morariu, and L. S. Davis (2018) Dynamic zoom-in network for fast object detection in large images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6926–6935. Cited by: §1, §4.1.
  • [8] D. Geudtner, R. Torres, P. Snoeij, M. Davidson, and B. Rommen (2014) Sentinel-1 system capabilities and applications. In 2014 IEEE Geoscience and Remote Sensing Symposium, pp. 1457–1460. Cited by: §1.
  • [9] P. Ghamisi and N. Yokoya (2018) Img2dsm: height simulation from single imagery using conditional generative adversarial net. IEEE Geoscience and Remote Sensing Letters 15 (5), pp. 794–798. Cited by: §1.
  • [10] G. Ghiasi, T. Lin, and Q. V. Le (2018) Dropblock: a regularization method for convolutional networks. In Advances in Neural Information Processing Systems, pp. 10727–10737. Cited by: §6.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2, §2.
  • [12] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §5.2.
  • [13] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger (2016) Deep networks with stochastic depth. In European conference on computer vision, pp. 646–661. Cited by: §2.
  • [14] J. Jiao, W. Zheng, A. Wu, X. Zhu, and S. Gong (2018) Deep low-resolution person re-identification. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2.
  • [15] E. Kang, J. Min, and J. C. Ye (2017) A deep convolutional neural network using directional wavelets for low-dose x-ray ct reconstruction. Medical physics 44 (10), pp. e360–e375. Cited by: §1.
  • [16] J. Ker, L. Wang, J. Rao, and T. Lim (2017) Deep learning applications in medical image analysis. IEEE Access 6, pp. 9375–9389. Cited by: §1.
  • [17] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.1.
  • [18] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Cehovin Zajc, T. Vojir, G. Bhat, A. Lukezic, A. Eldesokey, et al. (2018) The sixth visual object tracking vot2018 challenge results. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 0–0. Cited by: §1.
  • [19] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4681–4690. Cited by: §5.2, Table 1, Table 2, Table 3.
  • [20] P. Li, L. Prieto, D. Mery, and P. J. Flynn (2019) On low-resolution face recognition in the wild: comparisons and new techniques. IEEE Transactions on Information Forensics and Security 14 (8), pp. 2000–2012. Cited by: §2.
  • [21] X. Li, W. Zheng, X. Wang, T. Xiang, and S. Gong (2015) Multi-scale learning for low-resolution person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3765–3773. Cited by: §2.
  • [22] X. Li, Z. Jie, W. Wang, C. Liu, J. Yang, X. Shen, Z. Lin, Q. Chen, S. Yan, and J. Feng (2017) Foveanet: perspective-aware urban scene parsing. In Proceedings of the IEEE International Conference on Computer Vision, pp. 784–792. Cited by: §2.
  • [23] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §1.
  • [24] Y. Lu, T. Javidi, and S. Lazebnik (2016) Adaptive object detection using adjacency and zoom prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2351–2359. Cited by: §2.
  • [25] N. Ma, X. Zhang, H. Zheng, and J. Sun (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 116–131. Cited by: §1.
  • [26] K. Malarvizhi, S. V. Kumar, and P. Porchelvan (2016) Use of high resolution google earth satellite imagery in landuse map preparation for urban related applications. Procedia Technology 24, pp. 1835–1842. Cited by: §1.
  • [27] S. Minut and S. Mahadevan (2001) A reinforcement learning model of selective visual attention. In Proceedings of the fifth international conference on Autonomous agents, pp. 457–464. Cited by: §1.
  • [28] M. Mueller, N. Smith, and B. Ghanem (2017) Context-aware correlation filter tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1396–1404. Cited by: §1.
  • [29] X. Peng, J. Hoffman, X. Y. Stella, and K. Saenko (2016) Fine-to-coarse knowledge transfer for low-res image classification. In 2016 IEEE International Conference on Image Processing (ICIP), pp. 3683–3687. Cited by: §2, §5.1.
  • [30] A. Pirinen and C. Sminchisescu (2018) Deep reinforcement learning of region proposal networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6945–6954. Cited by: §2, §4.1.
  • [31] A. Recasens, P. Kellnhofer, S. Stent, W. Matusik, and A. Torralba (2018) Learning to zoom: a saliency-based sampling layer for neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 51–66. Cited by: §2, §5.2, Table 1, Table 2.
  • [32] F. Rembold, C. Atzberger, I. Savin, and O. Rojas (2013) Using low resolution satellite imagery for yield prediction and yield anomaly detection. Remote Sensing 5 (4), pp. 1704–1733. Cited by: §1.
  • [33] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §2.
  • [34] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel (2017) Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7008–7024. Cited by: §4.2.
  • [35] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §5.3.
  • [36] V. Sarukkai, A. Jain, B. Uzkent, and S. Ermon (2020) Cloud removal from satellite images using spatiotemporal generator networks. In The IEEE Winter Conference on Applications of Computer Vision, pp. 1796–1805. Cited by: §1.
  • [37] E. Sheehan, C. Meng, M. Tan, B. Uzkent, N. Jean, M. Burke, D. Lobell, and S. Ermon (2019) Predicting economic development using geolocated wikipedia articles. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2698–2706. Cited by: §1.
  • [38] E. Sheehan, B. Uzkent, C. Meng, Z. Tang, M. Burke, D. Lobell, and S. Ermon (2018) Learning to interpret satellite images using wikipedia. arXiv preprint arXiv:1809.10236. Cited by: §1.
  • [39] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §2, §6.
  • [40] J. Su and S. Maji (2016) Adapting models to signal degradation using distillation. arXiv preprint arXiv:1604.00433. Cited by: §2, §5.2, Table 1, Table 2.
  • [41] J. Sun, W. Cao, Z. Xu, and J. Ponce (2015) Learning a convolutional neural network for non-uniform motion blur removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 769–777. Cited by: §1.
  • [42] M. Sun, Y. Yuan, F. Zhou, and E. Ding (2018) Multi-attention multi-class constraint for fine-grained image recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 805–821. Cited by: §2.
  • [43] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §4.2, §4.2.
  • [44] B. Uzkent, M. J. Hoffman, A. Vodacek, and B. Chen (2014) Feature matching with an adaptive optical sensor in a ground target tracking system. IEEE Sensors Journal 15 (1), pp. 510–519. Cited by: §1.
  • [45] B. Uzkent, M. J. Hoffman, and A. Vodacek (2016) Integrating hyperspectral likelihoods in a multidimensional assignment algorithm for aerial vehicle tracking. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 9 (9), pp. 4325–4333. Cited by: §1.
  • [46] B. Uzkent, M. J. Hoffman, and A. Vodacek (2016) Real-time vehicle tracking in aerial video using hyperspectral features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 36–44. Cited by: §1.
  • [47] B. Uzkent, A. Rangnekar, and M. J. Hoffman (2018) Tracking in aerial hyperspectral videos using deep kernelized correlation filters. IEEE Transactions on Geoscience and Remote Sensing 57 (1), pp. 449–461. Cited by: §1.
  • [48] B. Uzkent, A. Rangnekar, and M. Hoffman (2017) Aerial vehicle tracking by adaptive fusion of hyperspectral likelihood maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 39–48. Cited by: §1.
  • [49] B. Uzkent, E. Sheehan, C. Meng, Z. Tang, M. Burke, D. Lobell, and S. Ermon (2019) Learning to interpret satellite images in global scale using wikipedia. arXiv preprint arXiv:1905.02506. Cited by: §1.
  • [50] B. Uzkent, C. Yeh, and S. Ermon (2020) Efficient object detection in large images using deep reinforcement learning. In The IEEE Winter Conference on Applications of Computer Vision, pp. 1824–1833. Cited by: §4.1.
  • [51] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2.
  • [52] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang (2017) Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164. Cited by: §2.
  • [53] X. Wang, F. Yu, Z. Dou, T. Darrell, and J. E. Gonzalez (2018) Skipnet: learning dynamic routing in convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 409–424. Cited by: §2.
  • [54] Y. Wang, L. Wang, Y. You, X. Zou, V. Chen, S. Li, G. Huang, B. Hariharan, and K. Q. Weinberger (2018) Resource aware person re-identification across multiple resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8042–8051. Cited by: §2.
  • [55] Z. Wang, S. Chang, Y. Yang, D. Liu, and T. S. Huang (2016) Studying very low resolution recognition using deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4792–4800. Cited by: §1, §2, §5.2, Table 1.
  • [56] C. J. C. H. Watkins and P. Dayan (1992) Q-learning. In Machine Learning, pp. 279–292. Cited by: §4.2.
  • [57] Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman, and R. Feris (2018) Blockdrop: dynamic inference paths in residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8817–8826. Cited by: §2.
  • [58] Y. Yao, X. Li, Y. Ye, F. Liu, M. K. Ng, Z. Huang, and Y. Zhang (2019) Low-resolution image categorization via heterogeneous domain adaptation. Knowledge-Based Systems 163, pp. 656–665. Cited by: §2.
  • [59] L. Yue, H. Shen, J. Li, Q. Yuan, H. Zhang, and L. Zhang (2016) Image super-resolution: the techniques, applications, and future. Signal Processing 128, pp. 389–408. Cited by: §1.
  • [60] R. Zhang, S. Tang, Y. Zhang, J. Li, and S. Yan (2017) Scale-adaptive convolutions for scene parsing. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2031–2039. Cited by: §2.
  • [61] X. Zhang, X. Zhou, M. Lin, and J. Sun (2018) Shufflenet: an extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856. Cited by: §1.