While high resolution images contain semantically more useful information than their lower resolution counterparts, processing them is computationally more expensive, and in some applications, e.g. remote sensing, they can be much more expensive to acquire. For these reasons, it is desirable to develop an automatic method that selectively uses high resolution data only when necessary, maintaining accuracy while reducing acquisition and run-time cost. In this direction, we propose PatchDrop, a reinforcement learning approach that dynamically identifies when and where to use/acquire high resolution data, conditioned on paired, cheap, low resolution images. We conduct experiments on the CIFAR10, CIFAR100, ImageNet, and fMoW datasets, where we use significantly less high resolution data while maintaining accuracy similar to models that use full high resolution images.
Nevertheless, downsampling is often performed for computational and statistical reasons. Reducing the resolution of the inputs decreases the number of parameters, resulting in reduced computational and memory cost and mitigating overfitting. Downsampling is therefore often applied to trade accuracy loss for computational and memory gains. However, the same downsampling level is typically applied to all inputs. This strategy can be suboptimal because the amount of information loss (e.g., about a label) depends on the input. It would therefore be desirable to build an adaptive system that uses a minimal amount of high resolution data while preserving accuracy.
In addition to computational and memory savings, an adaptive framework can also benefit application domains where acquiring high resolution data is particularly expensive. A prime example is remote sensing, where acquiring a high resolution (HR) satellite image is significantly more expensive than acquiring its low resolution (LR) counterpart [26, 32, 9]. For example, LR images with 10m-30m spatial resolution captured by Sentinel-1 satellites [8, 36] are publicly and freely available, whereas an HR image with 0.3m spatial resolution captured by DigitalGlobe satellites can cost on the order of 1,000 dollars.
Reducing dependence on HR imagery can thus lower the cost of deep learning models trained on satellite images for a variety of tasks, e.g., poverty prediction, image recognition [49, 38], and object tracking [46, 45, 48]. Similar examples arise in medical and scientific imaging, where acquiring higher quality images can be more expensive or even more harmful to patients [16, 15].
In all these settings, it would be desirable to adaptively acquire only specific parts of the HR input. The challenge, however, is how to perform this selection automatically and efficiently, i.e., minimizing the number of acquired HR patches while retaining accuracy. As expected, naive strategies can be highly suboptimal. For example, randomly dropping patches of HR satellite images from the functional Map of the World (fMoW) dataset significantly reduces the accuracy of a trained network, as seen in Fig. 1a. As such, an adaptive strategy must learn to identify and acquire useful patches to preserve the accuracy of the network.
To address these challenges, we propose PatchDrop, an adaptive data sampling/acquisition scheme that samples only the patches of the full HR image required for inferring correct decisions, as shown in Fig. 1b. PatchDrop uses LR versions of input images to train an agent, in a reinforcement learning setting, to sample HR patches only when necessary. This way, the agent learns when and where to zoom into parts of the image to sample HR patches. PatchDrop is extremely effective on the functional Map of the World (fMoW) dataset. Surprisingly, we show that we can use only a fraction of the full HR images without any significant loss of accuracy. Considering this, we can save on the order of 100,000 dollars when performing a computer vision task using expensive HR satellite images at global scale. We also show that PatchDrop performs well on traditional computer vision benchmarks. On ImageNet, it samples only part of each HR image on average with minimal loss in accuracy. On a different task, we then increase the run-time performance of patch-based CNNs, BagNets, by 2x by reducing the number of patches that need to be processed using PatchDrop. Finally, leveraging the learned patch sampling policies, we generate hard positive training examples that boost the accuracy of CNNs on ImageNet and fMoW by 2-3%.
In stochastic depth, the probability of survival decays linearly in the deeper layers, following the hypothesis that low-level features play key roles in correct inference. Similarly, we could decay the survival likelihood of a patch w.r.t. its distance from the image center, based on the assumption that objects are predominantly located in the central part of the image. Stochastic layer dropping provides compression only at training time. On the other hand, [53, 57] propose reinforcement learning settings to drop blocks of ResNets at both training and test time, conditioned on the input image. Analogously, by replacing layers with patches, we can drop more patches from easy samples while keeping more for ambiguous ones.
Attention Networks Attention methods have been explored to localize semantically important parts of images [52, 31, 51, 42]. One line of work proposes a Residual Attention Network that replaces residual identity connections with residual attention connections; by residually learning feature guiding, it improves recognition accuracy on different benchmarks. Similarly, another approach proposes a differentiable saliency-based distortion layer to spatially sample input data given a task: LR images are fed to a saliency network that generates a grid highlighting semantically important parts of the image space, and the grid is then applied to HR images to magnify the important parts. A perspective-aware scene parsing network locates small and distant objects; with a two-branch (coarse and fovea) network, it produces coarse- and fine-level segmentation maps and fuses them to generate the final map. Another method adaptively resizes convolutional patches to improve segmentation of large and small objects. Yet another improves object detectors by replacing pre-determined fixed anchors with adaptive ones, recursively dividing a region into a fixed number of sub-regions whenever the zoom indicator given by the network is high. Finally, a sequential region proposal network (RPN) learns object-centric, less scattered proposal boxes for the second stage of Faster R-CNN. These methods are tailored to specific tasks and condition their attention modules on HR images. In contrast, we present a general framework and condition it on LR images.
Analyzing Degraded Quality Input Signals There has been relatively little work on improving CNNs' performance on degraded input signals. One approach uses knowledge distillation to train a student network on degraded inputs using the predictions of a teacher network trained on the paired higher quality signal. Another set of studies [55, 58] proposes domain adaptation from an HR network to an LR network. A further work pre-trains the LR network using the HR data and finetunes it using the LR data. Other domain adaptation methods focus on person re-identification with LR images [14, 21, 54]. All these methods boost network accuracy on LR input data; however, they assume the quality of the input signal is fixed.
We formulate the PatchDrop framework as a two-step episodic Markov Decision Process (MDP), as shown in the influence diagram in Fig. 2. In the diagram, we represent random variables with circles, actions with squares, and utilities with diamonds. A high spatial resolution image, x_h, is formed by P equal-size patches with zero overlap, x_h = (x_h^1, ..., x_h^P), where P represents the number of patches. In contrast with traditional computer vision settings, x_h is latent, i.e., it is not observed by the agent. The label y is a categorical random variable over C classes, representing the (unobserved) class associated with x_h. The low spatial resolution image, x_l, is the lower resolution version of x_h. x_l is initially observed by the agent in order to choose the binary action array a^1 in {0,1}^P, where a^1_i = 1 means that the agent samples the i-th HR patch, x_h^i. We define the patch sampling policy model, parameterized by theta_p, as
pi_p(a^1 | x_l; theta_p) = f_p(x_l; theta_p), where f_p is a function mapping the observed LR image to a probability distribution over the patch sampling actions. Next, the masked HR image, x_m, is formed using a^1 and x_h, with the masking operation formulated as x_m^i = a^1_i * x_h^i. The first step of the MDP can then be modeled with a joint probability distribution over the random variables x_h, x_l, x_m, and y, and the action a^1.
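The masking operation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's released code; the function name and the 4x4 patch grid default are our assumptions.

```python
import numpy as np

def mask_hr_image(x_h, actions, patch_grid=4):
    """Zero out the unsampled patches of an HR image.

    x_h:       (H, W, C) high-resolution image.
    actions:   length patch_grid**2 binary vector a^1 in row-major patch
               order; actions[i] == 1 keeps the i-th patch.
    Returns the masked image x_m with x_m^i = a^1_i * x_h^i.
    """
    H, W, _ = x_h.shape
    ph, pw = H // patch_grid, W // patch_grid
    x_m = np.zeros_like(x_h)
    for i, a in enumerate(actions):
        if a:  # copy the patch only when it was sampled
            r, c = divmod(i, patch_grid)
            x_m[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw] = \
                x_h[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
    return x_m
```

With all-ones actions the HR image passes through unchanged; with all-zeros actions the classifier would see only the LR stream.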
In the second step of the MDP, the agent observes the random variables x_m and x_l and chooses an action a^2, the class prediction. We define the class prediction policy as pi_cl(a^2 | x_m, x_l; theta_cl), where f_cl represents a classifier network parameterized by theta_cl. The overall objective, J, is then defined as maximizing the expected utility, J(theta_p, theta_cl) = E[R], where the utility R depends on a^1, a^2, and y. The reward penalizes the agent for selecting a large number of high-resolution patches (e.g., based on the norm of a^1) and includes a classification loss evaluating the accuracy of a^2 given the true label y (e.g., cross-entropy or 0-1 loss).
In the previous section, we formulated PatchDrop as a two-step episodic MDP. Here, we detail the action space and how the policy distributions for a^1 and a^2 are modelled. To represent our discrete action space for a^1, we divide the image space into equal-size patches with no overlap, resulting in P = 16 patches, as shown in Fig. 3. In this study, we use P = 16 regardless of the size of the input image and leave the task of choosing variable-size bounding boxes as future work. In the first step of the two-step MDP, the policy network, f_p, outputs the probabilities for all P actions at once after observing x_l. An alternative approach could take the form of a Bayesian framework where each action is conditioned on the previously chosen ones [7, 30]. However, the proposed concept of outputting all the actions at once provides a more efficient decision-making process for patch sampling.
In this study, we model the action likelihood function of the policy network, pi_p(a^1 | x_l; theta_p), by multiplying the probabilities of the individual high-resolution patch selections, represented by patch-specific Bernoulli distributions: pi_p(a^1 | x_l; theta_p) = prod_{i=1}^{P} p_i^{a^1_i} (1 - p_i)^{1 - a^1_i}, where p = f_p(x_l; theta_p) represents the prediction vector. To get probabilistic values p_i in (0, 1), we use a sigmoid function on the final layer of the policy network.
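The Bernoulli action sampling can be sketched as follows. This is an illustrative NumPy sketch under our own naming; the real policy network would produce the logits from the LR image.

```python
import numpy as np

def sample_patch_actions(logits, rng):
    """Sample binary patch actions a^1_i ~ Bernoulli(p_i), with
    p_i = sigmoid(logit_i), and return the joint log-likelihood
    log pi(a^1 | x_l) = sum_i [a^1_i log p_i + (1 - a^1_i) log(1 - p_i)].
    """
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))  # sigmoid
    actions = (rng.random(probs.shape) < probs).astype(int)
    log_lik = float(np.sum(actions * np.log(probs)
                           + (1 - actions) * np.log(1.0 - probs)))
    return actions, log_lik
```

The returned log-likelihood is what a policy-gradient update would differentiate through.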
The next action, a^2, is chosen by the classifier, f_cl, using the sampled HR image and the LR input. The upper stream of the classifier, f_h, uses the sampled HR image, x_m, whereas the bottom stream, f_l, uses the LR image, x_l, as shown in Fig. 3. Each stream outputs a probability distribution over class labels, s_h and s_l, using a softmax layer. We then compute the weighted sum of the predictions as s = (N_p / P) s_h + (1 - N_p / P) s_l, where N_p represents the number of sampled patches. To form a^2, we use the maximally probable class label: a^2_j = 1 if j = argmax_c s_c and a^2_j = 0 otherwise, where j represents the class index. In this setup, if the policy network samples no HR patch, we rely completely on the LR classifier, and the impact of the HR classifier increases linearly with the number of sampled patches.
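The weighted fusion of the two streams can be sketched as follows; a minimal NumPy sketch with our own function name, assuming s_h and s_l are already softmax-normalized class distributions.

```python
import numpy as np

def fuse_predictions(s_h, s_l, n_sampled, num_patches=16):
    """Weighted sum of the HR- and LR-stream class distributions.

    The HR stream's weight grows linearly with the number of sampled
    patches; with zero patches sampled the LR classifier decides alone.
    """
    w = n_sampled / num_patches
    return w * np.asarray(s_h, dtype=float) + (1.0 - w) * np.asarray(s_l, dtype=float)
```

For example, with n_sampled = 0 the output equals s_l exactly, and with all 16 patches sampled it equals s_h.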
After defining the two-step MDP and modeling the policy and classifier networks, we detail the training procedure of PatchDrop. The goal of training is to learn the optimal parameters theta_p and theta_cl. Because the actions are discrete, we cannot use the reparameterization trick to optimize the objective w.r.t. theta_p. Instead, we must use model-free reinforcement learning algorithms such as Q-learning or policy gradient. Policy gradient is more suitable in our scenario, since the number of unique actions the policy network can choose is 2^P and grows exponentially with P. For this reason, we use the policy gradient method to optimize the objective w.r.t. theta_p using
grad_{theta_p} J ~= E[ A(a^1, a_hat^1) grad_{theta_p} log pi_p(a^1 | x_l; theta_p) ], where a_hat^1 represents the baseline action vector. To get a_hat^1, we use the most likely action vector proposed by the policy network: a_hat^1_i = 1 if p_i > 0.5 and a_hat^1_i = 0 otherwise. The classifier, f_cl, then observes the masked images sampled using a^1 and a_hat^1 on its two branches and outputs the predictions, from which we get the predicted class index via the argmax over classes. The advantage function A(a^1, a_hat^1) = R(a^1) - R(a_hat^1) assigns the policy network a positive value only when the action vector sampled from Eq. 5 produces a higher reward than the maximum-likelihood action vector, which is known as a self-critical baseline.
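The self-critical gradient estimate above can be sketched in NumPy. This is a sketch under our own naming, with the classifier-dependent reward abstracted into a callable; it uses the fact that for a sigmoid-Bernoulli policy, d log pi / d logit_i = a_i - p_i.

```python
import numpy as np

def self_critical_grad(logits, reward_fn, rng):
    """One-sample REINFORCE estimate with a self-critical baseline:

        grad ~= (R(a) - R(a_hat)) * d log pi(a | x_l) / d logits,

    where a is sampled from the policy and a_hat_i = 1[p_i > 0.5] is the
    greedy (most likely) baseline action vector.
    """
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    a = (rng.random(probs.shape) < probs).astype(float)
    a_hat = (probs > 0.5).astype(float)          # most likely action vector
    advantage = reward_fn(a) - reward_fn(a_hat)  # self-critical advantage
    # d log Bernoulli(a_i; p_i) / d logit_i = a_i - p_i for a sigmoid policy
    return advantage * (a - probs), advantage
```

When the sampled actions do no better than the greedy baseline, the advantage (and hence the update) is zero, which reduces the variance of the estimator.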
Finally, in this study we use a temperature scaling method to encourage exploration during training by bounding the probabilities of the policy network as p'_i = alpha * p_i + (1 - alpha) * (1 - p_i), where alpha is the exploration/exploitation parameter.
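The bounding can be written as a one-liner; a minimal sketch assuming the affine form p' = alpha*p + (1-alpha)*(1-p), which clamps every probability into [1-alpha, alpha] so each patch always retains some chance of being sampled or skipped.

```python
import numpy as np

def bound_probs(p, alpha):
    """Temperature-scaled patch probabilities: p' = alpha*p + (1-alpha)*(1-p).

    Keeps p' inside [1 - alpha, alpha], preventing the Bernoulli policy
    from collapsing to deterministic choices during training.
    """
    p = np.asarray(p, dtype=float)
    return alpha * p + (1.0 - alpha) * (1.0 - p)
```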
Pre-training the Classifier After formulating our reinforcement learning setting for training the policy network, we first pre-train the two branches of f_cl, i.e., f_h and f_l, on x_h and x_l. We assume that x_h is observable at training time. The network trained on x_h performs reasonably (Fig. 1a) when patches are dropped at test time with a fixed policy, forming x_m. We then use this observation to pre-train the policy network, f_p, to dynamically learn to drop patches while keeping the parameters of f_h and f_l fixed.
Pre-training the Policy Network (Pt) After training the two streams of the classifier, f_cl, we pre-train the policy network, f_p, using the proposed reinforcement learning setting while fixing the parameters of f_cl. In this step, we use only f_h to estimate the expected reward when learning theta_p. This is because we want the policy network to understand which patches contribute most to the correct decisions made by the HR image classifier, as shown in Fig. 4.
Finetuning the Agent and HR Classifier (Ft-1) To further boost the accuracy of the policy network, f_p, we jointly finetune the policy network and the HR classifier, f_h. This way, the HR classifier can adapt to the sampled images, x_m, while the policy network learns new policies in line with it. The LR classifier, f_l, is not included in this step.
Finetuning the Agent and HR Classifier (Ft-2) In the final step of the training stage, we jointly finetune the policy network, f_p, and f_h, with the addition of f_l into the classifier f_cl. This way, the policy network can learn policies to drop further patches given the existence of the LR classifier. We combine the HR and LR classifiers using Eq. 7. Since the input to f_l does not change, we keep f_l fixed and only update f_p and f_h. The algorithm for the PatchDrop training stage is shown in Alg. 1. Upon publication, we will release the code to train and test PatchDrop.
Datasets and Metrics We evaluate PatchDrop on the following datasets: (1) CIFAR10, (2) CIFAR100, (3) ImageNet  and (4) functional map of the world (fMoW) . To measure its performance, we use image recognition accuracy and the number of dropped patches (cost).
Implementation Details In the CIFAR10/CIFAR100 experiments, we use a ResNet8 for the policy network and a ResNet32 for the classifier. The policy and classifier networks use 8x8px and 32x32px images, respectively. On ImageNet/fMoW, we use a ResNet10 for the policy network and a ResNet50 for the classifier. The policy network uses 56x56px images whereas the classifier uses 224x224px images. We initialize the weights of the LR classifier with those of the HR classifier and use the Adam optimizer in all our experiments. Finally, we initially set the exploration/exploitation parameter, alpha, to 0.7 and increase it to 0.95 linearly over time.
Reward Function We choose R = 1 - (|a^1|_1 / P)^2 if y_hat = y and R = -sigma otherwise as the reward. Here, y_hat and y represent the class predicted by the classifier after observing x_m and x_l, and the true class, respectively. The proposed reward function increases quadratically w.r.t. the number of dropped patches. To adjust the trade-off between accuracy and the number of sampled patches, we introduce sigma; setting it to a large value encourages the agent to sample more patches to preserve accuracy.
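The reward can be sketched directly from this description; a minimal NumPy sketch with our own function name, where the prediction/label comparison stands in for the classifier's output.

```python
import numpy as np

def patchdrop_reward(actions, pred_class, true_class, sigma):
    """R = 1 - (|a^1|_1 / P)^2 when the fused prediction is correct,
    and -sigma otherwise. Fewer sampled patches => quadratically larger
    reward, but only if the classification still succeeds."""
    actions = np.asarray(actions)
    P = actions.size
    if pred_class == true_class:
        return 1.0 - (actions.sum() / P) ** 2
    return -float(sigma)
```

A correct prediction with zero sampled patches earns the maximum reward of 1, while sampling every patch earns 0 even when correct, pushing the agent toward sparse sampling.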
No Patch Sampling/No Patch Dropping In this case, we simply train a CNN on LR or HR images with cross-entropy loss without any domain adaptation and test it on LR or HR images. We call them LR-CNN and HR-CNN.
Fixed and Stochastic Patch Dropping We propose two baselines that sample central patches along the horizontal and vertical axes of the image space, called Fixed-H and Fixed-V. We list the sampling priorities for the patches in the order 5, 6, 9, 10, 13, 14, 1, 2, 0, 3, 4, 7, 8, 11, 15 for Fixed-H, and 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3 for Fixed-V. The patch IDs are shown in Fig. 3. Using a similar hypothesis, we then design a stochastic method that decays the survival likelihood of a patch w.r.t. the Euclidean distance from the center of the patch to the image center.
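The fixed baselines are easy to sketch from the priority lists; a minimal NumPy sketch (note the Fixed-H order as printed in the text lists 15 of the 16 patch IDs, so we reproduce it verbatim rather than guessing the missing entry).

```python
import numpy as np

# Sampling priority orders for the Fixed-H and Fixed-V baselines,
# copied from the text (patch IDs refer to the 4x4 grid in Fig. 3).
FIXED_H = [5, 6, 9, 10, 13, 14, 1, 2, 0, 3, 4, 7, 8, 11, 15]
FIXED_V = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3]

def fixed_policy(priority, n_keep, num_patches=16):
    """Binary action vector keeping the n_keep highest-priority patches."""
    a = np.zeros(num_patches, dtype=int)
    a[priority[:n_keep]] = 1
    return a
```

For example, `fixed_policy(FIXED_H, 2)` keeps only the two central patches 5 and 6.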
Super-resolution We use SRGAN to learn to upsample LR images and use the SR images in the downstream tasks. This method can only improve accuracy, not computational complexity, since SR images have the same number of pixels as HR images.
Attention-based Patch Dropping In terms of state-of-the-art models, we first compare our method to the Spatial Transformer Network (STN). We treat its saliency network as the policy network and sample the top-activated patches to form masked images for the classifier.
Domain Adaptation Finally, we use two state-of-the-art domain adaptation methods [55, 40] to improve recognition accuracy on LR images. These methods are based on Partially Coupled Networks (PCN) and Knowledge Distillation (KD).
The LR-CNN, HR-CNN, PCN, KD, and SRGAN are standalone models and always use the full LR or HR image. For this reason, their values are identical across the Pt, Ft-1, and Ft-2 steps, and we show them in the upper part of the tables.
One application domain of PatchDrop is remote sensing, where LR images are significantly cheaper than HR images. In this direction, we test PatchDrop on the functional Map of the World (fMoW) dataset, consisting of HR satellite images. We use 350,000, 50,000, and 50,000 images as training, validation, and test sets. After training the classifiers, we pre-train the policy network for 63 epochs with a learning rate of 1e-4 and a batch size of 1024. Next, we finetune (Ft-1 and Ft-2) the policy network and HR classifier with a learning rate of 1e-4 and a batch size of 128. Finally, we set sigma to 0.5, 20, and 20 in the pre-training and the two fine-tuning steps, respectively.
As seen in Table 1, PatchDrop samples only a fraction of each HR image on average while maintaining the accuracy of the network that uses full HR images. Fig. 4 shows examples of how the policy network chooses actions conditioned on the LR images. When the image contains a field with uniform texture, the agent samples a small number of patches, as seen in columns 5, 8, 9, and 10. On the other hand, it samples patches from the buildings when the ground truth class represents a building, as seen in columns 1, 6, 12, and 13.
We also perform experiments with different downsampling ratios and sigma values in the reward function, to observe the trade-off between the number of sampled patches and accuracy. As seen in Fig. 5, as we increase the downsampling ratio, we zoom into more patches to maintain accuracy. Likewise, with increasing sigma, we zoom into more patches, since a larger value more heavily penalizes policies that result in unsuccessful classification.
Experiments on CIFAR10/CIFAR100 Although the CIFAR datasets already consist of LR images, we believe that conducting experiments on standard benchmarks is useful to characterize the model. For CIFAR10, after training the classifiers, we pre-train the policy network with a batch size of 1024 and a learning rate of 1e-4 for 3400 epochs. In the joint finetuning stages, we keep the learning rate, reduce the batch size to 256, and train the policy and HR classifier networks for 1680 and 990 epochs, respectively. The penalty reward is set to -0.5 in the pre-training stage and -5 in the joint finetuning stages, whereas alpha is tuned to 0.8. Our CIFAR100 setup is similar to the CIFAR10 one, including hyper-parameters.
As seen in Table 2, PatchDrop drops a large fraction of the patches in the original image space on CIFAR10, with minimal loss in overall accuracy. On CIFAR100, we observe that it samples 2.2 more patches on average than in the CIFAR10 experiment, which might be due to the higher complexity of the CIFAR100 dataset.
Experiments on ImageNet Next, we test PatchDrop on the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) dataset, which contains 1.2 million, 50,000, and 150,000 training, validation, and test images. For augmentation, we randomly crop a 224x224px area from the 256x256px images and perform horizontal flips. After training the classifiers, we pre-train the policy network for 95 epochs with a learning rate of 1e-4 and a batch size of 1024. We then perform the first fine-tuning stage and jointly finetune the HR classifier and policy network for 51 epochs with a learning rate of 1e-4 and a batch size of 128. Finally, we add the LR classifier and jointly finetune the policy network and HR classifier for 10 epochs with the same learning rate and batch size. We set sigma to 0.1, 10, and 10 for the pre-training and fine-tuning steps, respectively.
As seen in Table 2, we can maintain the accuracy of the HR classifier while dropping a substantial fraction of the patches with the Ft-1 and Ft-2 models. We also show the learned policies on ImageNet in Fig. 6. The policy network decides to sample no patches when the input is relatively easy, as in columns 3 and 8.
Analyzing the Policy Network's Actions To better understand the sampling actions of the policy network, we visualize the accuracy of the classifier w.r.t. the number of sampled patches, as shown in Fig. 7 (left). Interestingly, the accuracy of the classifier is inversely proportional to the number of sampled patches. We believe this occurs because the policy network samples more patches from challenging and ambiguous cases to ensure that the classifier predicts the label successfully. On the other hand, it successfully learns when to sample no patches, although it chooses N_p = 0 less often on average than it samples 4-7 patches. Increasing the ratio of N_p = 0 cases is future work for this study. Finally, Fig. 7 (right) displays the probability of sampling a patch given its position. We see that the policy network learns to sample the central patches more often than the peripheral patches, as expected.
Model | Acc. (%) | Patches | Acc. (%) | Patches | Run-time
BagNet (No Patch Drop) | 85.6 | 16 | 85.6 | 16 | 192
CNN (No Patch Drop) | 92.3 | 16 | 92.3 | 16 | 77
Previously, we tested PatchDrop on the fMoW satellite image recognition task to reduce the financial cost of analyzing satellite images by reducing the dependency on HR images while preserving accuracy. Next, we propose using PatchDrop to decrease the run-time complexity of local CNNs such as BagNets, which have recently been proposed as a novel image recognition architecture. They run a CNN on image patches independently and sum up the class-specific spatial probabilities. Surprisingly, BagNets perform similarly to CNNs that process the full image in one shot. This concept fits PatchDrop perfectly, as it learns to select semantically useful local patches that can be fed to a BagNet. This way, the BagNet is trained not on all the patches of the image but only on the useful ones. By dropping redundant patches, we can speed it up and improve its accuracy. In this case, we first train the BagNet on all the patches and pre-train the policy network on LR images (downsampled 4x) to learn which patches are important for the BagNet. Using LR images and a shallow network (ResNet8), we reduce the run-time overhead introduced by the agent to a small fraction of that of the CNN (ResNet32) using HR images. Finally, we jointly finetune (Ft-1) the policy network and BagNet. We illustrate the proposed Conditional BagNet in Fig. 8.
We perform experiments on CIFAR10 and show the results in Table 3. The proposed Conditional BagNet using PatchDrop improves the accuracy of the BagNet, closing the gap between global and local CNNs. Additionally, it substantially decreases the run-time complexity, significantly reducing the gap between local and global CNNs in terms of run-time. (The run-times are measured on an Intel i7-7700K CPU@4.20GHz.) The speed-up can be further improved by processing the selected patches on different GPUs in parallel at test time.
Finally, utilizing the learned masks to avoid convolutional operations in the layers of a global CNN is another promising direction for our work. Prior work drops spatial blocks of CNN feature maps at training time to perform stronger regularization than Dropout. Our method, on the other hand, can drop blocks of the feature maps dynamically at both training and test time.
PatchDrop can also be used to generate hard positives for data augmentation. In this direction, we utilize the masked images, x_m, produced by the policy network (Ft-1) to generate hard positive examples to better train classifiers. To generate conditional hard positive examples, we choose the number of patches to be masked from a uniform distribution with minimum and maximum values of 1 and 4. Next, given the probabilities produced by the policy network, we choose the patches with the highest probabilities, mask them, and use the masked images to train the classifier. Finally, we compare our approach to CutOut, which randomly cuts/masks image patches for data augmentation. As shown in Table 4, our approach leads to higher accuracy on all the datasets when using the original images, x_h, at test time. This shows that the policy network learns to select informative patches.
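The hard-positive generation step can be sketched as follows; a minimal NumPy sketch under our own naming, where `patch_probs` stands in for the policy network's per-patch probabilities.

```python
import numpy as np

def hard_positive_mask(patch_probs, rng, max_masked=4):
    """Build a hard positive by masking the k most informative patches
    (highest policy probabilities), with k drawn uniformly from
    {1, ..., max_masked}. Returns a binary keep-vector (0 = masked)."""
    patch_probs = np.asarray(patch_probs, dtype=float)
    k = int(rng.integers(1, max_masked + 1))
    drop = np.argsort(patch_probs)[::-1][:k]   # top-k probability patches
    a = np.ones(patch_probs.size, dtype=int)
    a[drop] = 0                                # mask the informative patches
    return a
```

The returned vector can be applied to the HR image with the same patch-masking operation used at inference time, forcing the classifier to learn from the less informative patches.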
In this study, we proposed a novel reinforcement learning setting to train a policy network that learns when and where to sample high resolution patches, conditioned on low resolution images. Our method can be highly beneficial in domains such as remote sensing, where high quality data is significantly more expensive than its low resolution counterpart. In our experiments, on average, we drop 40-60% of each high resolution image while preserving accuracy similar to networks that use full high resolution images on ImageNet and fMoW. Our method also significantly improves the run-time efficiency and accuracy of BagNet, a patch-based CNN. Finally, we used the learned policies to generate hard positives to boost classifiers' accuracy on the CIFAR, ImageNet, and fMoW datasets.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172-6180.
Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.
Thirty-Second AAAI Conference on Artificial Intelligence.
On low-resolution face recognition in the wild: comparisons and new techniques. IEEE Transactions on Information Forensics and Security 14(8), pp. 2000-2012.
Using low resolution satellite imagery for yield prediction and yield anomaly detection. Remote Sensing 5(4), pp. 1704-1733.
Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7008-7024.
The Journal of Machine Learning Research 15(1), pp. 1929-1958.