Introduction
In the last few years, convolutional neural networks (CNNs) have made great progress in visual recognition tasks such as image classification, object detection, and semantic segmentation [9, 21, 16]. Although diverse CNN-based architectures have shown state-of-the-art performance in many computer vision tasks, these results rest on the assumption that training and test data are drawn from the same distribution. When deep learning applications are deployed in real-world settings, however, they inevitably encounter data generated from a distribution different from the one seen in training, and performance can no longer be guaranteed. Geometric variations caused by different poses, deformations, and viewpoints are one such shift, and they are among the major challenges that deep learning systems face in practice. Several studies have addressed this issue [29, 17, 5, 30, 28].

Apart from the issue of model performance on real-world data, automated decision-making systems are usually opaque, which means that their internals are not fully accessible to us. The only things we can observe are the input and the predicted output of the black-box model. This raises a natural question: how much should we trust the predictions of the black-box system for a given input? And even if we assume there is an indicator that tells us how much we can trust a prediction, what action should we take to obtain a reliable prediction when it tells us that the predicted result is highly uncertain?
For example, consider a license plate recognition system. The embedded system outputs the predicted number given a license plate. Its predictions are usually very accurate for numbers with typical shapes. However, a license plate crumpled in a car accident (e.g., a tilted number 7 on the plate) would prevent the system from recognizing the numbers accurately (e.g., the 7 is predicted as 1). The system would also have high uncertainty in its prediction.
Motivated by the question above, this paper aims to spatially transform the input data to obtain the most reliable predictions by applying a sequence of modified inputs to the black-box model. If test data is not sampled from the same distribution as the data used for training the black-box model, the probability of false prediction increases. To improve the accuracy and reliability of predictions, the test input should be transformed to follow the training data distribution of the black-box model. We refer to the transformation model as a transform learner. After the test data is modified by the transform learner, the black-box model maps the transformed data to an output that gives not only the prediction but also the confidence of the prediction, which we call the confidence score. The schematic description is given in Figure 1(a).

The main challenge in our setting is that we do not know what data distribution the black-box system was trained on. The prediction and the confidence score are provided by the black-box model only when a particular input is given. We cannot directly train the transform learner with a gradient-based method because the black-box model prevents the gradient from flowing back to the transform learner. To deal with this problem, we use a black-box optimization technique that approximates the gradient of the output by reinforcement learning [10].

Specifically, in this paper, we take a close look at image classification as the black-box problem. We use a pretrained image classifier as the black-box model. Input images are transformed by a transform learner, which we call the REinforcement Spatial Transform learner (REST). REST is trained to perform geometric transformations on the input images to increase the confidence score. At inference time, an input image is transformed by REST, and the black-box model then predicts its class label.
Related Work
Invariance to geometric transformations
There have been many studies on achieving geometric invariance or equivariance in computer vision problems such as classification and segmentation. Here we review four methods: data augmentation, spatial transformer networks, deformable convolutional networks, and capsule networks.
Data augmentation
Data augmentation (DA) [26] is a ubiquitous technique used to improve generalization. Although DA is a good starting point for transformation invariance, it is extremely expensive to augment training images with every possible combination of random rotations, shifts, and scales, and models learned with DA are often observed to be invariant only to known transformations rather than to arbitrary changes.
Spatial Transformer Network
The spatial transformer network (STN) [11] is the first work that learns to apply spatial transformations to warp feature maps in an end-to-end fashion. STN consists of three components: a localization network, a grid generator, and a sampler. The localization network learns input-dependent transformations and allows the entire network to increase classification accuracy. The drawback of STN is that it has to be trained with various transformations in the training phase, and we find that STN fails to generalize when unknown, rare transformations are applied to input images.
Deformable Convolutional Networks
In [3], the authors argue that it is inherently difficult to handle objects with different scales and shapes in a regular CNN because its operations, e.g., convolution kernels and max-pooling, have geometrically fixed patterns. Deformable ConvNets model dense spatial transformations by learning 2D offsets to the regular sampling grid from the target task instead of using parametric transformations. Deformable ConvNets are then applied to semantic segmentation and object detection to demonstrate their efficiency.
Capsule network
Capsules [24] are represented by a vector that contains the features of an object and its likelihood. Higher-level capsules are activated only when the group of objects below them is consistent in orientation and size. The authors tested CapsNet on the affNIST dataset to show its robustness to affine transformations. CapsNet was trained only on MNIST digits with random translations and achieved 79% accuracy on affNIST, which indicates that it generalizes well to small affine transformations.
Deep RL for computer vision
Deep RL methods have been making steady progress in games [18], robotics [14], finance [4], etc., and have recently been extended to a wide range of computer vision tasks. [2] addresses object detection by learning a localization policy to find region proposals that best focus on the object, while [13] suggests joint agent detection to reduce the number of iterations compared to a single agent.
For visual tracking, [32] proposes to dynamically track the object with increased efficiency in the search space by selecting sequential actions on candidate bounding boxes. [22] extends this to multi-object tracking by modeling each object as an agent and exploiting the collaborative interactions between agents.
Video recognition can be computationally expensive if an exhaustive search is performed in every frame. The authors of [20] are the first to produce temporal-spatial representations and then find the most relevant information in video pairs, formulated as a Markov decision process (MDP), for face recognition. In action recognition, [27] proposes to reduce the computational burden by selecting only the key frames in skeleton-based videos and generating the reward with a graph-based convolutional neural network.

Deep RL has also been applied to image editing and color enhancement. [15] deploys an RL-based method for automatic image cropping in which the agent makes sequential decisions on where to crop the original image to maximize the aesthetics score. [31] progressively restores a corrupted image by selecting a specific tool from a toolbox at each step, while [19] focuses on automatic color enhancement in which the agent takes an interpretable action sequence to produce a retouched image like a human expert. Here, we present a novel RL-based strategy for geometric invariance; to the best of our knowledge, this is the first work to apply deep RL to this problem.
Method
In this section, we describe the procedure of REST. The purpose of REST is to warp the input image so that the black-box model produces a reliable prediction. We first define the black-box model, which performs a classification task and provides confidence scores. Our method, REST, is composed of two modules. The first is a trainable RL module that outputs the transformation parameters when geometrically deformed images are given as states. The other is a warp module that takes an input image and warps it with the transformation parameters.
Black Box Model
We view the black-box function as a probabilistic model that assigns a probability $p(y \mid x)$ of the class $y$ to an input $x$. For the black-box image classifier $f$, $x \in \mathbb{R}^{H \times W \times C}$ is an input image, where $H$, $W$, and $C$ are the height, width, and number of channels of the input image, and $y \in \{1, \dots, K\}$ is the class, with a set of $K$ classes. We formulate the function of the black box as $f(x) = (\hat{y}, c)$ with $\hat{y} = \arg\max_{y} p(y \mid x)$, where $c$ is the confidence score obtained from the black box $f$. We discuss the definition of the confidence score and the method we choose to measure it in the next section. The classifier $f$ is trained on the dataset $\mathcal{D}$, and $\mathcal{D}$ is drawn from the data distribution $p_{\text{in}}$, which is regarded as the in-distribution. When training the REST learner, we are not allowed to access the original dataset $\mathcal{D}$, in keeping with the black-box assumption.
Confidence Score
It is important for neural networks to understand how uncertain they are in their predictions. While some methods quantify predictive uncertainty by training neural networks whose structure differs from that of prevailing deterministic models [8, 1], other approaches obtain the uncertainty without modifying the existing classifier structure or training procedure [7, 6, 23]. We choose one of the latter because we are dealing with a black-box model. Here we set the confidence score to be inversely proportional to the predictive uncertainty obtained from the black-box model.
One of the most widely used approaches to uncertainty estimation is Monte Carlo (MC) Dropout, owing to its simplicity [7, 6]. When the black-box classifier has a dropout layer, the uncertainty is calculated by applying dropout at inference time. Another simple method for measuring uncertainty is to use a value proportional to the inverse of the prediction probability (which means the prediction probability itself serves as the confidence score).

While an extensive literature reports that softmax outputs are sometimes overly confident and thus not sufficient as confidence scores [7, 6], in our experiments the softmax output itself was a proper indicator of the confidence level of predictions. Experimental results of replacing the confidence score with the predictive uncertainty from MC dropout are given in Appendix B.
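As a minimal sketch of the softmax-based confidence score described above, the snippet below turns a vector of logits into class probabilities and reads off either the maximum probability or the probability of a given target class. The function names and the example logits are illustrative assumptions, not part of the original method description.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def confidence_from_probs(probs, target=None):
    """Return (predicted_class, confidence).

    With target=None the confidence is the maximum class probability;
    otherwise it is the probability assigned to the target class.
    """
    pred = max(range(len(probs)), key=probs.__getitem__)
    conf = probs[pred] if target is None else probs[target]
    return pred, conf

# Hypothetical logits for a 3-class problem.
probs = softmax([2.0, 0.5, -1.0])
pred, conf = confidence_from_probs(probs)
```

Passing `target=` corresponds to using the probability of a known label, while omitting it corresponds to using the maximum probability when no label is available.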
Reinforcement Spatial Transform Learner
The REST learner consists of two modules: the RL module and the warp module. Given a distorted input image $x$, the RL module provides a transform parameter $\theta$, which the warp module $W$ uses to transform the distorted image into $x' = W(x; \theta)$. As the in-distribution $p_{\text{in}}$ and the dataset $\mathcal{D}$ are unknown, we indirectly estimate how likely $x'$ is to be an in-distribution sample from its confidence score $c$.

We train the RL module with a new dataset $\mathcal{D}'$ to find, for each sample $x \in \mathcal{D}'$, a transform parameter $\theta$ that yields a higher confidence score for the transformed image $x'$. Let us clarify the state, action, policy, and reward of the RL module. The state $s_t$ is the input at time step $t$, with $s_0 = x$, and the action $a_t$ is the parameter used in the warp module $W$. By mapping $s_t$ with $W$, we get the next state $s_{t+1} = W(s_t; a_t)$. The policy $\pi(a_t \mid s_t)$ is the probability of each parameter being selected given the input state, and the purpose of training the RL module is to find the optimal policy.

As the agent is trained to maximize the cumulative reward, we formulate the reward $r_t$ so that it increases when the confidence score of the image $s_{t+1}$ is larger than that of $s_t$:

$$r_t = \log c_{t+1} - \log c_t, \tag{1}$$

where $c_t$ denotes the confidence score of $s_t$. In training mode, we set the confidence score to be the prediction probability of the target class label $y^{*}$, i.e., $c_t = p(y^{*} \mid s_t)$. Equation (1) can then be interpreted as a difference between the log-likelihoods of the target class.
To give a higher reward as the confidence score approaches 1, and to apply a per-step penalty that encourages shorter episodes, we modify the reward function as
(2) 
We also perform an ablation study that compares reward functions (1) and (2) in Appendix C. In training mode, an episode terminates when the confidence score exceeds a threshold value at time step $t$ or when the number of steps reaches the maximum. We set the maximum number of steps to 10 for all experiments. In inference mode, we use the maximum likelihood, $\max_{y} p(y \mid s_t)$, instead of the prediction probability of the target class, $p(y^{*} \mid s_t)$, as the confidence score to determine termination; the other parts are the same.
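The episode structure described above can be sketched with toy stand-ins. In the snippet below, the "image" is reduced to a single tilt angle, the black box is a hypothetical function whose confidence peaks at zero tilt, and the policy is a hand-written rule rather than a trained agent. The reward is the difference of log confidence scores, following the interpretation of equation (1); the modified reward (2) is not implemented. The threshold value of 0.95 is an assumption, while the maximum of 10 steps comes from the text.

```python
import math

MAX_STEPS = 10      # maximum episode length stated in the text
THRESHOLD = 0.95    # hypothetical terminal threshold (not given in the text)

def black_box_confidence(angle_deg):
    """Toy stand-in for the black box: confidence peaks at 0 degrees of tilt."""
    return math.exp(-(angle_deg / 30.0) ** 2)

def warp(angle_deg, action_deg):
    """Toy warp module: rotate the 'image' by the chosen angle."""
    return angle_deg + action_deg

def policy(angle_deg):
    """Hand-written stand-in for the trained RL policy: undo half the tilt."""
    return -0.5 * angle_deg

def run_episode(initial_angle):
    """Warp repeatedly until the confidence exceeds the threshold or steps run out."""
    state = initial_angle
    conf = black_box_confidence(state)
    rewards = []
    for _ in range(MAX_STEPS):
        if conf > THRESHOLD:
            break
        next_state = warp(state, policy(state))
        next_conf = black_box_confidence(next_state)
        # Reward as a difference of log confidence scores.
        rewards.append(math.log(next_conf) - math.log(conf))
        state, conf = next_state, next_conf
    return state, conf, rewards

final_state, final_conf, rewards = run_episode(60.0)
```

Because the per-step rewards telescope, their sum equals the log confidence of the final state minus that of the initial state, so maximizing the cumulative reward amounts to maximizing the final confidence.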
Training
We use the PPO algorithm [25] for the RL module. To make it operate in a continuous action space, we modify the actor network so that its output is the mean of a normal distribution with a standard deviation of 1. The reparameterization trick is used to enable backpropagation. After setting the activation function of the actor network to the hyperbolic tangent function, we rescale its output range to match the action bounds so that the parameter stays within the action bounds, which stabilizes learning. In the test phase, we reduce the standard deviation of the normal distribution to decrease the randomness of the actor network. We train the model using Adam optimization [12] with a learning rate of 0.0001 and a batch size of 256.

Table 1. Test accuracy (%) of the black-box (BB) classifier on affine-warped MNIST, with and without REST.

| Method | R | RSc | RSh | RSS | RSST |
|---|---|---|---|---|---|
| BB | 69.28 | 65.52 | 56.52 | 53.95 | 17.99 |
| REST+BB | 97.63 | 96.30 | 94.81 | 93.20 | 85.05 |
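Table 2. Test accuracy (%) of the black-box (BB) classifier on SVHN, CIFAR-10, and STL-10, before and after attaching REST.

| Dataset | Method | base | R | RSc | RSh | RSS | RSST |
|---|---|---|---|---|---|---|---|
| SVHN | BB | 96.03 | 59.96 | 56.46 | 57.05 | 53.54 | 26.09 |
| SVHN | REST + BB | - | 89.38 | 88.58 | 89.00 | 85.92 | 83.82 |
| CIFAR-10 | BB | 93.78 | 51.58 | 49.14 | 50.69 | 48.16 | 32.09 |
| CIFAR-10 | REST + BB | - | 74.33 | 72.15 | 70.73 | 69.46 | 60.27 |
| STL-10 | BB | 77.59 | 41.57 | 38.46 | 41.26 | 37.97 | 30.90 |
| STL-10 | REST + BB | - | 62.24 | 59.18 | 59.94 | 56.55 | 53.20 |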
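The bounded Gaussian actor described in the Training section can be sketched as follows. The snippet rescales a tanh output to the action bounds to obtain the mean, then draws an action as mean plus standard deviation times standard normal noise, which is the form used by the reparameterization trick. The bounds of ±30, the helper names, and the final clipping step are assumptions made for this illustration; a real implementation would use an autodiff framework so that gradients flow through the mean.

```python
import math
import random

def scale_to_bounds(u, low, high):
    """Map a tanh output u in [-1, 1] linearly onto [low, high]."""
    return low + 0.5 * (u + 1.0) * (high - low)

def sample_action(pre_activation, low, high, std=1.0, rng=random):
    """Sample one action dimension in reparameterized form.

    mean = tanh(pre_activation) rescaled to the action bounds;
    action = mean + std * eps with eps ~ N(0, 1).
    """
    mean = scale_to_bounds(math.tanh(pre_activation), low, high)
    eps = rng.gauss(0.0, 1.0)
    action = mean + std * eps
    # Clipping is an assumption of this sketch, used to keep samples in bounds.
    return min(max(action, low), high), mean

rng = random.Random(0)
# Training-style sample with standard deviation 1 (hypothetical bounds of +/-30).
train_action, train_mean = sample_action(0.0, -30.0, 30.0, std=1.0, rng=rng)
# Test-style sample with a reduced standard deviation.
test_action, test_mean = sample_action(0.0, -30.0, 30.0, std=0.1, rng=rng)
```

Lowering `std` at test time keeps the sampled action close to the mean, which matches the stated goal of reducing the randomness of the actor.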
Experiment
In this section, we demonstrate the effectiveness of our approach in generalization by attaching the REST learner to the black-box model. To train the RL module, we generate a dataset by applying random affine transformations to data in the expected canonical style of the dataset used for training the black-box classifier. For example, if we know that a black-box model classifies grayscale numeric images, it is plausible to assume that the model was trained on the MNIST dataset, and we therefore generate a dataset by applying random affine transformations to MNIST images.
We also use the affine transformation for the warp module. Although 6 parameters are typically used for an affine matrix, we factorize the affine matrix into 4 matrices with 7 parameters in total,

$$A = A_{R}(\theta_{R})\, A_{Sc}(\theta_{Sc})\, A_{Sh}(\theta_{Sh})\, A_{T}(\theta_{T}), \tag{3}$$

where $\theta_{R}$, $\theta_{Sc}$, $\theta_{Sh}$, and $\theta_{T}$ are the parameters for rotation, scaling, shearing, and translation, respectively. Details of generating the affine-warped dataset and the settings of the action bounds are described in Appendix A.
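The factorization into four matrices can be illustrated with plain 3x3 homogeneous matrices. The sketch below assumes one rotation angle, two scaling factors, two shearing factors, and two translation offsets (seven parameters in total), multiplied in the order rotation, scaling, shearing, translation. Both the parameter split and the multiplication order are assumptions made for illustration; the source does not state them.

```python
import math

def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def scaling(sx, sy):
    return [[sx, 0.0, 0.0], [0.0, sy, 0.0], [0.0, 0.0, 1.0]]

def shearing(hx, hy):
    return [[1.0, hx, 0.0], [hy, 1.0, 0.0], [0.0, 0.0, 1.0]]

def translation(tx, ty):
    return [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]

def affine_from_params(theta, sx, sy, hx, hy, tx, ty):
    """Compose the 7-parameter affine matrix (assumed order: R, Sc, Sh, T)."""
    m = rotation(theta)
    for factor in (scaling(sx, sy), shearing(hx, hy), translation(tx, ty)):
        m = matmul(m, factor)
    return m

# Neutral parameters give the identity transform.
identity = affine_from_params(0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0)
# A pure translation by (2, 3).
shifted = affine_from_params(0.0, 1.0, 1.0, 0.0, 0.0, 2.0, 3.0)
```

Keeping each factor separate means every action dimension has a direct geometric meaning, at the cost of one redundant parameter relative to a general 6-parameter affine matrix.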
As we use image datasets for the experiments, the actor and critic networks in the RL module are each constructed from two convolution layers followed by two fully connected layers. We choose a 5-layer CNN and STN [11] as baselines and test transformation invariance on the affine-warped MNIST, SVHN, CIFAR-10, and STL-10 datasets. We train the baseline models using Adam optimization [12] with a learning rate of 0.0001 and a batch size of 128.
We start with the distorted MNIST dataset to show the improvement in classification performance under a shifted data distribution. We then test our model on more challenging real-world datasets: SVHN, CIFAR-10, and STL-10. Next, we demonstrate the generalization ability of our model in a more heavily shifted, harder setting in which disjoint subsets of transformations are selected for training and testing the REST learner. Finally, we show that even when the REST learner is trained with a small amount of training data, the performance does not drop significantly, which indicates good sample efficiency.
Improvement of Classification Performance
To evaluate how much the REST learner improves the black box's performance, we begin with a black-box grayscale numeric image classifier. We use a CNN pretrained on the MNIST dataset as the black-box classifier, which achieves 99% accuracy on the MNIST test data. However, when it is applied to randomly rotated MNIST test images (R), the accuracy decreases to 69.28%. By attaching REST in front of the black-box model, the accuracy increases back to 97.63%.
We also experiment with more difficult tasks by generating the data with rotation and scaling (RSc), rotation and shearing (RSh), rotation, scaling, and shearing (RSS), and rotation, scaling, shearing, and translation (RSST) of the MNIST test digits.
Examples of the sequence of transformations for each warping method are shown in Figure 3. All the affinely warped input images in the first column are transformed two times to get the final output. The quantitative results are shown in Table 1. As task complexity increases from simple rotation to full affine transformation, the black-box classifier becomes worse at predicting the label. For the RSST method, the accuracy of the black-box model is 17.99%, which is close to random guessing. However, adding the REST model in front of the black-box model greatly improves the performance. The difference in accuracy between applying the REST model and using only the black-box model grows as the task becomes more difficult.
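The widening gap can be checked directly from the Table 1 numbers. The short snippet below copies the reported accuracies and computes the gain of REST+BB over BB for each warping method; the variable names are illustrative.

```python
# Accuracies (%) copied from Table 1.
bb = {"R": 69.28, "RSc": 65.52, "RSh": 56.52, "RSS": 53.95, "RSST": 17.99}
rest_bb = {"R": 97.63, "RSc": 96.30, "RSh": 94.81, "RSS": 93.20, "RSST": 85.05}

# Gain in percentage points for each warping method, in table order.
gains = {k: round(rest_bb[k] - bb[k], 2) for k in bb}
for method, gain in gains.items():
    print(f"{method}: +{gain:.2f} points")
```

The gains come out to 28.35, 30.78, 38.29, 39.25, and 67.06 percentage points for R, RSc, RSh, RSS, and RSST, respectively, so the improvement increases monotonically with task difficulty.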
Figure 2 shows the sequence in which a distorted input image is transformed by REST. When an affinely distorted image of the number 9 is taken as the first state $s_0$, the RL module outputs a transform parameter $a_0$. Then, the warp module receives the state and the parameter to produce the next state $s_1$, which is rotated clockwise, scaled up, and translated toward the upper left. The black-box model maps the transformed image to a confidence score $c_1$. As the confidence score does not exceed the terminal threshold, the process iterates until the terminal condition is satisfied. An interesting observation is that the final image resembles the style of data sampled from the in-distribution $p_{\text{in}}$, which is the MNIST dataset in this case. The number 9 in the image stands upright and is located at the center of the image, with a size similar to typical MNIST data.
We further evaluate our model on real-world datasets. We use three datasets: SVHN, CIFAR-10, and STL-10. For the black-box models, we use CNN models that achieve accuracies of 96.03%, 93.78%, and 77.59%, respectively. Table 2 shows the test accuracy of the black-box classifier before and after attaching REST. The performance improves when the REST model is applied at the front end of the black-box model.
Generalization
In the previous section, we showed that REST works well by interacting with the black-box model. However, a limitation of the previous experiment is that the training and test data are generated with the same family of transformations. In this section, we examine generalization by generating the training data and test data under different conditions.
We first set the black-box model to be a general MNIST classifier. We then generate a distorted MNIST dataset by rotation, scaling, and translation (RST) to train the REST learner. We constrain each transformation by choosing one option from (right, left), (up, down), and (right, left, up, down) for R, S, and T, respectively. Therefore, there are 2 × 2 × 4 = 16 possible RST transformations. We split the 16 RST transformations into two disjoint subsets and apply each subset to MNIST to generate the training data and the test data. Figure 4(a) shows an example of separating the 16 transformations into two subsets, each containing 8 transformations.
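The enumeration and split of the 16 RST transformations can be sketched as follows. The option labels follow the text; the random split with a fixed seed is an assumption for illustration and does not reproduce the specific partition shown in Figure 4(a).

```python
import itertools
import random

ROTATIONS = ("right", "left")
SCALES = ("up", "down")
TRANSLATIONS = ("right", "left", "up", "down")

# Every (rotation, scale, translation) combination: 2 * 2 * 4 = 16.
all_rst = list(itertools.product(ROTATIONS, SCALES, TRANSLATIONS))

def split_transformations(n_train, seed=0):
    """Randomly split the 16 combinations into disjoint train and test subsets."""
    rng = random.Random(seed)
    shuffled = all_rst[:]
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

train_rst, test_rst = split_transformations(8)
```

Varying `n_train` gives the settings in which fewer transformation combinations are seen during training and more novel ones appear at test time.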
We compare the REST model with CNN and STN, and the results are shown in Figure 4(b). When the training dataset is generated by 12 RST transformations, STN performs better than ours. However, as the number of RST compositions used for transforming the training data decreases, CNN and STN struggle to predict the correct labels. In contrast, our model shows only a slight decrease in accuracy, which indicates that it generalizes better under data shift. We attribute this to the exploration performed during training, which generates samples that are not present in the training data distribution.
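Table 3. Test accuracy (%) on affine-warped MNIST test data when only 1,000 affine-warped samples are available for training.

| Method | R | RSc | RSh | RSS | RSST |
|---|---|---|---|---|---|
| CNN | 87.24 | 82.91 | 76.67 | 70.97 | 38.16 |
| CNN++ | 88.69 | 85.86 | 80.48 | 76.58 | 39.92 |
| STN | 55.95 | 38.60 | 34.87 | 27.83 | 14.94 |
| STN++ | 93.63 | 92.13 | 89.37 | 85.39 | 80.80 |
| REST (ours) | 97.02 | 95.94 | 93.45 | 92.79 | 83.18 |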
Sample Efficiency
In this section, we examine the sample efficiency of our model. While 55,000 randomly affine-warped samples were generated from the 55,000 MNIST training images to train the REST learner in the previous experiments, here we generate only 1,000 randomly affine-warped samples from 1,000 MNIST training images. We then follow the same process to train and test the model. We compare our model with CNN and STN, both trained on the same 1,000 samples. We also train CNN and STN with 56,000 samples, of which 55,000 are MNIST training data and the other 1,000 are randomly affine-warped data. We call these models CNN++ and STN++, respectively. All models are tested on 10,000 affine-warped test samples, and the results are shown in Table 3.
As only a small amount of data is used for training CNN and STN, they yield low accuracy for all types of transformations. Although much more data is used for training CNN++ and STN++, they also underperform our model because only a small portion of their training data follows the same distribution as the test data. In contrast, our model achieves the best performance. We attribute this sample efficiency to the exploration process during training: in the process of creating states to train the model, new images are created in each episode. These images can be regarded as implicitly augmented training data.
Conclusion
Robustness to real-world data is essential for deep learning systems to be successfully deployed in practice. Most studies so far have focused on improving generalization performance on test datasets when a single dataset is randomly split into training and test sets. In other words, the training and test sets are drawn from the same distribution and have similar sample statistics. Under this experimental assumption, performance on real-world data is not necessarily guaranteed even if the system generalizes well on the test set.
In this paper, we addressed geometric invariance using deep RL by transforming out-of-distribution samples into the training distribution of the pretrained black-box classifier in the system. We showed that the proposed method, REST, can improve the robustness of deep learning systems to various image warpings. Specifically, as the complexity of the task gradually increased from simple rotation to full affine transformation, i.e., from one to six degrees of freedom, the performance gain of REST over the black-box model also increased accordingly.
We analyzed the generalization performance on unknown transformations by splitting 16 affine transformations into disjoint subsets for training and testing. REST generalized better than the baselines as it was trained with fewer transformation combinations while more novel, unseen transformations were given at test time. Lastly, we trained REST and the baseline methods with only 1,000 affine-warped training samples and showed that REST also learns efficiently from a small number of samples. The action space of our method is focused on geometric transformation in this work, but in future work it can be extended to other image processing techniques such as auto exposure, white balancing, edge enhancement, and noise reduction to close the gap between controlled experimental settings and real-world scenarios.
Acknowledgments
This work is in part supported by Basic Science Research Program (NRF2017R1A2B2007102) through NRF funded by MSIT, Technology Innovation Program (10051928) funded by MOTIE, BioMimetic Robot Research Center funded by DAPA (UD190018ID), Samsung Electronics AI Grant, MSITIITP grant (No.2019001367, BabyMind), INMAC, and BK21plus.
References

[1] (2015) Weight uncertainty in neural network. In International Conference on Machine Learning, pp. 1613–1622.
[2] (2015) Active object localization with deep reinforcement learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2488–2496.
[3] (2017) Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 764–773.
[4] (2016) Deep direct reinforcement learning for financial signal representation and trading. IEEE Transactions on Neural Networks and Learning Systems 28 (3), pp. 653–664.
[5] (2018) A rotationally-invariant convolution module by feature map back-rotation. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 784–792.
[6] (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059.
[7] (2016) Uncertainty in deep learning. University of Cambridge.
[8] (2011) Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pp. 2348–2356.
[9] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[10] (2019) Neural network gradient-based learning of black-box function interfaces. In International Conference on Learning Representations.
[11] (2015) Spatial transformer networks. In Advances in Neural Information Processing Systems, pp. 2017–2025.
[12] (2014) Adam: a method for stochastic optimization. International Conference on Learning Representations.
[13] (2017) Collaborative deep reinforcement learning for joint object search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1695–1704.
[14] (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373.
[15] (2018) A2-RL: aesthetics aware reinforcement learning for image cropping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8193–8201.
[16] (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
[17] (2016) Learning rotation invariant convolutional filters for texture classification. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 2012–2017.
[18] (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
[19] (2018) Distort-and-recover: color enhancement using deep reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5928–5936.
[20] (2017) Attention-aware deep reinforcement learning for video face recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3931–3940.
[21] (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788.
[22] (2018) Collaborative deep reinforcement learning for multi-object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 586–602.
[23] (2018) A scalable Laplace approximation for neural networks. In International Conference on Learning Representations.
[24] (2017) Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pp. 3856–3866.
[25] (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
[26] (2003) Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, Vol. 3.
[27] (2018) Deep progressive reinforcement learning for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5323–5332.
[28] (2017) Harmonic networks: deep translation and rotation equivariance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5028–5037.
[29] (2014) Scale-invariant convolutional neural networks. arXiv preprint arXiv:1411.6369.
[30] (2018) PRIN: pointwise rotation-invariant network. arXiv preprint arXiv:1811.09361.
[31] (2018) Crafting a toolchain for image restoration by deep reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2443–2452.
[32] (2017) Action-decision networks for visual tracking with deep reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2711–2720.