PyTorch implementation of PatchAutoAugment
Data augmentation (DA) plays a critical role in training deep neural networks for improving the generalization of models. Recent work has shown that automatic DA policies, such as AutoAugment (AA), significantly improve model performance. However, most automatic DA methods search for DA policies at the image level without considering that the optimal policies for different regions in an image may be diverse. In this paper, we propose a patch-level automatic DA algorithm called Patch AutoAugment (PAA). PAA divides an image into a grid of patches and searches for the optimal DA policy of each patch. Specifically, PAA allows each patch's DA operation to be controlled by an agent and models the search as a Multi-Agent Reinforcement Learning (MARL) problem. At each step, PAA samples the most effective operation for each patch based on its content and the semantics of the whole image. The agents cooperate as a team and share a unified team reward for achieving the joint optimal DA policy of the whole image. Experiments show that PAA consistently improves target network performance on many benchmark datasets for image classification and fine-grained image recognition. PAA also achieves remarkable computational efficiency, i.e., 2.3x faster than FastAA and 56.1x faster than AA on ImageNet.
Data Augmentation (DA) is an important technique to reduce overfitting risk by increasing the variety of training data. Simple augmentation methods, such as rotation and horizontal flipping, have proved effective in many vision tasks, including image classification [28, 10, 8], object detection [33, 44], etc.
Automatic DA methods have drawn much attention for their superior performance compared to human heuristic DA methods. They aim to automatically find effective data augmentation policies, thus greatly reducing the burden of tuning many DA hyperparameters (e.g., operation, magnitude and probability). The pioneering work AutoAugment (AA) first uses reinforcement learning to automatically search for the optimal DA policy and provides significant performance improvements, but its computational cost and time complexity remain high. Many studies [31, 23, 20] use hyperparameter optimization or a differentiable policy search pipeline on top of AA to reduce the costs and obtain results comparable to AA. Adversarial AA (AdvAA) and OHL-Auto-Aug are online methods that greatly reduce search time. However, these approaches search for policies at the image level and ignore that the optimal policies of different regions in an image, such as foreground objects and background, may be diverse.
Some works [17, 50, 52] heuristically explore diverse augmentations on different regions in the form of patches, but the performance of these manually designed methods usually falls short of automatic DA methods.
In this paper, we explore patch-level AutoAugment, which automatically searches for the optimal augmentation policies for the patches of an image, and propose Patch AutoAugment (PAA). In PAA, an image is divided into a grid of patches (see Figure 1) and each patch's augmentation operation is controlled by an agent (see Figure 2). We regard the search for the joint optimal policy over the patches of an image as a fully cooperative multi-agent task and use a Multi-Agent Reinforcement Learning (MARL) algorithm to solve it. At each step, the policy network outputs the augmentation operation for each patch according to the content of the patch and the semantics of the entire image. The agents cooperate with each other to achieve the optimal augmentation effect on the entire image by sharing a team reward. Augmentation operations are performed on patches with a certain probability and magnitude, and the augmented images are fed into the target network for training. Inspired by adversarial training [59, 49], the training loss of the target network serves as the reward signal for updating the augmentation network. Through adversarial training and MARL optimization, PAA provides meaningful regularization for the target network and improves overall performance.
Our contributions can be summarized as follows:
To the best of our knowledge, we are the first to study patch-level automatic augmentation policies. Our method can search for the optimal policy for each patch according to its content and the semantics of the image.
We model the search for the joint optimal DA policy over the patches of an image as a fully cooperative multi-agent task and use a MARL algorithm to solve it.
The experimental results show that PAA delivers significant performance improvements at a relatively low computational cost. Visualization results also demonstrate that PAA provides instructive suggestions on which operation to choose for patches with different content, and that these choices evolve dynamically during training.
Data augmentation (DA) is designed to increase the diversity of data to reduce overfitting. Manually designed methods require additional expertise, so many studies have explored automatic data augmentation methods to overcome the performance limitations of human heuristics. Smart Augmentation uses a network to generate augmented data by merging pairs of samples from the same class. Generative adversarial networks create augmented data directly [40, 45, 37, 61].
Several current studies aim to automate the process of finding the optimal combination of predefined transformation functions, which is different from the automatic generation methods. AutoAugment (AA) uses Reinforcement Learning (RL) to train an RNN controller to search for the best augmentation policy, which specifies which image processing operation to use, the probability of applying the operation, and its magnitude. PBA employs hyperparameter optimization to reduce the ultra-high computational cost of AA. Fast AutoAugment (FastAA) uses density matching to reduce complexity and obtains results comparable to AA. Faster AutoAugment proposes a differentiable policy search pipeline to further reduce the costs. Adversarial AA transfers from proxy tasks to target tasks and proposes an adversarial framework to jointly optimize the target network and the augmentation network. OHL-Auto-Aug proposes a new online data augmentation scheme based on an augmentation network co-trained with the target network.
Many region-based data augmentation methods [11, 60, 41, 17, 52, 50] have been widely used. Some early pioneering DA methods randomly remove [11, 60, 41] or replace [55, 47] patches, such as CutOut and CutMix. Some recent works argue that regions with different content should be augmented in different ways rather than selected at random. KeepAugment uses a saliency map to select important regions to keep untouched. Attentive CutMix utilizes attention maps to select attentive patches to be replaced. Using saliency detection, SaliencyMix restricts the replaced patch to the objects of interest. However, these are all hand-crafted DA methods and usually perform worse than automatic DA methods.
Distinct from applying a Reinforcement Learning (RL) algorithm directly to multi-agent systems, the most significant characteristic of Multi-Agent Reinforcement Learning (MARL) is the cooperation between agents [48, 35, 14]. Due to the limited observation and action of a single agent, cooperation is necessary in a reinforced multi-agent system to achieve the common goal. Compared with independent agents, cooperative agents can improve the efficiency and robustness of the model [38, 58, 1]. Many vision tasks use MARL to interact with a shared environment to make decisions, with the goal of maximizing the expected total return of all agents, such as image segmentation [30, 19, 34] and image processing.
As mentioned above, the images are divided into patches, and joint optimal augmentation must be achieved through the search for optimal augmentation policies for the patches. Patches in one image are obviously correlated with each other to some extent; therefore, we adopt a MARL algorithm to solve this cooperative multi-agent task and obtain the joint optimal policy of the whole image through the cooperation of agents. In this section, we introduce Patch AutoAugment (PAA) in detail. We first present the specific MARL formulation, then introduce the algorithms and training process, and finally analyze PAA from a theoretical perspective and discuss the search complexity.
As shown in Figure 2, we divide the image into an N x N grid of equal-sized, non-overlapping small patches, where p_i denotes the i-th patch of image x. We then treat the search for the optimal augmentation policies for the patches of an image, which together achieve the joint optimal augmentation of the whole image, as a fully cooperative multi-agent task, which can be described as a Dec-POMDP defined by a tuple. The augmentation of each patch is controlled by an agent, so there are N x N agents.
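The grid division described above can be sketched with `torch.Tensor.unfold`; this is an illustrative helper under our own naming, not the authors' released code:

```python
import torch

def split_into_patches(images: torch.Tensor, n: int) -> torch.Tensor:
    """Split a batch of images (B, C, H, W) into an n x n grid of
    equal-sized, non-overlapping patches, returned as (B, n*n, C, H/n, W/n)."""
    b, c, h, w = images.shape
    ph, pw = h // n, w // n
    # unfold height then width, producing the (n, n) grid dimensions
    patches = images.unfold(2, ph, ph).unfold(3, pw, pw)      # (B, C, n, n, ph, pw)
    patches = patches.permute(0, 2, 3, 1, 4, 5).contiguous()  # (B, n, n, C, ph, pw)
    return patches.view(b, n * n, c, ph, pw)

x = torch.randn(8, 3, 224, 224)
p = split_into_patches(x, 4)   # shape (8, 16, 3, 56, 56)
```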
State. The state s describes the situation of the entire environment and contains the observation information of all agents. In our MARL model, we use a backbone (such as ResNet-18) to extract features from the whole image as the state.
Observation. The observation o_i of agent i is the visible part of the state and is unique to that agent. Each agent draws its individual observation: the observation of the i-th agent is the feature of the i-th patch.
Action. a_i is the action of the i-th agent, drawn from the pre-defined action set that consists of fifteen common image processing functions: Shear, RandomErasing, MixUp, CutMix, Rotate, Contrast, Invert, Equalize, Gray, Posterize, Translate, Color, Brightness, Sharpness and Dropout. The details can be found in the Appendix. Each agent chooses an action a_i, which is the transformation performed on its patch. The joint action is actually an action map formed by the operations of all patches.
Reward. The reward penalizes agents that select augmentation operations without gain. Similar to [59, 32], we utilize adversarial learning to guide the joint training of the augmentation network and the target network. PAA aims to locate the weaknesses of the target network; specifically, the augmentation network tries to increase the training loss of the target network, and through this adversarial game the policy networks learn to select suitable samples. Therefore, we use the training loss of the target network as the reward:

r = L(F(A(x)), y),

where x and y denote inputs and labels in supervised tasks, A is the augmentation network, and F is the target network. In the classification task, L is the cross-entropy loss. In our PAA model, all agents share this unified team reward, encouraging cooperation toward a better joint augmentation policy.
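A minimal sketch of this team reward, assuming a standard classification target network (the helper name is ours, not from the paper's code):

```python
import torch
import torch.nn.functional as F

def team_reward(target_net, augmented_batch, labels):
    """Unified team reward shared by all agents: the target network's
    training loss on the augmented batch. Under the adversarial objective,
    a larger loss means the augmentation exposed harder samples."""
    with torch.no_grad():  # the reward signal does not backprop into the target
        logits = target_net(augmented_batch)
        return F.cross_entropy(logits, labels)
```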
Policy. The policy of each patch is denoted as pi(a_i | s, o_i). It describes the probability of selecting action a_i given state s and observation o_i. The objective of PAA is to learn the joint optimal policy that maximizes the total expected reward. PAA lets the policy networks of all agents share the same parameters, a common approach in MARL for communication between agents [21, 5, 6].
Policy Network. The augmentation network consists of the policy networks of all patches. In this paper, we adopt Advantage Actor Critic (A2C) [36, 4] for the PAA problem and extend it to a fully convolutional form. A2C is built on the Actor-Critic method, which has two networks. The actor network outputs the policy pi_theta (probabilities through a softmax) and thereby obtains the action map; the critic network V(s) evaluates the value function of the current state. A(s, a) = r - V(s) is called the advantage function, and the gradient for the actor parameters theta is computed as follows:

grad_theta J(theta) = E[ A(s, a) grad_theta log pi_theta(a | s, o) ],  theta <- theta + alpha grad_theta J(theta),

where alpha is the learning rate of the augmentation network. By updating the policy networks with this policy gradient, the critic guides the actor to choose augmentation policies that maximize the expected reward. The loss of the critic network is the squared error between the actual reward and the estimated state value, and its gradient is computed as follows:

L(phi) = (r - V_phi(s))^2,  grad_phi L(phi) = -2 (r - V_phi(s)) grad_phi V_phi(s).
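The actor and critic objectives above can be sketched as a generic A2C loss with a shared scalar team reward; the naming is ours and this is not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def a2c_losses(actor_logits, values, actions, reward):
    """Compute A2C losses for all patch agents under a shared team reward.
    actor_logits: (P, K) per-patch logits over the K operations
    values:       (P,)   critic estimates V(s) for each agent
    actions:      (P,)   sampled operation indices
    reward:       scalar team reward r (the target network's training loss)
    """
    advantage = (reward - values).detach()            # A(s, a) = r - V(s)
    log_probs = F.log_softmax(actor_logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    actor_loss = -(advantage * chosen).mean()         # policy-gradient objective
    critic_loss = F.mse_loss(values, reward.expand_as(values))  # (r - V(s))^2
    return actor_loss, critic_loss
```

In practice the two losses are backpropagated into the shared policy network with the learning rate mentioned above.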
Execution Probability and Magnitude. Each operation has two associated parameters: the probability p of applying the operation to the patch, and its magnitude. We observe that over-hard samples might have a negative impact on network convergence. As the name suggests, a hard sample is one that is difficult to learn (i.e., has a large loss). If all patches executed their selected operations, the sample would be too hard. Therefore, we introduce p as the execution probability. The execution flag e is sampled from the Bernoulli distribution: e = 1 means the patch is transformed and e = 0 means the original patch is kept. Because a large search space hurts performance, we exclude the probability and magnitude from the searched policy; we discuss this in detail in Section 3.3.
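The per-patch execution flag can be sampled as below (an illustrative sketch; the helper name is ours):

```python
import torch

def execution_flags(num_patches: int, p: float) -> torch.Tensor:
    """Sample e ~ Bernoulli(p) for each patch: e = 1 applies the
    selected operation, e = 0 keeps the original patch."""
    return torch.bernoulli(torch.full((num_patches,), p))
```

A patch is then updated as `e * op(patch) + (1 - e) * patch`, so only flagged patches are transformed.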
Inference and Training. As shown in Figure 2, PAA first divides an image into a grid of patches, and each patch's operation is controlled by an agent. We use a backbone to extract features from the whole image as the state. Each agent draws its individual observation, namely the feature of its patch. According to the state (global) and observation (local), the actor networks output the operations of the patches, which constitute an action map. The augmented mini-batch is processed with a certain probability and magnitude and then input to the target classification network for parameter updates. The training loss of the target network is fed back to the critic network to update the policy. The pseudo code for training PAA is shown in Algorithm 1.
[Table 1 header: Dataset | Model | Baseline | CutOut | AA | FastAA | PAA]
Theoretical Motivation. We analyze the motivation of PAA from the theory of risk minimization. Under the Vicinal Risk Minimization (VRM) principle, data augmentation constructs values in the vicinity of training samples to prevent the target network from memorizing the training data. Given a set of training data, we need to find a function f that describes the relationship between input x and label y. In light of VRM, we construct a dataset of m samples drawn from the vicinal distribution and aim to minimize the empirical vicinal risk:

R_v(f) = (1/m) sum_{i=1}^{m} l(f(x~_i), y~_i),

where (x~_i, y~_i) are samples from the vicinal distribution and l is the loss function.
Using a more granular augmentation, PAA covers vicinal distributions that cannot be reached by image-level augmentation. PAA induces a more generic vicinal distribution, in which the augmented image x~_i is assembled from independently augmented patches:

x~_i = ( o_{k_1}(x_i^1), ..., o_{k_{N x N}}(x_i^{N x N}) ),

where x_i^j is the j-th patch of the image x_i, and x~_i^j and x~_i are the augmented forms of the patch and the image, respectively. There are a total of K operation functions, and o_k is the k-th among them.
Search Complexity. The previously searched automatic policies [7, 31] contain operation, probability and magnitude. AdvAA and OHL-Auto-Aug directly remove the probability from the policy to reduce the search space. Our PAA further reduces the search space by applying MARL to control only the operation. The probability and magnitude in PAA are designed by hand, inspired by [9, 23, 59, 32]. Similar to prior work, we also choose a constant magnitude range for each operation; we discuss how to choose these values in Section 4.8. We show in Section 4.7 that a reasonable reduction of the search space reduces the time cost and improves the performance of the target network.
[Table 2 header: GPU hours | AA | FastAA | AdvAA | PAA]
|Model||Baseline||AA||PAA|
|ResNet-50||76.28 / 93.18||77.01 / 93.42||77.35 / 93.82|
|ResNet-152||78.31 / 93.98||78.63 / 94.23||78.92 / 94.46|
Table 3: Top-1 / Top-5 accuracy (%) on ImageNet. The standard deviation of PAA results is less than 0.15. All models are trained from scratch. Following previous work, we compare our PAA with the baseline method and AA.
We evaluate PAA on image classification datasets (CIFAR-10, CIFAR-100 and ImageNet) and three fine-grained object recognition datasets (CUB-200-2011, Stanford Dogs and Stanford Cars).
We implement our model in PyTorch and train it on NVIDIA 1080Ti GPUs. Compared with previous methods that use image-by-image sequential transformations [7, 31, 11], we speed up the process by performing parallel transformations on tensors. Specifically, we pick out the patches that perform the same operation and put them into a new tensor. Tensor transformations on the GPU are realized with Kornia, a differentiable computer vision library for PyTorch that we use to accelerate augmentation operations on tensors. The positions of the patches in each image are recorded so that the augmented batch can be reassembled. Parallel processing significantly reduces the computational cost. The policy network is a multi-layer convolutional neural network (CNN) used in all experiments.
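The grouping trick can be sketched as follows, assuming each operation is a function mapping a batch of patches to an equally shaped batch (e.g., a Kornia augmentation module); the helper is illustrative, not the released implementation:

```python
import torch

def apply_ops_in_parallel(patches, action_map, ops):
    """Apply per-patch operations batch-wise instead of one patch at a time.
    patches:    (P, C, h, w) flattened patches from the whole mini-batch
    action_map: (P,) operation index chosen for each patch
    ops:        list of K callables, each mapping a (M, C, h, w) tensor
                to a transformed (M, C, h, w) tensor
    """
    out = patches.clone()
    for k, op in enumerate(ops):
        idx = (action_map == k).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue
        # one batched call per operation instead of one call per patch
        out[idx] = op(patches[idx])
    return out
```

Because each operation runs once over all the patches assigned to it, the per-image Python loop disappears and the work stays on the GPU.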
Both CIFAR-10 and CIFAR-100 have 50,000 training examples. Each 32x32 image belongs to one of 10 categories (CIFAR-10) or 100 categories (CIFAR-100). We compare Patch AutoAugment (PAA) with the baseline, CutOut, AutoAugment (AA) and Fast AutoAugment (FastAA). The results of Adversarial AutoAugment (AdvAA) are not reported due to the unavailability of official source code. Following [7, 31, 59], we evaluate PAA using Wide-ResNet-28-10, Shake-Shake (26 2x32d), Shake-Shake (26 2x96d), Shake-Shake (26 2x112d), and PyramidNet+ShakeDrop [18, 53]. Each model is trained from scratch.
Experiment Setting. The baseline augmentation standardizes the data, horizontally flips with 0.5 probability, and applies zero-padding followed by random cropping. We also compare our PAA with CutOut, which sets the pixels of a randomly selected patch in an image to zero. For PAA, due to the small size of images in CIFAR-10 and CIFAR-100, we set the number of patches N to 2, and the execution probability p is sampled from the standard uniform distribution. In the ablation study, we discuss the design of p in depth.
Result and Analysis. The means and standard deviations (std) of the experimental results are shown in Table 1. We achieve the best performance among the SOTA methods on most networks. Specifically, we achieve 97.43% accuracy with the Wide-ResNet-28-10 model, which is 0.11% better than the SOTA AA method and 1.3% better than the baseline. Furthermore, PAA achieves an accuracy of 96.86% on Shake-Shake (26 2x32d), which is 0.18% higher than AA. We also compare PAA with other SOTA methods on CIFAR-100 in Table 1. We use a t-test for statistical significance: PAA shows a significant improvement over AA on most models at the 95% confidence level.
The image size and the number of patches on CIFAR-10 and CIFAR-100 are too small to fully highlight the effect of PAA, yet PAA still improves the performance of the target network at a lower computational cost.
The ImageNet dataset has about 1.2 million training images and 50,000 validation images across 1,000 classes. We compare our Patch AutoAugment (PAA) with the baseline and AutoAugment (AA). In this section, we evaluate our methods on ResNet-50 and ResNet-152, both trained from scratch.
Experiment Setting. For the baseline augmentation [24, 46], we randomly resize the input image and crop it to 224x224, then horizontally flip it with a probability of 0.5 and standardize the data. We set N = 4 in PAA, and the execution probability p is sampled from the standard uniform distribution.
Result and Analysis. The averages (std) of the experimental results on ImageNet are presented in Table 3. PAA achieves a top-1 accuracy of 77.35% and a top-5 accuracy of 93.82% on ResNet-50. Compared with the current state-of-the-art AA, PAA achieves a significant top-1 accuracy improvement of 0.34%. In addition, PAA's performance on the ResNet-152 model remains strong, with a top-1 accuracy improvement of 0.29% over AA. Statistical significance tests (at the 95% confidence level) also show that the improvement of PAA is significant.
We evaluate the performance of our proposed PAA on three standard fine-grained object recognition datasets. CUB-200-2011 consists of 6,000 training and 5,800 test bird images distributed over 200 categories. Stanford Dogs consists of 20,500 images of 120 dog breeds. Stanford Cars contains 16,185 images in 196 classes. Following the ImageNet setting, images in the above datasets are processed at 224x224. Following previous work [12, 3], we use pretrained ResNet-50 and ResNet-152 models.
Experiment Setting. We use the same baseline and AA settings as in the ImageNet experiments. We set N = 4 for PAA and linearly increase the execution probability p from 0.5 to 1 as the number of epochs increases.
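The linear schedule can be written as a small helper (our own sketch of the schedule described above):

```python
def linear_probability(epoch: int, total_epochs: int,
                       p_start: float = 0.5, p_end: float = 1.0) -> float:
    """Linearly increase the execution probability p from p_start to p_end
    over the course of training."""
    t = epoch / max(total_epochs - 1, 1)
    return p_start + (p_end - p_start) * t
```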
Result and Analysis. As shown in Table 4, the performance of PAA is consistently better than the other methods on ResNet-50 and ResNet-152. AA does not exceed the baseline on CUB-200-2011, a result not reported in the original work. On ResNet-50, compared to the baseline, the performance of PAA on CUB-200-2011, Stanford Dogs and Stanford Cars increases by 0.47%, 0.86%, and 1.16% respectively. Applying PAA to ResNet-152 on Stanford Dogs improves its validation accuracy from 85.26% to 85.97%. Our results show that PAA significantly exceeds both the baseline and AA. A t-test (at the 95% confidence level) confirms a statistically significant difference between PAA and AA. Our model achieves SOTA accuracy on these challenging fine-grained tasks.
We estimate the policy searching time and training time of PAA on CIFAR-10 and ImageNet. We compare PAA with AA , FastAA  and AdvAA . The computational cost is mainly used for searching policies and training the target network. As shown in Table 2, compared to the previous work, PAA has the lowest total computational cost, and the searching time is almost negligible.
There are three main reasons from our perspective. First, similar to [59, 32], we jointly optimize the augmentation network and the target network; the online augmentation policy saves most of the search time. Second, PAA reasonably reduces the search space, as discussed in Section 3.3. We use MARL to control only the operation, which simplifies the model and accelerates network convergence, reducing training time. The total number of PAA model parameters (about 0.23M) is less than 1% of the target network's (ResNet-50: about 25.5M). Third, we augment tensors consisting of patches that share the same operation, thereby achieving parallel processing. Compared to other online methods, PAA further reduces the training time of the target network. PAA achieves the SOTA total computational cost.
In this section, we further illustrate the effectiveness of the proposed PAA scheme by visualizing the feature map and the percentage of each operation.
Feature Visualization. To intuitively show the impact of PAA on the features learned by the CNN, we use Grad-CAM to visualize the focus of ResNet-50 trained with the baseline, AA and PAA, as shown in Figure 3. With the baseline or AA augmentation, the target network focuses on other parts rather than the object itself. For example, in Figure 3, we find that the network concentrates on the branch where the bird stands and treats the irrelevant branch as part of the bird. However, these parts are biases in the data itself. The feature map responses of PAA concentrate on parts of the object, such as the bird's beak. The feature visualization results show that PAA enables the classification network to truly focus on the more discriminative areas. More visualization results are provided in the Appendix.
Policy Visualization. In addition, we discuss the guiding significance of PAA for different types of patches. We divide patches into four levels according to the average patch importance calculated by GradCam : (1) most important (2) important (3) normal (4) not important. We collect the operation statistics for each class of patches, and draw a stacked area chart showing the percentages of operations over time. We take the training process of ResNet-50 on CUB-200-2011 as an example, as shown in Figure 4.
We find that at the beginning of the training process, the selected actions are messy since the MARL network is in the exploratory stage. In the middle of the training process, different types of patches have their own policies that prefer to select certain special operations. At the tail end of the training, the adversarial game between the augmentation network and the target network has reached a balance, so the percentages of all operations are almost the same.
Figure 4 also shows that for the patches that the network cares most about, RandomErasing and CutMix account for the majority of augmentation operations. The second and third important patches tend to select Shear and Dropout respectively. The patches least noticed by the network prefer color transformations. Therefore, we draw the conclusion that color transformation is mostly picked for patches in the background. Geometric transformations, such as RandomErasing and Shear, are chosen for the important patches where the object is located.
To better explain the effectiveness of MARL in our PAA, we perform ablation studies on CIFAR-10 and CIFAR-100 with Wide-ResNet-28-10, and on ImageNet with ResNet-50, as shown in Table 6. The parameter settings are consistent with the previous experiments. Here, "MARL" refers to putting a variable (the operation o or the execution probability p) into the policy and searching for it with MARL.
MARL vs. Random. "Random" refers to sampling the probability from a uniform distribution or selecting an operation from the set of transformations uniformly at random. (1) Comparing Patch RandomAugment (PRA: both o and p are Random) with PAA, we conclude that MARL plays a guiding role in selecting operations and improves network performance more than blindly selecting augmentation operations. (2) The result of Patch RandomAutoAugment (PRAA: p is Random but o is searched with MARL) shows that automatic control of the operation matters more than automatic control of the probability. (3) The performance of Patch Probability AutoAugment (PPAA: both o and p are searched with MARL) drops significantly. We attribute this degradation to the enlarged search space: searching for both probability and operation makes the space too large to find the optimal augmentation policy.
MARL vs. SARL. SARL means that a single-agent reinforcement learning algorithm is applied directly to the multi-agent system, with each agent independent of the others; that is, SARL simply feeds patches into an image-level automatic DA method. The experimental results show that cooperation between patches is crucial: the optimal augmentation policy of the whole image is composed of the optimal policies of all patches, and cooperation lets the target network achieve better performance.
In this section, we first discuss two important parameters and give suggestions on how to set them. Then we summarize the importance of PAA and why we automatically search for augmentation policies at the patch-level.
Grid Size. The grid size N is an important parameter. Table 7 shows the validation accuracy for different values of N on ImageNet and CUB-200-2011. With the increase of N, the recognition accuracy first increases and then decreases; on both datasets, the best performance is achieved when N = 4.
If we set N too small, PAA fails to effectively use local regularization, so its advantage is limited. In particular, when N = 1, the image is not divided, which reduces to the previous image-level automatic DA. When we set N too large, the vicinal area is too large and may be far from the original distribution (see Section 3.3). Some examples are shown in Figure 7.
Execution Probability. To demonstrate the influence of the execution probability p, we conduct experiments with different choices of p: (1) p is sampled from a standard uniform distribution, (2) p is a fixed value, (3) p changes linearly. As shown in Table 5, we empirically design the choice of p according to the target network's initialization. When the network is trained from scratch, sampling p from the uniform distribution works best. When using a pretrained network, linearly increasing p is optimal. We believe this is because the pretrained network already has some classification ability at the start, so it is necessary to generate harder samples to challenge the classifier.
Why patch-level automatic DA. We automatically search for the optimal policies for patches and thereby achieve the optimal joint policy of the whole image. This is more effective than simple heuristic region-based DA methods. Compared with image-level automatic DA, PAA can provide more visual modes and cover more vicinal distributions. PAA chooses augmentation policies for different parts of the image and degenerates into image-level automatic DA under certain circumstances. Through analyzing the visualization results, we find that our patch-disrupting method makes the target network focus on discriminative areas. For tasks with high requirements for details, such as fine-grained object recognition, our model is more advantageous.
In this paper, we propose Patch AutoAugment (PAA), an automatic data augmentation approach at the patch level. Our method adopts Multi-Agent Reinforcement Learning to automatically search for the optimal augmentation policy for each patch, and the joint optimal policy of the entire image is optimized through the cooperation of agents. PAA utilizes adversarial training to achieve online training. Extensive experiments demonstrate that PAA improves target network performance at low computational cost across many tasks. We also use visualization to further show that PAA helps the network focus on the most important or representative regions within the image. In future work, we will investigate different schemes for dividing regions and introduce a probability value for applying augmentation to the whole image.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5157–5166, 2019.
Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3379–3386, 2019.
Kornia: an open source differentiable computer vision library for PyTorch. In Winter Conference on Applications of Computer Vision, 2020.
International Conference on Machine Learning, pages 2576–2585. PMLR, 2019.
Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):1–48, 2019.
[Table 8 header: Layer | Actor network | Critic network — shared Feature Generator: ResNet-18]
Model Architecture of our PAA. For each convolution layer, we list the input dimension, output dimension, kernel size, stride, and padding. For the fully-connected layer, we provide the input and output dimension.
In this section, we provide the detailed model architecture for each component of our PAA model: the Actor network and the Critic network, which share the Feature Generator network. We use a pretrained ResNet-18 (excluding the final avgpool and softmax layers) to extract the features of the image and patches, as shown in Table 8.
|Operation Name||Description||Magnitudes (parameters in Kornia)|
|DropOut||Set all pixels in the patch to the average value of the patch.||-|
|Gray||Transform the patch to grayscale.||-|
|Invert||Invert the pixels of the patch.||-|
|Posterize||Reduce the number of bits for each pixel to magnitude bits.||bits=3|
|RandomErasing||Erase a random rectangle of the patch.|| |
|Rotation||Rotate the patch by magnitude degrees.||degrees=30.0|
|Color||Adjust the color balance of the image.||hue=(-0.3, 0.3)|
|Equalize||Equalize the image histogram.||-|
We list the fifteen common augmentation operations. For a fair comparison with previous work [7, 23, 31, 59], we use the same augmentation operations, explained in detail in Table 9. Kornia is a computer vision library for PyTorch; we use it to implement augmentations on tensors. Therefore, Table 9 also gives the parameter values we set for the functions in the Kornia library. These parameters correspond to the magnitudes of the augmentations. Some operations, such as DropOut and CutMix, have no parameters.
We detail the classifier and PAA model hyperparameters on CIFAR-10, CIFAR-100, ImageNet, CUB-200-2011 and Stanford Dogs in Table 10. We do not specifically tune these hyperparameters; all of them are consistent with previous works [7, 31, 59, 12, 3], except for the number of epochs on ImageNet and the fine-grained tasks. In addition, we use an SGD optimizer with an initial learning rate of 1e-4 to train the Actor and Critic networks.
We divide patches into four categories according to the average importance of patches calculated by Grad-CAM. For the CUB-200-2011 dataset, the most important patches (such as birds' heads or wings) prefer the RandomErasing and CutMix operations; the second and third most important patches (such as birds' bellies or toes) prefer Shear, DropOut and Translate; and unimportant patches (such as background) prefer color transformation operations. We show some patches that choose RandomErasing, Shear, Translate, and Color. From Figure 6, the most important patches prefer spatial transformations, while unimportant patches prefer color transformations. The experimental results show that our PAA model plays a guiding role in selecting augmentation policies for patches: depending on the type of patch, PAA selects different augmentation operations. We also use Grad-CAM to visualize the features extracted by ResNet-50 trained with the baseline, AutoAugment (AA) and PAA, as shown in Figure 8. Many examples show that PAA makes the network focus on the most important or representative regions in the image.
In this section, we use Grad-CAM to observe how the heatmap changes as N changes. If we set N to a small value, the network is distracted by other objects without paying attention to the target object; when N is too small, the role of the patch regularizer cannot be reflected. When N is too large, the network's attention diverges and becomes chaotic. In theory, the larger N is, the richer the regularization; but in practice, too many combinations between patches produce a large search space that hurts generalization, making it difficult to find an optimal augmentation policy. Some examples are shown in Figure 7. When N equals 4, the heatmap is concentrated in the most discriminative areas.