Patch AutoAugment

03/20/2021 ∙ by Shiqi Lin, et al. ∙ USTC

Data augmentation (DA) plays a critical role in training deep neural networks by improving the generalization of models. Recent work has shown that automatic DA policies, such as AutoAugment (AA), significantly improve model performance. However, most automatic DA methods search for DA policies at the image level, without considering that the optimal policies for different regions of an image may differ. In this paper, we propose a patch-level automatic DA algorithm called Patch AutoAugment (PAA). PAA divides an image into a grid of patches and searches for the optimal DA policy of each patch. Specifically, PAA lets an agent control the DA operation of each patch and models the search as a Multi-Agent Reinforcement Learning (MARL) problem. At each step, PAA samples the most effective operation for each patch based on its content and the semantics of the whole image. The agents cooperate as a team and share a unified team reward to achieve the joint optimal DA policy of the whole image. Experiments show that PAA consistently improves target network performance on many benchmark datasets for image classification and fine-grained image recognition. PAA is also remarkably computationally efficient, i.e., 2.3× faster than FastAA and 56.1× faster than AA on ImageNet.







1 Introduction

Data Augmentation (DA) is an important technique for reducing the risk of overfitting by increasing the variety of training data. Simple augmentation methods, such as rotation and horizontal flipping, have proven effective in many vision tasks, including image classification [28, 10, 8] and object detection [33, 44].

Recently, AutoAugment [7] and its variants [23, 31, 59, 49] have drawn much attention for their superior performance compared to human-designed DA methods. They aim to automatically find effective data augmentation policies, greatly reducing the burden of tuning the many hyperparameters in DA (e.g., operation, magnitude and probability). The pioneering work AutoAugment (AA) [7] first used reinforcement learning to automatically search for the optimal DA policy and achieved significant performance improvements, but its computational cost and time complexity remain high. Many studies [31, 23, 20] build on AA with hyperparameter optimization or a differentiable policy search pipeline to reduce the cost while obtaining results comparable to AA. Adversarial AA (AdvAA) [59] and OHL-Auto-Aug [32] are online methods that eliminate most of the searching time. However, all these approaches search for policies at the image level and ignore that the optimal policies for different regions of an image, such as the foreground object and the background, may differ.

Some works [17, 50, 52] heuristically apply diverse augmentations to different regions in the form of patches, but the performance of these manually designed methods is usually not as good as that of automatic DA methods.

Figure 1: Illustration of different automatic augmentation policies. (a) Original images; (b) results of image-level AutoAugment methods such as AA [7] and AdvAA [59]; (c) our patch-level AutoAugment results.

In this paper, we explore patch-level AutoAugment, which automatically searches for the optimal augmentation policies for the patches of an image, and propose Patch AutoAugment (PAA). In PAA, an image is divided into a grid of patches (see Figure 1), and each patch's augmentation operation is controlled by an agent (see Figure 2). We regard the search for the joint optimal policy over the patches of an image as a fully cooperative multi-agent task and solve it with a Multi-Agent Reinforcement Learning (MARL) algorithm. At each step, the policy network outputs the augmentation operation for each patch according to the content of the patch and the semantics of the entire image. The agents cooperate to achieve the optimal augmentation of the entire image by sharing a team reward. Augmentation operations are performed on patches with a certain probability and magnitude, and the augmented images are fed into the target network for training. Inspired by adversarial training [59, 49], the training loss of the target network serves as the reward signal for updating the augmentation network. Through adversarial training and MARL optimization, PAA provides meaningful regularization for the target network and improves overall performance.

Our contributions can be summarized as follows:

  • To the best of our knowledge, we are the first to study patch-level automatic augmentation policies. Our method searches for the optimal policy for each patch according to its content and the semantics of the image.

  • We model the search for joint optimal DA policy of patches in an image as a fully cooperative multi-agent task and use MARL algorithm to solve the problem.

  • Experimental results show that PAA yields significant performance improvements at a relatively low computational cost. Visualization results also show that PAA provides instructive suggestions on which operation to choose for patches with different content, and that these choices evolve during training.

Figure 2: The framework of our PAA. PAA divides the input image into a grid of patches. Each patch augmentation is controlled by an agent. PAA uses Multi-Agent Reinforcement Learning (MARL) to search for the joint optimal policy of patches according to the patch content and the whole image semantics. PAA is co-trained with a target network through adversarial training.

2 Related Work

2.1 Automatic Data Augmentation

Data augmentation (DA) is designed to increase the diversity of data in order to reduce overfitting [44]. Manually designed methods require additional expertise, so many studies have explored automatic data augmentation to overcome the performance limitations of human heuristics. Smart Augmentation [29] uses a network to generate augmented data by merging pairs of samples from the same class. Generative adversarial networks create augmented data directly [40, 45, 37, 61].

Several current studies aim to automate the process of finding the optimal combination of predefined transformation functions, which differs from the generation-based methods above. AutoAugment (AA) [7] uses Reinforcement Learning (RL) to train an RNN controller that searches for the best augmentation policy, which specifies the image processing operation to use, the probability of applying it, and its magnitude. PBA [23] employs hyperparameter optimization to reduce the very high computational cost of AA. Fast AutoAugment (FastAA) [31] uses density matching to reduce complexity and obtains results comparable to AA. Faster AutoAugment [20] proposes a differentiable policy search pipeline to further reduce the cost. Adversarial AA [59] moves from proxy tasks to the target task and proposes an adversarial framework that jointly optimizes the target network and the augmentation network. OHL-Auto-Aug [32] proposes an online data augmentation scheme in which the augmentation network is co-trained with the target network.

2.2 Heuristic Region-based Data Augmentation

Many region-based data augmentation methods [11, 60, 41, 17, 52, 50] are widely used. Early pioneering DA methods randomly remove [11, 60, 41] or replace [55, 47] patches, such as CutOut [11] and CutMix [55]. Some recent works argue that regions with different content should be augmented in different ways rather than being selected at random. KeepAugment [17] uses a saliency map to select important regions to keep untouched. Attentive CutMix [52] utilizes attention maps to select attentive patches to be replaced. Using saliency detection, SaliencyMix [50] restricts the replaced patch to the objects of interest. However, these are all hand-crafted DA methods and usually perform worse than automatic DA methods.
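For concreteness, the patch-removal idea behind CutOut can be sketched in a few lines. This is our illustrative NumPy version, not the authors' code; the patch size is the only hyperparameter here:

```python
import numpy as np

def cutout(image, size, rng=None):
    """Zero out a square patch of side `size` centered at a random
    location, clipped to the image bounds (CutOut-style removal)."""
    rng = rng if rng is not None else np.random.default_rng()
    h, w = image.shape[:2]
    cy, cx = int(rng.integers(0, h)), int(rng.integers(0, w))
    y1, y2 = max(0, cy - size // 2), min(h, cy + size // 2)
    x1, x2 = max(0, cx - size // 2), min(w, cx + size // 2)
    out = image.copy()
    out[y1:y2, x1:x2] = 0
    return out
```

Because the center is sampled uniformly and the square is clipped at the borders, patches near the edge are smaller, which matches the common implementation of CutOut.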

2.3 Multi-Agent Reinforcement Learning

Distinct from applying a Reinforcement Learning (RL) algorithm directly to a multi-agent system, the most significant characteristic of Multi-Agent Reinforcement Learning (MARL) is the cooperation between agents [48, 35, 14]. Because a single agent has limited observations and actions, cooperation is necessary in a reinforced multi-agent system to achieve a common goal. Compared with independent agents, cooperative agents can improve the efficiency and robustness of the model [38, 58, 1]. Many vision tasks use MARL to interact with a shared environment and make decisions that maximize the expected total return of all agents, such as image segmentation [30, 19, 34] and image processing [15].

3 Patch AutoAugment

As mentioned above, an image is divided into patches, and joint optimal augmentation is achieved by searching for the optimal augmentation policy of each patch. Since the patches of an image are correlated with each other to some extent, we adopt a MARL algorithm to solve this cooperative multi-agent task: the joint optimal policy of the whole image emerges from the cooperation of the agents. In this section, we introduce Patch AutoAugment (PAA) in detail. We first present the MARL formulation, then the algorithm and training process, and finally analyze PAA from a theoretical perspective and discuss the search complexity.

3.1 MARL Formulation

As shown in Figure 2, we divide the image into equal-sized, non-overlapping patches: x^i denotes the i-th patch of image x. We then treat the search for the optimal augmentation policies of the patches, whose combination yields the joint optimal augmentation policy of the whole image, as a fully cooperative multi-agent task, which can be described as a Dec-POMDP [39] defined by a tuple ⟨S, O, A, R, π⟩. The augmentation of each patch is controlled by one agent, so there are as many agents as patches.
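The patch division itself is a plain tensor reshape. A minimal PyTorch sketch (our illustration; the function name and the N-patches-per-side convention are assumptions, not the paper's code):

```python
import torch

def split_into_patches(images, n):
    """Split a batch (B, C, H, W) into an n x n grid of equal,
    non-overlapping patches, returned as (B, n*n, C, H//n, W//n)."""
    b, c, h, w = images.shape
    ph, pw = h // n, w // n
    patches = images.unfold(2, ph, ph).unfold(3, pw, pw)   # (B, C, n, n, ph, pw)
    return patches.permute(0, 2, 3, 1, 4, 5).reshape(b, n * n, c, ph, pw)
```

The second dimension indexes patches in row-major grid order, so patch 0 is the top-left corner of each image.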

State. The state s describes the situation of the entire environment and contains the observation information of all agents. In our MARL model, we use a backbone (such as ResNet-18 [22]) to extract a feature of the whole image as the state.

Observation. The observation o_i of agent i is the part of the state visible to it, which is unique to each agent: the observation of the i-th agent is the feature of the i-th patch.

Action. a_i denotes the action of the i-th agent, drawn from a pre-defined action set A of fifteen common image processing functions: Shear, RandomErasing, MixUp, CutMix, Rotate, AutoContrast, Invert, Equalize, Gray, Posterize, Contrast, Color, Brightness, Sharpness and Dropout. The details can be found in the Appendix. Each agent chooses an action a_i ∈ A, the transformation performed on its patch. The joint action is an action map formed by the operations of all patches.

Reward. The reward penalizes agents that select augmentation operations bringing no gain. Similar to [59, 32], we utilize adversarial learning to guide the joint training of the augmentation network and the target network: PAA tries to locate the weaknesses of the target network, so the augmentation network seeks to increase the target network's training loss. Through this adversarial game, the policy networks learn to select suitably hard samples. We therefore use the training loss of the target network as the reward:

R = L(f(a(x)), y),

where x and y denote the inputs and labels in supervised tasks, a is the augmentation network, and f is the target network. For the classification task, L is the cross-entropy loss. In our PAA model, all agents share this unified team reward, so cooperation drives the joint augmentation policy to improve.
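In code, the reward is simply the target network's loss evaluated on the augmented batch. A hedged sketch (function and variable names are ours, not the paper's):

```python
import torch
import torch.nn.functional as F

def team_reward(target_net, augmented_x, labels):
    """Unified team reward shared by all agents: the target network's
    loss on the augmented batch. A higher loss means the augmentation
    found a harder sample, i.e. a weakness of the target network."""
    with torch.no_grad():
        logits = target_net(augmented_x)
        return F.cross_entropy(logits, labels).item()
```

The `no_grad` context reflects that the reward is a scalar signal for the augmentation network, not a differentiable path into the target network.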

Policy. The policy of each agent is denoted π_i(a_i | s, o_i): the probability of selecting action a_i given state s and observation o_i. The objective of PAA is to learn the joint optimal policy that maximizes the total expected reward. PAA lets the policy networks of all agents share the same parameters, a common technique in MARL for communication between agents [21, 5, 6].

3.2 Algorithm

Policy Network. The augmentation network consists of the policy networks of all patches. In this paper, we adopt Advantage Actor Critic (A2C) [36, 4] for the PAA problem and extend it to a fully convolutional form. A2C builds on the actor-critic method, which has two networks. The actor network outputs the policy π(a | s, o) (probabilities via a softmax) of taking each action and thus produces the action map; the critic network estimates the value function V(s) of the current state. A(s) = R − V(s) is called the advantage function, and the gradient for the actor parameters θ_a is computed as follows:

∇_{θ_a} J = α ∇_{θ_a} log π(a | s, o) (R − V(s)),

where α is the learning rate of the augmentation network. By updating the policy networks with this policy gradient, the critic guides the actor toward augmentation policies that maximize the expected reward. The loss of the critic network is the squared error between the actual reward and the estimated state value, and the gradient for the critic parameters θ_c is computed accordingly:

∇_{θ_c} (R − V(s))^2.
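The two updates above can be condensed into one training-step loss pair. A minimal, self-contained sketch under our naming assumptions (the paper's fully convolutional networks are abstracted away; `log_probs` are log π(a | s, o) of the sampled actions and `values` are the critic's V(s) estimates):

```python
import torch
import torch.nn.functional as F

def a2c_losses(log_probs, values, reward):
    """One A2C step with a shared scalar team reward R:
    advantage A = R - V(s); the actor maximizes A-weighted log-probs
    (so its loss is the negative), the critic regresses V(s) toward R."""
    advantage = reward - values
    actor_loss = -(log_probs * advantage.detach()).mean()
    critic_loss = F.mse_loss(values, torch.full_like(values, reward))
    return actor_loss, critic_loss
```

Detaching the advantage keeps the policy gradient from flowing into the critic, matching the separate actor and critic updates in the equations above.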
Execution Probability and Magnitude. Each operation has two associated parameters: the probability p of applying the operation to the patch, and the magnitude. We observe that overly hard samples can harm network convergence; as the name suggests, a hard sample is one that is difficult to learn (i.e., has a large loss). If every patch executed its selected operation, the sample would become too hard, so we introduce p as the execution probability. The execution flag e is sampled from the Bernoulli distribution B(p): e = 1 means the patch is transformed, and e = 0 keeps the original. Because a large search space hurts performance, we exclude the probability and magnitude from the searched policy; we discuss this in detail in Section 3.3.
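Sampling the flags is one line in PyTorch; a small sketch with our (assumed) names:

```python
import torch

def sample_execution_flags(num_patches, p):
    """Draw per-patch flags e ~ Bernoulli(p): 1 -> apply the selected
    operation to the patch, 0 -> keep the patch unchanged."""
    return torch.bernoulli(torch.full((num_patches,), float(p)))
```

Setting p closer to 1 augments more patches per image, which is exactly how the linear schedules discussed later make samples harder over time.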

Inference and Training. As shown in Figure 2, PAA first divides an image into a grid of patches, each of whose operations is controlled by an agent. A backbone extracts a feature from the whole image as the state, and each agent draws its individual observation, namely the feature of its patch. According to the state (global) and observation (local), the actor networks output the operations of the patches, which together constitute an action map. The augmented mini-batch is processed with a certain probability and magnitude and then fed to the target classification network for its parameter update. The training loss of the target network is fed back to the critic network to update the policy. The pseudo code for training PAA is shown in Algorithm 1.

Dataset Model Baseline CutOut [11] AA [7] FastAA [31] PAA
CIFAR-10 Wide-ResNet-28-10 96.13 96.92 97.32 97.21 97.43
ShakeShake(26 2x32d) 96.26 96.54 96.68 96.60 96.86
ShakeShake(26 2x96d) 97.06 97.44 97.68 97.69 97.74
ShakeShake(26 2x112d) 97.01 97.43 97.78 97.74 97.62
PyramidNet+ShakeDrop 97.24 97.69 98.07 98.15 98.23
CIFAR-100 Wide-ResNet-28-10 81.20 82.52 82.72 82.52 82.74
ShakeShake(26 2x96d) 82.35 83.01 84.06 84.02 84.11
PyramidNet+ShakeDrop 83.77 85.57 86.02 85.00 85.80
Table 1: Validation top-1 accuracy (%) on CIFAR-10 [28] and CIFAR-100 [28] with different methods. All models are trained from scratch. The results of AdvAA [59] are not reported due to the unavailability of its official source code.
1: Initialize the augmentation model (backbone b, actor network with parameters θ_a, critic network with parameters θ_c), the target model f, and the training set D
2: while not converged do
3:     Sample input x and label y from D
4:     Divide x into patches
5:     for each agent i do
6:         Extract the global state s from x via the backbone b
7:         Extract the observation o_i (the feature of patch x^i)
8:         Compute the policy π_i(a_i | s, o_i)
9:         Sample the action a_i ∼ π_i
10:         Sample the execution probability p
11:         Sample the execution flag e_i ∼ B(p)
12:         Augment patch x^i if e_i = 1
13:     end for
14:     Assemble the augmented image x̃
15:     Update the target model f on (x̃, y)
16:     Obtain the reward R = L(f(x̃), y)
17:     Estimate the state value V(s)
18:     Calculate the advantage A = R − V(s)
19:     Update the actor network with the policy gradient
20:     Update the critic network with the squared-error loss
21: end while
Algorithm 1 Patch AutoAugment

3.3 Discussion

Theoretical Motivation. We analyze the motivation of PAA from the theory of risk minimization. Under the Vicinal Risk Minimization (VRM) principle [2], data augmentation constructs values in the vicinity of the training samples to prevent the target network from memorizing the training data. Given a training set D = {(x_i, y_i)}_{i=1}^m, we seek a function f that describes the relationship between input x and label y. In light of VRM, we construct a dataset D_ν = {(x̃_i, ỹ_i)}_{i=1}^m sampled from a vicinal distribution ν and aim to minimize the empirical vicinal risk:

R_ν(f) = (1/m) Σ_{i=1}^m L(f(x̃_i), ỹ_i).

Using a more granular augmentation, PAA covers vicinal distributions that image-level augmentation cannot reach. PAA induces a more generic vicinal distribution in which every patch may be transformed by a different operation:

x̃ = (a_{k_1}(x^1), …, a_{k_n}(x^n)),

where x^j is the j-th patch of image x, and a_{k_j}(x^j) and x̃ are the augmented forms of the patch and the image, respectively. There are K operation functions in total, and a_k is the k-th among them.

Search Complexity. Previous automatic search policies [7, 31] contain three components: operation, probability and magnitude. AdvAA [59] and OHL-Auto-Aug [32] remove the probability from the policy to shrink the search space. Our PAA reduces the search space further by applying MARL to control only the operation; the probability and magnitude in PAA are designed by hand, inspired by [9, 23, 59, 32]. Similar to [9], we choose a constant magnitude range for each operation. We discuss how to choose the execution probability p in Section 4.8, and we show in Section 4.7 that a reasonable reduction of the search space both lowers the time cost and improves the performance of the target network.

4 Experiments

GPU hours AA [7] FastAA [31] AdvAA [59] PAA
CIFAR-10 Search 5000 3.5 0 0

Train 6 6 - 7.5
Total 5006 9.5 - 7.5
ImageNet Search 15000 450 0 0
Train 160 160 1280 270
Total 15160 610 1280 270
Table 2: We train Wide-ResNet-28-10 on CIFAR-10 [28] and ResNet-50 on ImageNet [10]. Search: the time of searching for augmentation policies. Train: the time of training the target network. Total: the total time of searching for the augmentation policy and training the target network. The searching time of AdvAA [59] and our PAA is close to zero. The computational cost of PAA is estimated on GeForce GTX 1080 Ti GPUs, while AA [7] and AdvAA [59] use NVIDIA Tesla P100s and FastAA [31] uses NVIDIA Tesla V100s.
Method Baseline AA [7] PAA
ResNet-50 76.28 / 93.18 77.01 / 93.42 77.35 / 93.82
ResNet-152 78.31 / 93.98 78.63 / 94.23 78.92 / 94.46

Table 3: Validation top-1 / top-5 average accuracy (%) on ImageNet. The standard deviation of the PAA results is less than 0.15%. All models are trained from scratch. Following previous work, we compare our PAA with the baseline method and AA [7].
Dataset Model Baseline AA PAA
CUB-200 ResNet-50 83.89 82.59 84.36
-2011 [51] ResNet-152 86.71 85.27 86.83
Stanford ResNet-50 83.48 83.81 84.34
-Dogs [25] ResNet-152 85.26 85.23 85.97
Stanford ResNet-50 87.41 88.02 88.57
-Cars [27] ResNet-152 89.72 90.24 90.61
Table 4: We apply our PAA to fine-grained image recognition tasks and report the validation top-1 accuracy (%).

4.1 Experiment Overview

Datasets. We evaluate Patch AutoAugment (PAA) on the following datasets: CIFAR-10 [28], CIFAR-100 [28], ImageNet [10], and three fine-grained object recognition datasets (CUB-200-2011 [51], Stanford Dogs [25] and Stanford Cars [27]).
Implementation Details. We implement our model in PyTorch and train it on NVIDIA 1080Ti GPUs. Compared with previous methods that apply image-by-image sequential transformations [7, 31, 11], we speed up the process by performing parallel transformations on tensors. Specifically, we pick out the patches that perform the same operation and put them into a new tensor; tensor transformations on the GPU are realized with Kornia, a differentiable computer vision library for PyTorch that we use to accelerate augmentation operations on tensors. The positions of the patches in each image are recorded so that a new augmented batch can be restored. This parallel processing significantly reduces the computational cost. The policy network is a multi-layer convolutional neural network (CNN) in all experiments.
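The grouping step can be sketched as follows (an illustration of the batching idea with our own names; the actual per-operation transforms, e.g. via Kornia, are omitted):

```python
import torch

def group_patches_by_op(patches, actions, num_ops):
    """Collect the patches assigned to each operation into one tensor so
    that every operation runs once on its whole group (batched on GPU)
    instead of patch by patch. Returns {op: (indices, patch_tensor)};
    the indices let the augmented patches be written back in place."""
    groups = {}
    for op in range(num_ops):
        idx = (actions == op).nonzero(as_tuple=True)[0]
        if idx.numel() > 0:
            groups[op] = (idx, patches[idx])
    return groups
```

After each group is transformed, scattering the results back by the saved indices reconstructs the augmented batch, which is the "restore" step described above.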

We follow previous work [31, 59, 7] and use the same training schedules; more details are given in the Appendix. We run each experiment in this paper four times.

4.2 CIFAR-10 and CIFAR-100

Both CIFAR-10 and CIFAR-100 have 50,000 training examples; each 32×32 image belongs to one of 10 (respectively 100) categories. We compare Patch AutoAugment (PAA) with the baseline, CutOut [11], AutoAugment (AA) [7] and Fast AutoAugment (FastAA) [31]. The results of Adversarial AutoAugment (AdvAA) [59] are not reported due to the unavailability of its official source code. Following [7, 31, 59], we evaluate PAA using Wide-ResNet-28-10 [56], Shake-Shake (26 2x32d) [16], Shake-Shake (26 2x96d) [16], Shake-Shake (26 2x112d) [16], and PyramidNet+ShakeDrop [18, 53]. Each model is trained from scratch.

Experiment Setting. The baseline follows the convention for state-of-the-art (SOTA) CIFAR-10 models [42, 62, 53, 16]: standardizing the data, horizontally flipping with probability 0.5, and zero-padding followed by random cropping back to 32×32. We also compare our PAA with CutOut [11], which sets the pixels of a randomly selected patch in an image to zero. For PAA, due to the small size of the images in CIFAR-10 and CIFAR-100, we set the number of patches to N = 2, and the execution probability p is sampled from the standard uniform distribution U(0, 1). In the ablation study, we discuss the design of p in depth.

Result and Analysis. The means and standard deviations (std) of the experimental results are shown in Table 1. We achieve the best performance among the SOTA methods on most networks. Specifically, we reach 97.43% on Wide-ResNet-28-10, which is 0.11% better than the SOTA AA method and 1.3% better than the baseline. Furthermore, PAA achieves an accuracy of 96.86% on Shake-Shake (26 2x32d), 0.18% higher than AA. We also compare our PAA with other SOTA methods on CIFAR-100, likewise shown in Table 1. Using the T-test [26] for statistical significance, PAA shows a significant improvement over AA on most models at the 95% confidence level.

The image size and the number of patches on CIFAR-10 and CIFAR-100 are too small to fully highlight the effect of PAA, but PAA still improves the performance of the target network at a lower computational cost.

4.3 ImageNet

ImageNet has about 1.2 million training images and 50,000 validation images in 1,000 classes. We compare our Patch AutoAugment (PAA) with the baseline and AutoAugment (AA) [7]. In this section, we evaluate our method on ResNet-50 [22] and ResNet-152 [22], both trained from scratch.

Experiment Setting. For the baseline augmentation [24, 46], we randomly resize the input image, crop it to 224×224, horizontally flip it with probability 0.5, and standardize the data. We set N = 4 in PAA, and the execution probability p is sampled from U(0, 1).

Result and Analysis. The averages (std) of the experimental results on ImageNet are presented in Table 3. PAA achieves a top-1 accuracy of 77.35% and a top-5 accuracy of 93.82% on ResNet-50; compared with the current state-of-the-art AA, this is a significant top-1 improvement of 0.34%. The PAA performance on ResNet-152 also holds up, with a top-1 accuracy improvement of 0.29% over AA. Statistical significance tests (at the 95% confidence level) [26] likewise show that PAA yields a significant improvement.

4.4 PAA for Fine-grained Image Recognition

We evaluate the performance of our proposed PAA on three standard fine-grained object recognition datasets. CUB-200-2011 [51] consists of 6,000 training and 5,800 test bird images distributed over 200 categories. Stanford Dogs [25] consists of 20,500 images of 120 dog breeds. Stanford Cars [27] contains 16,185 images in 196 classes. Images in these datasets are processed at the same resolution as in our ImageNet experiments. Following previous work [12, 3], we use pretrained ResNet-50 and ResNet-152 models.

Experiment Setting. We use the same baseline and AA settings as in the ImageNet experiments. We set N = 4 for PAA and linearly increase the execution probability p from 0.5 to 1 as the number of epochs increases.
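The linear schedule for p can be written as a one-line annealing function; a sketch with our own naming (the 0.5-to-1 endpoints follow the setting above, the function itself is our illustration):

```python
def linear_execution_probability(epoch, total_epochs, p_start=0.5, p_end=1.0):
    """Linearly anneal p from p_start at the first epoch to p_end at the
    last, so more patches are augmented (samples get harder) over time."""
    if total_epochs <= 1:
        return p_end
    t = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return p_start + (p_end - p_start) * t
```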

Result and Analysis. As shown in Table 4, PAA consistently outperforms the other methods on ResNet-50 and ResNet-152. AA does not exceed the baseline on CUB-200-2011, a result not shown in [7]. On ResNet-50, compared to the baseline, PAA improves performance on CUB-200-2011, Stanford Dogs and Stanford Cars by 0.47%, 0.86% and 1.16% respectively. Applying PAA to ResNet-152 on Stanford Dogs improves its validation accuracy from 85.26% to 85.97%. Our results show that PAA significantly exceeds the baseline and AA; the T-test [26] (at the 95% confidence level) confirms a statistically significant difference between PAA and AA. Our model achieves SOTA accuracy on these challenging fine-grained tasks.

4.5 Computational Cost

We estimate the policy searching time and training time of PAA on CIFAR-10 and ImageNet. We compare PAA with AA [7], FastAA [31] and AdvAA [59]. The computational cost is mainly used for searching policies and training the target network. As shown in Table 2, compared to the previous work, PAA has the lowest total computational cost, and the searching time is almost negligible.

From our perspective, there are three main reasons. First, similar to [59, 32], we jointly optimize the augmentation network and the target network; this online augmentation policy saves most of the searching time. Second, PAA reasonably reduces the search space, as discussed in Section 3.3: using MARL to control only the operation simplifies the model, which accelerates network convergence and reduces training time. The total number of PAA model parameters (about 0.23M) is less than 1% of the target network's (ResNet-50: about 25.5M). Third, we augment tensors consisting of the patches that share the same operation, thereby achieving parallel processing. Compared to other online methods, PAA further reduces the training time of the target network and achieves the SOTA total computational cost.

4.6 Visualization

In this section, we further illustrate the effectiveness of the proposed PAA scheme by visualizing the feature map and the percentage of each operation.

Feature Visualization. To intuitively show the impact of PAA on the features learned by a CNN, we use GradCam [43] to visualize the focus of ResNet-50 trained with the baseline, AA [7] and PAA, as shown in Figure 3. With the baseline or AA, the target network attends to parts other than the object itself; for example, in Figure 3 the network concentrates on the branch where the bird stands and treats the irrelevant branch as part of the bird. Such parts are biases [54] in the data itself. The feature map responses of PAA concentrate on parts of the object, such as the bird's beak. These results show that PAA helps the classification network truly focus on the more discriminative areas. More visualization results are provided in the Appendix.

Policy Visualization. In addition, we discuss the guiding significance of PAA for different types of patches. We divide patches into four levels according to the average patch importance calculated by GradCam [43]: (1) most important (2) important (3) normal (4) not important. We collect the operation statistics for each class of patches, and draw a stacked area chart showing the percentages of operations over time. We take the training process of ResNet-50 on CUB-200-2011 as an example, as shown in Figure 4.

We find that at the beginning of the training process, the selected actions are messy since the MARL network is in the exploratory stage. In the middle of the training process, different types of patches have their own policies that prefer to select certain special operations. At the tail end of the training, the adversarial game between the augmentation network and the target network has reached a balance, so the percentages of all operations are almost the same.

Figure 4 also shows that for the patches that the network cares most about, RandomErasing and CutMix account for the majority of augmentation operations. The second and third important patches tend to select Shear and Dropout respectively. The patches least noticed by the network prefer color transformations. Therefore, we draw the conclusion that color transformation is mostly picked for patches in the background. Geometric transformations, such as RandomErasing and Shear, are chosen for the important patches where the object is located.

4.7 Ablation Study

Model U(0,1) 0.1 0.3 0.5 0.7 0.9 0.5→1 1→0.5

CIFAR-10, WRN-28-10 97.43 96.98 96.99 97.15 97.31 96.87 97.07 97.08
CUB-200-2011, ResNet-50 82.92 82.45 83.28 82.54 82.12 81.56 84.36 83.21
Table 5: Test top-1 accuracy (%) on CIFAR-10 with Wide-ResNet-28-10 trained from scratch and on CUB-200-2011 [51] with a pretrained ResNet-50. Higher is better. The probability p is designed in three ways: (1) p sampled from a standard uniform distribution U(0,1); (2) p a fixed value; (3) p changed linearly. 0.5→1 means that p increases linearly from 0.5 to 1 during training; in other words, the number of augmented patches per image increases linearly and the sample becomes harder.

To better explain the effectiveness of MARL in our PAA, we perform ablation studies on CIFAR-10 and CIFAR-100 with Wide-ResNet-28-10, and on ImageNet with ResNet-50, as shown in Table 6. The parameter settings are consistent with the previous sections. In Table 6, MARL means that the corresponding variable is put into the policy and searched with MARL.

MARL vs. Random. Random refers to sampling the probability from a uniform distribution or selecting an operation from the set of transformations uniformly at random. (1) Comparing Patch RandomAugment (PRA: operation and probability both Random) with PAA, we conclude that MARL plays a guiding role in selecting operations and improves network performance more than blindly selecting augmentation operations. (2) The result of Patch RandomAutoAugment (PRAA: the operation is Random while the probability is controlled by MARL) shows that automatic control of the operation matters more than that of the probability. (3) The performance of PatchProbability AutoAugment (PPAA: operation and probability both controlled by MARL) drops significantly; a likely reason is that jointly searching for probability and operation makes the search space too large to find the optimal augmentation policy.

MARL vs. SARL. SARL represents that the single-agent reinforcement learning algorithm is directly applied to the multi-agent system, and each agent is independent of each other. That is, SARL simply inputs patches into image-level automatic DA methods. The experimental results show that the cooperation between patches is crucial to achieve the optimal augmentation policy of the whole image, which is composed of all optimal patches’ policies, so that the target network can achieve better performance.

Method Operation Probability CIFAR-10 CIFAR-100 ImageNet
PRA random random 96.95 80.42 75.31
PRAA random MARL 96.99 81.21 76.50
PPAA MARL MARL 97.01 82.14 76.98
PAA (w/o cooperation) SARL random 97.25 82.35 77.12
PAA (Ours) MARL random 97.43 82.66 77.35
Table 6: We evaluate Patch RandomAugment (PRA), Patch RandomAutoAugment (PRAA) and PatchProbability AutoAugment (PPAA), reporting top-1 accuracy (%) on CIFAR-10 and CIFAR-100 with Wide-ResNet-28-10 and on ImageNet with ResNet-50. MARL means the variable is selected with MARL; random means sampling p from U(0,1) or selecting an operation uniformly at random; SARL means single-agent reinforcement learning applied directly to the multi-agent system, i.e., patches are fed to image-level automatic DA methods and processed independently (without cooperation).
| N | 1 | 2 | 4 | 7 | 14 |
|---|---|---|---|---|---|
| CUB-200-2011 | 82.67 | 83.75 | 84.36 | 83.61 | 83.85 |
| ImageNet | 76.14 | 76.80 | 77.35 | 77.12 | 76.66 |

Table 7: Top-1 test accuracy (%) of ResNet-50 on ImageNet and pretrained ResNet-50 on CUB-200-2011 for different N. Performance is optimal at N = 4.
Figure 3: Visualization of features learned by ResNet-50 trained with the baseline, AA [7] and PAA, using Grad-CAM [43]. PAA tends to help the target network focus on the discriminative areas.
Figure 4: We divide patches into four levels according to the average patch importance calculated by Grad-CAM [43]: (1) most important, (2) important, (3) normal, (4) not important. We plot a stacked area chart of the operations' percentages over time; the area represents the percentage of each operation.

4.8 Discussion

In this section, we first discuss two important parameters and give suggestions on how to set them. Then we summarize the significance of PAA and explain why we search for augmentation policies automatically at the patch level.

Grid Size. The grid size N, which determines the number of patches, is an important parameter. Table 7 shows the validation accuracy for N ∈ {1, 2, 4, 7, 14} on ImageNet and CUB-200-2011 (the input image sizes of the two datasets differ). As N increases, recognition accuracy first increases and then decreases; on both datasets the best performance is achieved at N = 4.

If we set N too small, PAA fails to exploit local regularization effectively, so its advantage is limited. In particular, when N = 1 the image is not divided at all, which is similar to previous image-level automatic DA. When N is set too large, the vicinal area grows too large and may drift far from the original distribution (see Section 3.3). Some examples are shown in Figure 7.
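The grid construction itself is straightforward. A minimal NumPy sketch (assuming the image height and width are divisible by N; the helper names are ours):

```python
import numpy as np

def split_into_patches(img: np.ndarray, n: int):
    """Split an HxWxC image into an n x n grid of patches (row-major order)."""
    h, w = img.shape[:2]
    ph, pw = h // n, w // n
    return [img[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
            for i in range(n) for j in range(n)]

def reassemble(patches, n: int) -> np.ndarray:
    """Stitch the n*n (possibly augmented) patches back into a full image."""
    rows = [np.concatenate(patches[i * n:(i + 1) * n], axis=1) for i in range(n)]
    return np.concatenate(rows, axis=0)
```

Each patch in the list would then be transformed by its agent's chosen operation before reassembly.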

Figure 5: PAA applied to some 'bird' images from CUB-200-2011 with different N.

Execution Probability. To study the influence of the execution probability, we experiment with three schedules: (1) sampling it from a standard uniform distribution, (2) keeping it at a fixed value, and (3) changing it linearly over training. As shown in Table 5, we choose the schedule empirically according to the target network's initialization. When the network is trained from scratch, sampling from the uniform distribution works best. When using a pretrained network, a linearly increasing probability is optimal; we attribute this to the pretrained network already having some classification ability at the start of training, so harder samples are needed to challenge the classifier.
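The three schedules can be sketched as follows (a toy illustration; the function and parameter names are ours, not from the paper):

```python
import random

def execution_probability(epoch: int, total_epochs: int,
                          mode: str = "uniform", fixed: float = 0.5) -> float:
    """Per-step probability of actually applying the sampled augmentation."""
    if mode == "uniform":   # training from scratch: resample U(0, 1) each step
        return random.random()
    if mode == "fixed":     # constant probability throughout training
        return fixed
    if mode == "linear":    # pretrained target network: ramp from 0 up to 1
        return epoch / max(1, total_epochs - 1)
    raise ValueError(f"unknown mode: {mode}")
```

Under the linear schedule, early epochs leave most patches unaugmented while later epochs augment almost every patch, matching the intuition that a pretrained classifier needs progressively harder samples.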

Why patch-level automatic DA. We automatically search for the optimal policy of each patch and thereby obtain the optimal joint policy of the whole image, which is more effective than simple heuristic region-based DA methods. Compared with image-level automatic DA, PAA provides more visual modes and covers more vicinal distributions: it chooses different augmentation policies for different parts of the image, and degenerates into image-level automatic DA in special cases (e.g., N = 1). The visualization results show that our patch-level perturbations make the target network focus on discriminative areas, so PAA is especially advantageous for tasks with high demands on detail, such as fine-grained recognition.

5 Conclusion

In this paper, we propose Patch AutoAugment (PAA), an automatic data augmentation approach at the patch level. Our method adopts Multi-Agent Reinforcement Learning to automatically search for the optimal augmentation policy for each patch, and the joint optimal policy of the entire image is optimized through the cooperation of agents. PAA utilizes adversarial training to achieve online training. Extensive experiments demonstrate that PAA improves target network performance at low computational cost on many tasks. Visualizations further show that PAA helps the network focus on the most important or representative regions of the image. In future work, we will investigate different schemes for dividing regions, and introduce a probability value for applying augmentation to the whole image.


Appendix A Model Architecture

| Layer | Actor network | Critic network |
|---|---|---|
| - | Feature Generator: ResNet-18 (shared) | |
| 1 | ReLU, Conv2D(32,64,3,1,1), BN | FC(1568,256) |
| 2 | ReLU, Conv2D(64,64,3,1,1), BN | ReLU |
| 3 | ReLU, Conv2D(64,15,3,2,1), Softmax | FC(256,1) |

Table 8: Model architecture of our PAA. For each convolution layer we list the input dimension, output dimension, kernel size, stride, and padding; for each fully-connected layer we list the input and output dimensions.

In this section, we provide the detailed model architecture of each component of our PAA model: the Actor network and the Critic network, which share the Feature Generator network. We use a pretrained ResNet-18 (excluding the final avgpool and softmax layers) to extract the features of the image and of each patch, as shown in Table 8.
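A PyTorch sketch of the two heads in Table 8. The shared ResNet-18 feature generator is omitted; the 32-channel 7x7 input feature map (implied by the critic's 1568 = 32*7*7 input dimension) and the spatial pooling before the actor's softmax are our assumptions:

```python
import torch
import torch.nn as nn

class ActorHead(nn.Module):
    """Outputs a distribution over the 15 augmentation operations (Table 8)."""
    def __init__(self, num_ops: int = 15):
        super().__init__()
        self.net = nn.Sequential(
            nn.ReLU(), nn.Conv2d(32, 64, 3, 1, 1), nn.BatchNorm2d(64),
            nn.ReLU(), nn.Conv2d(64, 64, 3, 1, 1), nn.BatchNorm2d(64),
            nn.ReLU(), nn.Conv2d(64, num_ops, 3, 2, 1),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        logits = self.net(feat).mean(dim=(2, 3))  # pool spatial dims (assumption)
        return torch.softmax(logits, dim=-1)

class CriticHead(nn.Module):
    """Outputs a scalar state value: FC(1568,256) -> ReLU -> FC(256,1)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(),
                                 nn.Linear(1568, 256), nn.ReLU(),
                                 nn.Linear(256, 1))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.net(feat)
```

Both heads consume the same feature map, matching the shared-generator design in Table 8.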


Figure 6: Left: some patches that receive RandomErasing [60], Shear, Translate, and Color. RandomErasing is generally selected for patches in discriminative areas such as the bird's head or feathers, while patches in the background select Color. Right: augmentation results for whole images. The most important patches (such as the bird's head and wings) mostly choose RandomErasing and CutMix [55]; the second most important patches (such as the bird's belly and toes) mostly choose DropOut [11] and Shear; most unimportant patches (such as the background) choose Gray or Color.
| Operation Name | Description | Magnitude (parameters in Kornia) |
|---|---|---|
| Brightness | Adjust the brightness of the patch. magnitude=0 gives a black patch, whereas magnitude=1 gives the original patch. | brightness=(0.5, 0.95) |
| Contrast | Control the contrast of the patch. magnitude=0 gives a gray patch, whereas magnitude=1 gives the original patch. | contrast=(0.5, 0.95) |
| CutMix [55] | Replace this patch with another patch (selected at random from the same mini-batch). | - |
| DropOut [11] | Set all pixels in this patch to the average value of the patch. | - |
| Gray | Transform the patch to grayscale. | - |
| Invert | Invert the pixels of the patch. | - |
| | Linearly add the image with another image (selected at random from the same mini-batch) with weight 0.5, without changing the label. | - |
| Posterize | Reduce the number of bits for each pixel to magnitude bits. | bits=3 |
| RandomErasing [60] | Erase a random rectangle of the patch. | scale=(0.09, 0.36), ratio=(0.5, 1/0.5) |
| Rotation | Rotate the patch by magnitude degrees. | degrees=30.0 |
| Sharpness | Adjust the sharpness of the image. magnitude=0 gives a blurred image, whereas magnitude=1 gives the original image. | |
| Shear | Shear the image along the horizontal or vertical axis. | shear=(-30, 30) |
| Translate | Translate the patch in the horizontal or vertical direction by an absolute fraction of the patch length. | translate=(0.4, 0.4) |
| Color | Adjust the color balance of the image. | hue=(-0.3, 0.3) |
| Equalize | Equalize the image histogram. | - |

Table 9: The fifteen augmentation operations we use. The third column gives the magnitude value of each operation according to a fixed magnitude schedule. Some transformations (e.g. CutMix and DropOut) do not use magnitude information.
Figure 7: PAA applied to some CUB-200-2011 'bird' images with different N, together with the corresponding augmentation and Grad-CAM [43] results.
| Dataset | Model | BatchSize | LR | WD | LD | LR-step | LR-A2C | Epoch |
|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | Wide-ResNet-28-10 | 128 | 0.1 | 5e-4 | cosine | - | 1e-3 | 200 |
| | ShakeShake(26 2x32d) | 128 | 0.2 | 1e-4 | cosine | - | 1e-4 | 600 |
| | ShakeShake(26 2x96d) | 128 | 0.2 | 1e-4 | cosine | - | 1e-4 | 600 |
| | ShakeShake(26 2x112d) | 128 | 0.2 | 1e-4 | cosine | - | 1e-4 | 600 |
| | PyramidNet+ShakeDrop | 128 | 0.1 | 1e-4 | cosine | - | 1e-4 | 600 |
| CIFAR-100 | Wide-ResNet-28-10 | 128 | 0.1 | 5e-4 | cosine | - | 1e-4 | 200 |
| | ShakeShake(26 2x96d) | 128 | 0.1 | 5e-4 | cosine | - | 1e-4 | 1200 |
| | PyramidNet+ShakeDrop | 128 | 0.5 | 1e-4 | cosine | - | 1e-4 | 1200 |
| ImageNet | ResNet-50 | 512 | 0.1 | 1e-4 | multistep | [30,60,90,120,150] | 1e-4 | 270 |
| | ResNet-152 | 512 | 0.1 | 1e-4 | multistep | [30,60,90,120,150] | 1e-4 | 270 |
| CUB-200-2011 | ResNet-50 (pretrained) | 512 | 1e-3 | 1e-4 | multistep | [30,60,90] | 1e-4 | 200 |
| | ResNet-152 (pretrained) | 512 | 1e-3 | 1e-4 | multistep | [30,60,90] | 1e-4 | 200 |
| Stanford Dogs | ResNet-50 (pretrained) | 512 | 1e-3 | 1e-4 | multistep | [30,60,90] | 1e-4 | 200 |
| | ResNet-152 (pretrained) | 512 | 1e-3 | 1e-4 | multistep | [30,60,90] | 1e-4 | 200 |

Table 10: Model hyperparameters on CIFAR-10, CIFAR-100, ImageNet, CUB-200-2011 and Stanford Dogs. LR is the learning rate of the target network, WD the weight decay, and LD the learning-rate decay method. If LD is multistep, we decay the learning rate 10-fold at the epochs given in LR-step (30, 60, 90, etc.). LR-A2C is the learning rate of the augmentation model.
Figure 8: More visualizations of augmentation and Grad-CAM [43] results for the baseline, AA [7] and our PAA. PAA performs rich patch regularizations and tends to help the model focus on invariant features.

Appendix B Operations and Details

We list the fifteen common augmentation operations we use. For a fair comparison with previous work [7, 23, 31, 59], we use the same augmentation operations, which are explained in detail in Table 9. Kornia [13] is a computer vision library for PyTorch that we use to implement the augmentations directly on tensors. Table 9 therefore also gives the parameter values we pass to the corresponding Kornia functions; these parameters correspond to the magnitudes of the augmentations. Some operations, such as DropOut and CutMix, take no parameters.

Appendix C Hyperparameters

We detail the classifier and PAA model hyperparameters on CIFAR-10, CIFAR-100, ImageNet, CUB-200-2011 and Stanford Dogs in Table 10. We do not specifically tune these hyperparameters; all are consistent with previous works [7, 31, 59, 12, 3], except for the number of epochs on ImageNet and the fine-grained tasks. In addition, we use the SGD optimizer with an initial learning rate of 1e-4 to train the Actor and Critic networks.

Appendix D Visualization

We divide patches into four categories according to the average patch importance calculated by Grad-CAM [43]. On the CUB-200-2011 dataset, the most important patches (such as birds' heads or wings) prefer RandomErasing and CutMix; the second and third most important patches (such as birds' bellies or toes) prefer Shear, DropOut [11] and Translate; and unimportant patches (such as the background) prefer color transformation operations. Figure 6 shows some patches that choose RandomErasing [60], Shear, Translate, and Color: the most important patches prefer spatial transformations, while unimportant patches prefer color transformations. These results indicate that PAA plays a guiding role in selecting each patch's augmentation policy, choosing different operations depending on the type of patch. We also use Grad-CAM [43] to visualize the features extracted by ResNet-50 trained with the baseline, AutoAugment (AA) [7] and PAA, as shown in Figure 8; many examples show that PAA makes the network focus on the most important or representative regions of the image.

Appendix E Grid Size

In this section, we use Grad-CAM [43] to observe how the heatmap changes with N. If N is set too small, the role of the patch regularizer cannot be exploited, and the network is distracted by irrelevant content instead of attending to the target object. If N is too large, the network's attention diverges and becomes chaotic: in theory, larger N yields richer regularization, but in practice the number of combinations between patches explodes, producing a search space so large that it hurts generalization and makes the optimal augmentation policy difficult to find. Some examples are shown in Figure 7. When N equals 4, the heatmap concentrates on the most discriminative areas.