Deep neural nets are powerful machine learning systems that tend to work well when trained on massive amounts of data. Data augmentation is an effective technique to increase both the amount and diversity of data by randomly “augmenting” it[3, 54, 29]; in the image domain, common augmentations include translating the image by a few pixels, or flipping the image horizontally. Intuitively, data augmentation is used to teach a model about invariances in the data domain: classifying an object is often insensitive to horizontal flips or translation. Network architectures can also be used to hardcode invariances: convolutional networks bake in translation invariance [16, 32, 25, 29]. However, using data augmentation to incorporate potential invariances can be easier than hardcoding invariances into the model architecture directly.
|Dataset||GPU||Best published||Our results|
for more detailed comparison. GPU hours are estimated for an NVIDIA Tesla P100.
Yet a large focus of the machine learning and computer vision community has been to engineer better network architectures (e.g.,[55, 59, 20, 58, 64, 19, 72, 23, 48]). Less attention has been paid to finding better data augmentation methods that incorporate more invariances. For instance, on ImageNet, the data augmentation approach by , introduced in 2012, remains the standard with small changes. Even when augmentation improvements have been found for a particular dataset, they often do not transfer to other datasets as effectively. For example, horizontal flipping of images during training is an effective data augmentation method on CIFAR-10, but not on MNIST, due to the different symmetries present in these datasets. The need for automatically learned data-augmentation has been raised recently as an important unsolved problem .
In this paper, we aim to automate the process of finding an effective data augmentation policy for a target dataset. In our implementation (Section 3
), each policy expresses several choices and orders of possible augmentation operations, where each operation is an image processing function (e.g., translation, rotation, or color normalization), the probabilities of applying the function, and the magnitudes with which they are applied. We use a search algorithm to find the best choices and orders of these operations such that training a neural network yields the best validation accuracy. In our experiments, we use Reinforcement Learning as the search algorithm, but we believe the results can be further improved if better algorithms are used [48, 39].
Our extensive experiments show that AutoAugment achieves excellent improvements in two use cases: 1) AutoAugment can be applied directly on the dataset of interest to find the best augmentation policy (AutoAugment-direct) and 2) learned policies can be transferred to new datasets (AutoAugment-transfer). Firstly, for direct application, our method achieves state-of-the-art accuracy on datasets such as CIFAR-10, reduced CIFAR-10, CIFAR-100, SVHN, reduced SVHN, and ImageNet (without additional data). On CIFAR-10, we achieve an error rate of 1.5%, which is 0.6% better than the previous state-of-the-art . On SVHN, we improve the state-of-the-art error rate from 1.3%  to 1.0%. On reduced datasets, our method achieves performance comparable to semi-supervised methods without using any unlabeled data. On ImageNet, we achieve a Top-1 accuracy of 83.5% which is 0.4% better than the previous record of 83.1%. Secondly, if direct application is too expensive, transferring an augmentation policy can be a good alternative. For transferring an augmentation policy, we show that policies found on one task can generalize well across different models and datasets. For example, the policy found on ImageNet leads to significant improvements on a variety of FGVC datasets. Even on datasets for which fine-tuning weights pre-trained on ImageNet does not help significantly , e.g. Stanford Cars  and FGVC Aircraft 
, training with the ImageNet policy reduces test set error by 1.2% and 1.8%, respectively. This result suggests that transferring data augmentation policies offers an alternative method for standard weight transfer learning. A summary of our results is shown in Table1.
2 Related Work
Common data augmentation methods for image recognition have been designed manually and the best augmentation strategies are dataset-specific. For example, on MNIST, most top-ranked models use elastic distortions, scale, translation, and rotation [54, 8, 62, 52]. On natural image datasets, such as CIFAR-10 and ImageNet, random cropping, image mirroring and color shifting / whitening are more common . As these methods are designed manually, they require expert knowledge and time. Our approach of learning data augmentation policies from data in principle can be used for any dataset, not just one.
This paper introduces an automated approach to find data augmentation policies from data. Our approach is inspired by recent advances in architecture search, where reinforcement learning and evolution have been used to discover model architectures from data [71, 4, 72, 7, 35, 13, 34, 46, 49, 63, 48, 9]. Although these methods have improved upon human-designed architectures, it has not been possible to beat the 2% error-rate barrier on CIFAR-10 using architecture search alone.
Previous attempts at learned data augmentations include Smart Augmentation, which proposed a network that automatically generates augmented data by merging two or more samples from the same class . Tran et al. used a Bayesian approach to generate data based on the distribution learned from the training set . DeVries and Taylor used simple transformations in the learned feature space to augment data .
Generative adversarial networks have also been used for the purpose of generating additional data (e.g., [45, 41, 70, 2, 56]). The key difference between our method and generative models is that our method generates symbolic transformation operations, whereas generative models, such as GANs, generate the augmented data directly. An exception is work by Ratner et al., who used GANs to generate sequences that describe data augmentation strategies .
3 AutoAugment: Searching for best Augmentation policies Directly on the Dataset of Interest
We formulate the problem of finding the best augmentation policy as a discrete search problem (see Figure 1). Our method consists of two components: A search algorithm and a search space. At a high level, the search algorithm (implemented as a controller RNN) samples a data augmentation policy , which has information about what image processing operation to use, the probability of using the operation in each batch, and the magnitude of the operation. Key to our method is the fact that the policy will be used to train a neural network with a fixed architecture, whose validation accuracy will be sent back to update the controller. Since is not differentiable, the controller will be updated by policy gradient methods. In the following section we will describe the two components in detail.
Search space details:
In our search space, a policy consists of 5 sub-policies with each sub-policy consisting of two image operations to be applied in sequence. Additionally, each operation is also associated with two hyperparameters: 1) the probability of applying the operation, and 2) the magnitude of the operation.
Figure 2 shows an example of a policy with 5-sub-policies in our search space. The first sub-policy specifies a sequential application of ShearX followed by Invert. The probability of applying ShearX is 0.9, and when applied, has a magnitude of 7 out of 10. We then apply Invert with probability of 0.8. The Invert operation does not use the magnitude information. We emphasize that these operations are applied in the specified order.
The operations we used in our experiments are from PIL, a popular Python image library.111https://pillow.readthedocs.io/en/5.1.x/ For generality, we considered all functions in PIL that accept an image as input and output an image. We additionally used two other promising augmentation techniques: Cutout  and SamplePairing . The operations we searched over are ShearX/Y, TranslateX/Y, Rotate, AutoContrast, Invert, Equalize, Solarize, Posterize, Contrast, Color, Brightness, Sharpness, Cutout , Sample Pairing .222Details about these operations are listed in Table 1 in the Appendix. In total, we have 16 operations in our search space. Each operation also comes with a default range of magnitudes, which will be described in more detail in Section 4. We discretize the range of magnitudes into 10 values (uniform spacing) so that we can use a discrete search algorithm to find them. Similarly, we also discretize the probability of applying that operation into 11 values (uniform spacing). Finding each sub-policy becomes a search problem in a space of possibilities. Our goal, however, is to find 5 such sub-policies concurrently in order to increase diversity. The search space with 5 sub-policies then has roughly possibilities.
The 16 operations we used and their default range of values are shown in Table 1 in the Appendix. Notice that there is no explicit “Identity” operation in our search space; this operation is implicit, and can be achieved by calling an operation with probability set to be 0.
. The search algorithm has two components: a controller, which is a recurrent neural network, and the training algorithm, which is the Proximal Policy Optimization algorithm. At each step, the controller predicts a decision produced by a softmax; the prediction is then fed into the next step as an embedding. In total the controller has 30 softmax predictions in order to predict 5 sub-policies, each with 2 operations, and each operation requiring an operation type, magnitude and probability.
The training of controller RNN: The controller is trained with a reward signal, which is how good the policy is in improving the generalization of a “child model” (a neural network trained as part of the search process). In our experiments, we set aside a validation set to measure the generalization of a child model. A child model is trained with augmented data generated by applying the 5 sub-policies on the training set (that does not contain the validation set). For each example in the mini-batch, one of the 5 sub-policies is chosen randomly to augment the image. The child model is then evaluated on the validation set to measure the accuracy, which is used as the reward signal to train the recurrent network controller. On each dataset, the controller samples about 15,000 policies.
Architecture of controller RNN and training hyperparameters: We follow the training procedure and hyperparameters from  for training the controller. More concretely, the controller RNN is a one-layer LSTM  with 100 hidden units at each layer and B softmax predictions for the two convolutional cells (where B is typically 5) associated with each architecture decision. Each of the 10B predictions of the controller RNN is associated with a probability. The joint probability of a child network is the product of all probabilities at these 10B softmaxes. This joint probability is used to compute the gradient for the controller RNN. The gradient is scaled by the validation accuracy of the child network to update the controller RNN such that the controller assigns low probabilities for bad child networks and high probabilities for good child networks. Similar to , we employ Proximal Policy Optimization (PPO)  with learning rate 0.00035. To encourage exploration we also use an entropy penalty with a weight of 0.00001. In our implementation, the baseline function is an exponential moving average of previous rewards with a weight of 0.95. The weights of the controller are initialized uniformly between -0.1 and 0.1. We choose to train the controller using PPO out of convenience, although prior work had shown that other methods (e.g. augmented random search and evolutionary strategies) can perform as well or even slightly better .
At the end of the search, we concatenate the sub-policies from the best 5 policies into a single policy (with 25 sub-policies). This final policy with 25 sub-policies is used to train the models for each dataset.
4 Experiments and Results
Summary of Experiments.
In this section, we empirically investigate the performance of AutoAugment in two use cases: AutoAugment-direct and AutoAugment-transfer. First, we will benchmark AutoAugment with direct search for best augmentation policies on highly competitive datasets: CIFAR-10 , CIFAR-100 , SVHN  (Section 4.1), and ImageNet  (Section 4.2) datasets. Our results show that a direct application of AutoAugment improves significantly the baseline models and produces state-of-the-art accuracies on these challenging datasets. Next, we will study the transferability of augmentation policies between datasets. More concretely, we will transfer the best augmentation policies found on ImageNet to fine-grained classification datasets such as Oxford 102 Flowers, Caltech-101, Oxford-IIIT Pets, FGVC Aircraft, Stanford Cars (Section 4.3). Our results also show that augmentation policies are surprisingly transferable and yield significant improvements on strong baseline models on these datasets. Finally, in Section 5, we will compare AutoAugment against other automated data augmentation methods and show that AutoAugment is significantly better.
4.1 CIFAR-10, CIFAR-100, SVHN Results
Although CIFAR-10 has 50,000 training examples, we perform the search for the best policies on a smaller dataset we call “reduced CIFAR-10”, which consists of 4,000 randomly chosen examples, to save time for training child models during the augmentation search process (We find that the resulting policies do not seem to be sensitive to this number). We find that for a fixed amount of training time, it is more useful to allow child models to train for more epochs rather than train for fewer epochs with more training data. For the child model architecture we use small Wide-ResNet-40-2 (40 layers - widening factor of 2) model, and train for 120 epochs. The use of a small Wide-ResNet is for computational efficiency as each child model is trained from scratch to compute the gradient update for the controller. We use a weight decay of 10, learning rate of 0.01, and a cosine learning decay with one annealing cycle .
The policies found during the search on reduced CIFAR-10 are later used to train final models on CIFAR-10, reduced CIFAR-10, and CIFAR-100. As mentioned above, we concatenate sub-policies from the best 5 policies to form a single policy with 25 sub-policies, which is used for all of AutoAugment experiments on the CIFAR datasets.
The baseline pre-processing follows the convention for state-of-the-art CIFAR-10 models: standardizing the data, using horizontal flips with 50% probability, zero-padding and random crops, and finally Cutout with 16x16 pixels[17, 65, 48, 72]. The AutoAugment policy is applied in addition to the standard baseline pre-processing: on one image, we first apply the baseline augmentation provided by the existing baseline methods, then apply the AutoAugment policy, then apply Cutout. We did not optimize the Cutout region size, and use the suggested value of 16 pixels . Note that since Cutout is an operation in the search space, Cutout may be used twice on the same image: the first time with learned region size, and the second time with fixed region size. In practice, as the probability of the Cutout operation in the first application is small, Cutout is often used once on a given image.
On CIFAR-10, AutoAugment picks mostly color-based transformations. For example, the most commonly picked transformations on CIFAR-10 are Equalize, AutoContrast, Color, and Brightness (refer to Table 1 in the Appendix for their descriptions). Geometric transformations like ShearX and ShearY are rarely found in good policies. Furthermore, the transformation Invert is almost never applied in a successful policy. The policy found on CIFAR-10 is included in the Appendix. Below, we describe our results on the CIFAR datasets using the policy found on reduced CIFAR-10. All of the reported results are averaged over 5 runs.
models in TensorFlow, and find the weight decay and learning rate hyperparameters that give the best validation set accuracy for regular training with baseline augmentation. Other hyperparameters are the same as reported in the papers introducing the models [67, 17, 65], with the exception of using a cosine learning decay for the Wide-ResNet-28-10. We then use the same model and hyperparameters to evaluate the test set accuracy of AutoAugment. For AmoebaNets, we use the same hyperparameters that were used in  for both baseline augmentation and AutoAugment. As can be seen from the table, we achieve an error rate of 1.5% with the ShakeDrop  model, which is 0.6% better than the state-of-the-art . Notice that this gain is much larger than the previous gains obtained by AmoebaNet-B against ShakeDrop (+0.2%), and by ShakeDrop against Shake-Shake (+0.2%). Ref.  reports an improvement of 1.1% for a Wide-ResNet-28-10 model trained on CIFAR-10.
|Shake-Shake (26 2x32d) ||3.6||3.0||2.5|
|Shake-Shake (26 2x96d) ||2.9||2.6||2.0|
|Shake-Shake (26 2x112d) ||2.8||2.6||1.9|
|AmoebaNet-B (6,128) ||3.0||2.1||1.8|
|Reduced CIFAR-10||Wide-ResNet-28-10 ||18.8||16.5||14.1|
|Shake-Shake (26 2x96d) ||17.1||13.4|
|Shake-Shake (26 2x96d) ||17.1||16.0||14.3|
|Shake-Shake (26 2x96d) ||1.4||1.2||1.0|
|Reduced SVHN||Wide-ResNet-28-10 ||13.2||32.5||8.2|
|Shake-Shake (26 2x96d) ||12.3||24.2||5.9|
We also evaluate our best model trained with AutoAugment on a recently proposed CIFAR-10 test set . Recht et al.  report that Shake-Shake (26 2x64d) + Cutout performs best on this new dataset, with an error rate of 7.0% (4.1% higher relative to error rate on the original CIFAR-10 test set). Furthermore, PyramidNet+ShakeDrop achieves an error rate of 7.7% on the new dataset (4.6% higher relative to the original test set). Our best model, PyramidNet+ShakeDrop trained with AutoAugment achieves an error rate of 4.4% (2.9% higher than the error rate on the original set). Compared to other models evaluated on this new dataset, our model exhibits a significantly smaller drop in accuracy.
We also train models on CIFAR-100 with the same AutoAugment policy found on reduced-CIFAR-10; results are shown in Table 2. Again, we achieve the state-of-art result on this dataset, beating the previous record of 12.19% error rate by ShakeDrop regularization .
Finally, we apply the same AutoAugment policy to train models on reduced CIFAR-10 (the same 4,000 example training set that we use to find the best policy). Similar to the experimental convention used by the semi-supervised learning community[60, 40, 51, 31, 44] we train on 4,000 labeled samples. But we do not use the 46,000 unlabeled samples during training. Our results shown in Table 2. We note that the improvement in accuracy due to AutoAugment is more significant on the reduced dataset compared to the full dataset. As the size of the training set grows, we expect that the effect of data-augmentation will be reduced. However, in the next sections we show that even for larger datasets like SVHN and ImageNet, AutoAugment can still lead to improvements in generalization accuracy.
We experimented with the SVHN dataset , which has 73,257 training examples (also called “core training set”), and 531,131 additional training examples. The test set has 26,032 examples. To save time during the search, we created a reduced SVHN dataset of 1,000 examples sampled randomly from the core training set. We use AutoAugment to find the best policies. The model architecture and training procedure of the child models are identical to the above experiments with CIFAR-10.
The policies picked on SVHN are different than the transformations picked on CIFAR-10. For example, the most commonly picked transformations on SVHN are Invert, Equalize, ShearX/Y, and Rotate. As mentioned above, the transformation Invert is almost never used on CIFAR-10, yet it is very common in successful SVHN policies. Intuitively, this makes sense since the specific color of numbers is not as important as the relative color of the number and its background. Furthermore, geometric transformations ShearX/Y are two of the most popular transformations on SVHN. This also can be understood by general properties of images in SVHN: house numbers are often naturally sheared and skewed in the dataset, so it is helpful to learn the invariance to such transformations via data augmentation. Five successful sub-policies are visualized on SVHN examples in Figure2.
After the end of the search, we concatenate the 5 best policies and apply them to train architectures that already perform well on SVHN using standard augmentation policies. For full training, we follow the common procedure mentioned in the Wide-ResNet paper  of using the core training set and the extra data. The validation set is constructed by setting aside the last 7325 samples of the training set. We tune the weight decay and learning rate on the validation set performance. Other hyperparameters and training details are identical to the those in the papers introducing the models [67, 17]. One exception is that we trained the Shake-Shake model only for 160 epochs (as opposed to 1,800), due to the large size of the full SVHN dataset. Baseline pre-processing involves standardizing the data and applying Cutout with a region size of 20x20 pixels, following the procedure outlined in . AutoAugment results combine the baseline pre-processing with the policy learned on SVHN. One exception is that we do not use Cutout on reduced SVHN as it lowers the accuracy significantly. The summary of the results in this experiment are shown in Table 2. As can be seen from the table, we achieve state-of-the-art accuracy using both models.
We also test the best policies on reduced SVHN (the same 1,000 example training set where the best policies are found). AutoAugment results on the reduced set are again comparable to the leading semi-supervised methods, which range from 5.42% to 3.86% . (see Table 2). We see again that AutoAugment leads to more significant improvements on the reduced dataset than the full dataset.
4.2 ImageNet Results
Similar to above experiments, we use a reduced subset of the ImageNet training set, with 120 classes (randomly chosen) and 6,000 samples, to search for policies. We train a Wide-ResNet 40-2 using cosine decay for 200 epochs. A weight decay of was used along with a learning rate of 0.1. The best policies found on ImageNet are similar to those found on CIFAR-10, focusing on color-based transformations. One difference is that a geometric transformation, Rotate, is commonly used on ImageNet policies. One of the best policies is visualized in Figure 3.
Again, we combine the 5 best policies for a total of 25 sub-policies to create the final policy for ImageNet training. We then train on the full ImageNet from scratch with this policy using the ResNet-50 and ResNet-200 models for 270 epochs. We use a batch size of 4096 and a learning rate of 1.6. We decay the learning rate by 10-fold at epochs 90, 180, and 240. For baseline augmentation, we use the standard Inception-style pre-processing which involves scaling pixel values to [-1,1], horizontal flips with 50% probability, and random distortions of colors [22, 59]. For models trained with AutoAugment, we use the baseline pre-processing and the policy learned on ImageNet. We find that removing the random distortions of color does not change the results for AutoAugment.
|ResNet-50||76.3 / 93.1||77.6 / 93.8|
|ResNet-200||78.5 / 94.2||80.0 / 95.0|
|AmoebaNet-B (6,190)||82.2 / 96.0||82.8 / 96.2|
|AmoebaNet-C (6,228)||83.1 / 96.1||83.5 / 96.5|
Our ImageNet results are shown in Table 3. As can be seen from the results, AutoAugment improves over the widely-used Inception Pre-processing  across a wide range of models, from ResNet-50 to the state-of-art AmoebaNets . Secondly, applying AutoAugment to AmoebaNet-C improves its top-1 and top-5 accuracy from 83.1% / 96.1% to 83.5% / 96.5%. This improvement is remarkable given that the best augmentation policy was discovered on 5,000 images. We expect the results to be even better when more compute is available so that AutoAugment can use more images to discover even better augmentation policies. The accuracy of 83.5% / 96.5% is also the new state-of-art top-1/top-5 accuracy on this dataset (without multicrop / ensembling).
4.3 The Transferability of Learned Augmentation policies to Other Datasets
In the above, we applied AutoAugment directly to find augmentation policies on the dataset of interest (AutoAugment-direct). In many cases, such application of AutoAugment can be resource-intensive. Here we seek to understand if it is possible to transfer augmentation policies from one dataset to another (which we call AutoAugment-transfer). If such transfer happens naturally, the resource requirements won’t be as intensive as applying AutoAugment directly. Also if such transfer happens naturally, we also have clear evidence that AutoAugment does not “overfit” to the dataset of interest and that AutoAugment indeed finds generic transformations that can be applied to all kinds of problems.
To evaluate the transferability of the policy found on ImageNet, we use the same policy that is learned on ImageNet (and used for the results on Table 3) on five FGVC datasets with image size similar to ImageNet. These datasets are challenging as they have relatively small sets of training examples while having a large number of classes.
For all of the datasets listed in Table 4, we train a Inception v4  for 1,000 epochs, using a cosine learning rate decay with one annealing cycle. The learning rate and weight decay are chosen based on the validation set performance. We then combine the training set and the validation set and train again with the chosen hyperparameters. The image size is set to 448x448 pixels. The policies found on ImageNet improve the generalization accuracy of all of the FGVC datasets significantly. To the best of our knowledge, our result on the Stanford Cars dataset is the lowest error rate achieved on this dataset although we train the network weights from scratch. Previous state-of-the-art fine-tuned pre-trained weights on ImageNet and used deep layer aggregation to attain a 5.9% error rate .
In this section, we compare our search to previous attempts at automated data augmentation methods. We also discuss the dependence of our results on some of the design decisions we have made through several ablation experiments.
AutoAugment vs. other automated data augmentation methods: Most notable amongst many previous data augmentation methods is the work of . The setup in  is similar to GANs : a generator learns to propose augmentation policy (a sequence of image processing operations) such that the augmented images can fool a discriminator. The difference of our method to theirs is that our method tries to optimize classification accuracy directly whereas their method just tries to make sure the augmented images are similar to the current training images.
To make the comparison fair, we carried out experiments similar to that described in . We trained a ResNet-32 and a ResNet-56 using the same policy from Section 4.1, to compare our method to the results from . By training a ResNet-32 with Baseline data augmentation, we achieve the same error as  did with ResNet-56 (called Heur. in ). For this reason, we trained both a ResNet-32 and a ResNet-56. We show that for both models, AutoAugment leads to higher improvement (3.0%).
Relation between training steps and number of sub-policies: An important aspect of our work is the stochastic application of sub-policies during training. Every image is only augmented by one of the many sub-policies available in each mini-batch, which itself has further stochasticity since each transformation has a probability of application associated with it. We find that this stochasticity requires a certain number of epochs per sub-policy for AutoAugment to be effective. Since the child models are each trained with 5 sub-policies, they need to be trained for more than 80-100 epochs before the model can fully benefit from all of the sub-policies. This is the reason we choose to train our child models for 120 epochs. Each sub-policy needs to be applied a certain number of times before the model benefits from it. After the policy is learned, the full model is trained for longer (e.g. 1800 epochs for Shake-Shake on CIFAR-10, and 270 epochs for ResNet-50 on ImageNet), which allows us to use more sub-policies.
Transferability across datasets and architectures: It is important to note that the policies described above transfer well to many model architectures and datasets. For example, the policy learned on Wide-ResNet-40-2 and reduced CIFAR-10 leads to the improvements described on all of the other model architectures trained on full CIFAR-10 and CIFAR-100. Similarly, a policy learned on Wide-ResNet-40-2 and reduced ImageNet leads to significant improvements on Inception v4 trained on FGVC datasets that have different data and class distributions. AutoAugment policies are never found to hurt the performance of models even if they are learned on a different dataset, which is not the case for Cutout on reduced SVHN (Table 2). We present the best policy on ImageNet and SVHN in the Appendix, which can hopefully help researchers improve their generalization accuracy on relevant image classification tasks.
Despite the observed transferability, we find that policies learned on data distributions closest to the target yield the best performance: when training on SVHN, using the best policy learned on reduced CIFAR-10 does slightly improve generalization accuracy compared to the baseline augmentation, but not as significantly as applying the SVHN-learned policy.
5.1 Ablation experiments
Changing the number of sub-policies: Our hypothesis is that as we increase the number of sub-policies, the neural network is trained on the same points with a greater diversity of augmentation, which should increase the generalization accuracy. To test this hypothesis, we investigate the average validation accuracy of fully-trained Wide-ResNet-28-10 models on CIFAR-10 as a function of the number of sub-policies used in training. We randomly select sub-policy sets from a pool of 500 good sub-policies, and train the Wide-ResNet-28-10 model for 200 epochs with each of these sub-policy sets. For each set size, we sampled sub-policies five different times for better statistics. The training details of the model are the same as above for Wide-ResNet-28-10 trained on CIFAR-10. Figure 4 shows the average validation set accuracy as a function of the number of sub-policies used in training, confirming that the validation accuracy improves with more sub-policies up to about 20 sub-policies.
Randomizing the probabilities and magnitudes in the augmentation policy: We take the AutoAugment policy on CIFAR-10 and randomize the probabilities and magnitudes of each operation in it. We train a Wide-ResNet-28-10 
, using the same training procedure as before, for 20 different instances of the randomized probabilities and magnitudes. We find the average error to be 3.0% (with a standard deviation of 0.1%), which is 0.4% worse than the result achieved with the original AutoAugment policy (see Table2).
Performance of random policies: Next, we randomize the whole policy, the operations as well as the probabilities and magnitudes. Averaged over 20 runs, this experiment yields an average accuracy of 3.1% (with a standard deviation of 0.1%), which is slightly worse than randomizing only the probabilities and magnitudes. The best random policy achieves achieves an error of 3.0% (when average over 5 independent runs). This shows that even AutoAugment with randomly sampled policy leads to appreciable improvements.
The ablation experiments indicate that even data augmentation policies that are randomly sampled from our search space can lead to improvements on CIFAR-10 over the baseline augmentation policy. However, the improvements exhibited by random policies are less than those shown by the AutoAugment policy ( vs. error rate). Furthermore, the probability and magnitude information learned within the AutoAugment policy seem to be important, as its effectiveness is reduced significantly when those parameters are randomized. We emphasize again that we trained our controller using RL out of convenience, augmented random search and evolutionary strategies can be used just as well. The main contribution of this paper is in our approach to data augmentation and in the construction of the search space; not in discrete optimization methodology.
We thank Alok Aggarwal, Gabriel Bender, Yanping Huang, Pieter-Jan Kindermans, Simon Kornblith, Augustus Odena, Avital Oliver, Colin Raffel, and Jonathan Shlens for helpful discussions. This work was done as part of the Google Brain Residency program (g.co/brainresidency).
-  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. Tensorflow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’16, pages 265–283, Berkeley, CA, USA, 2016. USENIX Association.
-  A. Antoniou, A. Storkey, and H. Edwards. Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340, 2017.
-  H. S. Baird. Document image defect models. In Structured Document Image Analysis, pages 546–556. Springer, 1992.
-  B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning. In International Conference on Learning Representations, 2017.
-  I. Bello, B. Zoph, V. Vasudevan, and Q. V. Le. Neural optimizer search with reinforcement learning. In International Conference on Machine Learning, 2017.
-  J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
-  A. Brock, T. Lim, J. M. Ritchie, and N. Weston. Smash: one-shot model architecture search through hypernetworks. In International Conference on Learning Representations, 2017.
D. Ciregan, U. Meier, and J. Schmidhuber.
Multi-column deep neural networks for image classification.
Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 3642–3649. IEEE, 2012.
-  E. D. Cubuk, B. Zoph, S. S. Schoenholz, and Q. V. Le. Intriguing properties of adversarial examples. arXiv preprint arXiv:1711.02846, 2017.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
-  T. DeVries and G. W. Taylor. Dataset augmentation in feature space. arXiv preprint arXiv:1702.05538, 2017.
-  T. DeVries and G. W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
-  T. Elsken, J.-H. Metzen, and F. Hutter. Simple and efficient architecture search for convolutional neural networks. arXiv preprint arXiv:1711.04528, 2017.
Y. Em, F. Gag, Y. Lou, S. Wang, T. Huang, and L.-Y. Duan.
Incorporating intra-class variance to fine-grained visual recognition.In Multimedia and Expo (ICME), 2017 IEEE International Conference on, pages 1452–1457. IEEE, 2017.
-  L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer vision and Image understanding, 106(1):59–70, 2007.
-  K. Fukushima and S. Miyake. Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and cooperation in neural nets, pages 267–285. Springer, 1982.
-  X. Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  D. Han, J. Kim, and J. Kim. Deep pyramidal residual networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6307–6315. IEEE, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  A. G. Howard. Some improvements on deep convolutional neural network based image classification. arXiv preprint arXiv:1312.5402, 2013.
-  J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
-  H. Inoue. Data augmentation by pairing samples for images classification. arXiv preprint arXiv:1801.02929, 2018.
-  K. Jarrett, K. Kavukcuoglu, Y. LeCun, et al. What is the best multi-stage architecture for object recognition? In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 2146–2153. IEEE, 2009.
-  S. Kornblith, J. Shlens, and Q. V. Le. Do better imagenet models transfer better? arXiv preprint arXiv:1805.08974, 2018.
-  J. Krause, J. Deng, M. Stark, and L. Fei-Fei. Collecting a large-scale dataset of fine-grained cars. In Second Workshop on Fine-Grained Visual Categorization, 2013.
-  A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
A. Krizhevsky, I. Sutskever, and G. E. Hinton.
Imagenet classification with deep convolutional neural networks.In Advances in Neural Information Processing Systems, 2012.
-  M. Kumar, G. E. Dahl, V. Vasudevan, and M. Norouzi. Parallel architecture and hyperparameter search via successive halving and classification. arXiv preprint arXiv:1805.10255, 2018.
-  S. Laine and T. Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  J. Lemley, S. Bazrafkan, and P. Corcoran. Smart augmentation learning an optimal data augmentation strategy. IEEE Access, 5:5858–5869, 2017.
-  C. Liu, B. Zoph, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. arXiv preprint arXiv:1712.00559, 2017.
-  H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu. Hierarchical representations for efficient architecture search. In International Conference on Learning Representations, 2018.
-  I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
-  D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten. Exploring the limits of weakly supervised pretraining. arXiv preprint arXiv:1805.00932, 2018.
-  S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
-  H. Mania, A. Guy, and B. Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.
-  T. Miyato, S.-i. Maeda, M. Koyama, and S. Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. In International Conference on Learning Representations, 2016.
-  S. Mun, S. Park, D. K. Han, and H. Ko. Generative adversarial network based acoustic scene training set augmentation and selection using svm hyper-plane. In Detection and Classification of Acoustic Scenes and Events Workshop, 2017.
Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng.
Reading digits in natural images with unsupervised feature learning.
NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
-  M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Computer Vision, Graphics & Image Processing, 2008. ICVGIP’08. Sixth Indian Conference on, pages 722–729. IEEE, 2008.
-  A. Oliver, A. Odena, C. Raffel, E. D. Cubuk, and I. J. Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. arXiv preprint arXiv:1804.09170, 2018.
-  L. Perez and J. Wang. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621, 2017.
-  H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. In International Conference on Machine Learning, 2018.
-  A. J. Ratner, H. Ehrenberg, Z. Hussain, J. Dunnmon, and C. Ré. Learning to compose domain-specific transformations for data augmentation. In Advances in Neural Information Processing Systems, pages 3239–3249, 2017.
-  E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
-  E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. Le, and A. Kurakin. Large-scale evolution of image classifiers. In International Conference on Machine Learning, 2017.
-  B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. Do cifar-10 classifiers generalize to cifar-10? arXiv preprint arXiv:1806.00451, 2018.
-  M. Sajjadi, M. Javanmardi, and T. Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in Neural Information Processing Systems, pages 1163–1171, 2016.
-  I. Sato, H. Nishimura, and K. Yokoi. Apac: Augmented pattern classification with neural networks. arXiv preprint arXiv:1505.03229, 2015.
-  J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
-  P. Y. Simard, D. Steinkraus, J. C. Platt, et al. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of International Conference on Document Analysis and Recognition, 2003.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. Advances in Neural Information Processing Systems, 2015.
-  L. Sixt, B. Wild, and T. Landgraf. Rendergan: Generating realistic labeled data. arXiv preprint arXiv:1611.01331, 2016.
-  I. Sutskever, J. Schulman, T. Salimans, and D. Kingma. Requests For Research 2.0. https://blog.openai.com/requests-for-research-2, 2018.
C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi.
Inception-v4, inception-resnet and the impact of residual connections on learning.In AAAI, 2017.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, et al. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  A. Tarvainen and H. Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pages 1195–1204, 2017.
-  T. Tran, T. Pham, G. Carneiro, L. Palmer, and I. Reid. A bayesian data augmentation approach for learning deep models. In Advances in Neural Information Processing Systems, pages 2794–2803, 2017.
-  L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus. Regularization of neural networks using dropconnect. In International Conference on Machine Learning, pages 1058–1066, 2013.
-  L. Xie and A. Yuille. Genetic CNN. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5987–5995, 2017.
-  Y. Yamada, M. Iwamura, and K. Kise. Shakedrop regularization. arXiv preprint arXiv:1802.02375, 2018.
-  F. Yu, D. Wang, and T. Darrell. Deep layer aggregation. arXiv preprint arXiv:1707.06484, 2017.
-  S. Zagoruyko and N. Komodakis. Wide residual networks. In British Machine Vision Conference, 2016.
-  H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
-  Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.
-  X. Zhu, Y. Liu, Z. Qin, and J. Li. Data augmentation in emotion classification using generative adversarial networks. arXiv preprint arXiv:1711.00648, 2017.
-  B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2017.
-  B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.
Appendix A Supplementary materials for “AutoAugment: Learning Augmentation policies from Data”
|Operation Name||Description||Range of|
|ShearX(Y)||Shear the image along the horizontal (vertical) axis with rate magnitude.||[-0.3,0.3]|
|TranslateX(Y)||Translate the image in the horizontal (vertical) direction by magnitude number of pixels.||[-150,150]|
|Rotate||Rotate the image magnitude degrees.||[-30,30]|
|AutoContrast||Maximize the the image contrast, by making the darkest pixel black and lightest pixel white.|
|Invert||Invert the pixels of the image.|
|Equalize||Equalize the image histogram.|
|Solarize||Invert all pixels above a threshold value of magnitude.||[0,256]|
|Posterize||Reduce the number of bits for each pixel to magnitude bits.||[4,8]|
|Contrast||Control the contrast of the image. A magnitude=0 gives a gray image, whereas magnitude=1 gives the original image.||[0.1,1.9]|
|Color||Adjust the color balance of the image, in a manner similar to the controls on a colour TV set. A magnitude=0 gives a black & white image, whereas magnitude=1 gives the original image.||[0.1,1.9]|
|Brightness||Adjust the brightness of the image. A magnitude=0 gives a black image, whereas magnitude=1 gives the original image.||[0.1,1.9]|
|Sharpness||Adjust the sharpness of the image. A magnitude=0 gives a blurred image, whereas magnitude=1 gives the original image.||[0.1,1.9]|
|Cutout [12, 69]||Set a random square patch of side-length magnitude pixels to gray.||[0,60]|
|Sample Pairing [24, 68]||Linearly add the image with another image (selected at random from the same mini-batch) with weight magnitude, without changing the label.||[0, 0.4]|
|Operation 1||Operation 2|
|Operation 1||Operation 2|
|Operation 1||Operation 2|