Neural networks are prone to overfitting when labeled data is limited. Regularization is one of the key components for preventing overfitting when training deep neural networks. Data augmentation serves as a form of regularization, and it can greatly reduce the chance of overfitting. By generating artificial training data via label-preserving transformations of existing training samples, data augmentation increases both the amount and the diversity of data [1, 2, 3].
Recently, AutoAugment [4] has been proposed to automatically search for better data augmentation policies that incorporate invariances and generalize well across different models and datasets. AutoAugment enriches the diversity of each policy by introducing a probability and a magnitude for each operation. It treats the problem of finding the best augmentation policy as a discrete search task and achieves state-of-the-art accuracy on CIFAR-10, CIFAR-100, SVHN, and ImageNet (without additional data). However, the probability and magnitude are discretized, which may lead to a sub-optimal solution. Discrete search is a common problem in reinforcement learning, and the search algorithm chosen by AutoAugment (an RNN controller trained with Proximal Policy Optimization) is only one of many available algorithms for finding promising policies. Moreover, the final policy found by AutoAugment contains some sub-policies that are rarely applied in practice because the probability of an operation is too small. We therefore hypothesize that these rarely used sub-policies could be replaced to obtain better performance if a better search algorithm is deployed.
Recently, Augmented Random Search (ARS) [5] has shown great potential for continuous control problems. In particular, ARS has been shown to match or exceed state-of-the-art sample efficiency on the MuJoCo locomotion benchmarks [6, 7]. Concretely, it is 15 times more computationally efficient than Evolution Strategies (ES) [8], the fastest competing method. Taking advantage of the high computational efficiency of ARS, we can explore the large policy space more thoroughly over many random seeds and hyper-parameter choices. With the aim of finding better augmentation policies, we therefore apply ARS to the policy search problem.
In more detail, we aim to replace the discrete search space with a continuous one while maintaining the efficiency of the search procedure. To achieve this goal, we first apply a sigmoid function to normalize the output. The normalized output is then divided into three categories: operation, probability, and magnitude. Because the policy space is continuous, each policy expresses its state more precisely than those generated by AutoAugment. With the proposed search approach, state-of-the-art accuracies are achieved on CIFAR-10, CIFAR-100, and ImageNet (without additional data). On CIFAR-10, we achieve an error rate of 1.26, which is 0.22 better than the state of the art [4]. On CIFAR-100, we reduce the error rate of AutoAugment from 10.67 to 10.24. On ImageNet, we achieve a Top-1 accuracy of 83.88.
The rest of this paper is organized as follows. Section 2 describes the relationship between our work and previous related work. Section 3 presents the proposed approach, and Section 4 presents the quantitative comparison results. Section 5 concludes the paper.
2 Related Work
Data augmentation is widely used for visual recognition tasks, but the augmentation policies are usually designed manually. Popular data augmentation approaches currently include: (1) geometric transformations, such as scaling, shifting, rotation, flipping, and affine transformations (such as elastic distortions [2]); (2) sample cropping and interpolation, such as Random Erasing [9] and Mixup [10]; and (3) sample synthesis using Generative Adversarial Networks [11]. In [12], the authors proposed performing augmentation in feature space. All of these manual augmentation techniques are forms of label-preserving data augmentation and rely heavily on expert knowledge. AutoAugment [4] was proposed to find policies from the data automatically. This paper aims to improve the performance of AutoAugment by replacing its discrete search space with a continuous one.
We follow the policy definition of AutoAugment: each sub-policy consists of two image operations applied in sequence, and each operation is associated with two hyper-parameters: (1) the probability of applying the operation, and (2) the magnitude of the operation [4]. The problem of augmentation policy search can then be formulated as a continuous search problem. Notably, the operations are applied in the specified order, and the search space for the two hyper-parameters is continuous, which allows a more diverse set of candidates to be sampled during the search.
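The sub-policy definition above can be illustrated with a minimal sketch. The two toy operations, the `OPERATIONS` table, and the representation of an image as a flat list of intensities in [0, 1] are assumptions made for illustration only; they are not the operation set used in the paper.

```python
import random

def invert(pixels, magnitude):
    """Toy operation (assumed): flip intensities in [0, 1]; magnitude is unused."""
    return [1.0 - p for p in pixels]

def brighten(pixels, magnitude):
    """Toy operation (assumed): shift intensities by magnitude, clipped to [0, 1]."""
    return [min(1.0, max(0.0, p + magnitude)) for p in pixels]

# Hypothetical lookup table from operation name to implementation.
OPERATIONS = {"Invert": invert, "Brightness": brighten}

def apply_sub_policy(pixels, sub_policy, rng=random):
    """Apply the operations of one sub-policy in the given order.

    Each entry is (operation name, probability, magnitude); an operation
    is applied only when a uniform draw falls below its probability.
    """
    for name, probability, magnitude in sub_policy:
        if rng.random() < probability:
            pixels = OPERATIONS[name](pixels, magnitude)
    return pixels

# A sub-policy with two operations: the first always fires, the second never does.
sub_policy = [("Invert", 1.0, 0.0), ("Brightness", 0.0, 0.3)]
result = apply_sub_policy([0.0, 0.25, 1.0], sub_policy)
```

Because each operation fires independently with its own probability, the same sub-policy can produce different augmented images for the same input across epochs.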
The hyper-parameters of the search are the step size, the number of directions sampled per iteration N, the standard deviation of the exploration noise, and the number of top-performing directions to use.
We denote our method as ARS-Aug; it builds on successful heuristics employed in deep reinforcement learning. The idea of ARS-Aug is to search for the best policy directly on a sphere in parameter space. Concretely, it collects reward signals along a series of directions in parameter space and then optimizes the step along each direction to form the best policy. The reward signal is the generalization accuracy of a “child model” (a small neural network, chosen to reduce training time). Data-parallel training across multiple GPUs using Ray is exploited to speed up child model training and to collect more reward signals. To optimize the step along each direction, ARS-Aug weights each perturbation direction by the difference between the rewards obtained with the positive and the negative perturbation along that direction. This difference quantifies how far to move in that direction. In addition, we improve the update by discarding the directions that yield the least improvement in reward, so that the update step is an average over the directions that obtained high rewards.
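The update described above can be sketched as one iteration of basic random search with top-direction selection, in the style of ARS [5]. This is a sketch under stated assumptions, not the authors' implementation: the function name `ars_step`, the default hyper-parameter values, and the quadratic toy reward (standing in for child-model accuracy) are all hypothetical.

```python
import numpy as np

def ars_step(theta, reward_fn, rng, n_directions=8, top_b=4,
             step_size=0.02, noise_std=0.03):
    """One random-search update of the parameter vector theta."""
    # Sample N random directions and evaluate the reward on both sides.
    deltas = rng.standard_normal((n_directions, theta.size))
    r_plus = np.array([reward_fn(theta + noise_std * d) for d in deltas])
    r_minus = np.array([reward_fn(theta - noise_std * d) for d in deltas])

    # Keep only the top_b directions, ranked by the better of their two rewards.
    order = np.argsort(-np.maximum(r_plus, r_minus))[:top_b]

    # Scale the step by the standard deviation of the rewards that are used.
    used = np.concatenate([r_plus[order], r_minus[order]])
    sigma_r = used.std() + 1e-8  # small constant avoids division by zero

    # Weight each kept direction by its reward difference and average.
    weighted = (r_plus[order] - r_minus[order])[:, None] * deltas[order]
    return theta + step_size / (top_b * sigma_r) * weighted.sum(axis=0)

# Toy usage: the reward is higher the closer theta is to a fixed target.
rng = np.random.default_rng(0)
target = np.full(4, 0.5)
toy_reward = lambda p: -float(np.sum((p - target) ** 2))

theta = np.zeros(4)
initial_reward = toy_reward(theta)
for _ in range(100):
    theta = ars_step(theta, toy_reward, rng)
final_reward = toy_reward(theta)
```

In the setting of this paper, `reward_fn` would correspond to training a child model with the decoded policy and returning its validation accuracy, which is far more expensive than the toy reward used here.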
In order to collect the reward signals, we need to convert the output of a policy into a data augmentation policy. The output is generated by perturbing the policy and applying a sigmoid function for normalization. The output is a 30-dimensional vector, which needs to be converted into five sub-policies, each with two operations, where each operation requires an operation type, a probability, and a magnitude. To do so, we first split the 30-dimensional vector into ten 3-dimensional vectors. For each 3-dimensional vector, the three dimensions stand for the operation type, the probability, and the magnitude, respectively. For the operation type, we discretize the output space by dividing the interval [0,1] into 16 parts and map the value to the identifier of the sub-interval that contains it. The probability of each operation is given directly by the second dimension. Similarly, the magnitude of the operation is obtained from the third dimension.
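The conversion described above can be sketched as follows. The function names are hypothetical, and two details are assumptions: the magnitude is passed through unchanged in [0, 1], since the exact magnitude mapping is not given in the text, and operation identifiers are taken to run from 0 to 15.

```python
import math

NUM_OPERATIONS = 16      # the interval [0, 1] is divided into 16 parts
VECTOR_LENGTH = 30       # 5 sub-policies x 2 operations x 3 values

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def decode_policy(raw_output):
    """Turn a 30-dimensional raw vector into five sub-policies of two operations."""
    if len(raw_output) != VECTOR_LENGTH:
        raise ValueError("expected a 30-dimensional vector")
    values = [sigmoid(v) for v in raw_output]

    operations = []
    for i in range(0, VECTOR_LENGTH, 3):
        op_value, probability, magnitude = values[i:i + 3]
        # Map the first value to the index of the sub-interval containing it;
        # min() keeps a value of exactly 1.0 inside the last sub-interval.
        op_id = min(int(op_value * NUM_OPERATIONS), NUM_OPERATIONS - 1)
        # Assumption: the magnitude is kept as a number in [0, 1].
        operations.append((op_id, probability, magnitude))

    # Group consecutive operations in pairs to form the sub-policies.
    return [operations[i:i + 2] for i in range(0, len(operations), 2)]

policy = decode_policy([0.0] * VECTOR_LENGTH)
```

A raw value of 0 is mapped by the sigmoid to 0.5, so every operation in this usage example falls into sub-interval 8 with probability 0.5 and magnitude 0.5.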
4 Experimental Results
In this section, we evaluate the performance of ARS-Aug on the CIFAR-10 [14], CIFAR-100 [14], and ImageNet [15] datasets. In our experiments, ARS-Aug is implemented as a parallel version using the Python library Ray [13]. A shared noise table storing independent standard normal entries is used to avoid the computational bottleneck of communicating perturbations: the workers only need to communicate indices into the shared noise table. We also set the random seeds for the generators of all the workers; the seeds are distinct from each other so that the workers draw diverse samples. We repeated the training process 100 times with different random seeds and hyper-parameters for a thorough search over the policy space. The random seeds are sampled uniformly from the interval [0,1000) and are then fixed.
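The shared noise table idea can be sketched with NumPy alone. The class name, table size, and seeds below are assumptions for illustration; the distributed part (placing the table in shared memory and running workers with Ray) is omitted.

```python
import numpy as np

class SharedNoiseTable:
    """A fixed block of standard normal noise that every worker can rebuild or share."""

    def __init__(self, size=1_000_000, seed=0):
        # The same seed yields the same table, so an index identifies a perturbation.
        self.noise = np.random.default_rng(seed).standard_normal(size, dtype=np.float32)

    def sample_index(self, rng, dim):
        """Pick a start index such that a slice of length dim fits in the table."""
        return int(rng.integers(0, len(self.noise) - dim + 1))

    def get(self, index, dim):
        """Return the perturbation of length dim stored at index."""
        return self.noise[index:index + dim]

# A worker draws an index with its own generator and reads its perturbation.
table = SharedNoiseTable(size=1000, seed=0)
worker_rng = np.random.default_rng(42)   # each worker would use a distinct seed
index = table.sample_index(worker_rng, dim=30)
delta = table.get(index, dim=30)

# The learner only receives the index and recovers the same perturbation.
recovered = SharedNoiseTable(size=1000, seed=0).get(index, dim=30)
```

With this design a worker reports a single integer and its rewards instead of a full 30-dimensional perturbation, which is the communication saving the paragraph above refers to.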
4.1 CIFAR-10 and CIFAR-100 Results
As CIFAR-10 and CIFAR-100 have similar data distributions, we aim to find one augmentation policy that suits both datasets. Because the search process uses the child model’s accuracy as the reward signal, we build a reduced CIFAR-10 dataset of 4000 randomly selected examples to decrease the training time. However, the validation process is stochastic, since sub-policies are chosen at random and each operation is applied with some probability. To find a suitable number of epochs per sub-policy for ARS-Aug to be effective, we conduct a series of experiments to determine the most appropriate value. We find that 120 epochs are suitable for ARS-Aug to train the “child model” with five sub-policies; in other words, this training time lets the models fully benefit from all of the sub-policies. In addition, we fix the number of training epochs on the full datasets (e.g., 1800 epochs for Shake-Shake on CIFAR-10, and 270 epochs for ResNet-50 on ImageNet).
We now describe how ARS-Aug finds the best augmentation policy. For the child model, we use a small Wide-ResNet-40-2 model, chosen for computational efficiency, and train it for 120 epochs. The weight decay and the learning rate follow previously reported experimental settings, and we use a cosine learning rate decay with one annealing cycle [16].
It is worth noting that, in order to make full use of the augmentation policy, it is applied in addition to the standard baseline pre-processing: for each image, we first apply the baseline augmentation provided by the existing baseline methods, then the augmentation policy, and finally cutout.
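The order of the three stages can be sketched as below. The `cutout` helper is a simplified version written for illustration; the patch size of 16 pixels and the fill value of 0 are assumptions, and the baseline and policy stages are passed in as plain callables.

```python
import numpy as np

def cutout(image, size, rng):
    """Zero out a square patch centred at a random pixel (patch may be clipped at borders)."""
    height, width = image.shape[:2]
    cy, cx = int(rng.integers(0, height)), int(rng.integers(0, width))
    y0, y1 = max(0, cy - size // 2), min(height, cy + size // 2)
    x0, x1 = max(0, cx - size // 2), min(width, cx + size // 2)
    out = image.copy()          # leave the input image untouched
    out[y0:y1, x0:x1] = 0
    return out

def augment(image, baseline, policy, rng, cutout_size=16):
    """Baseline augmentation first, then the searched policy, then cutout."""
    return cutout(policy(baseline(image)), cutout_size, rng)

# Usage with identity stand-ins for the baseline and policy stages.
rng = np.random.default_rng(0)
image = np.ones((32, 32, 3), dtype=np.float32)
augmented = augment(image, baseline=lambda x: x, policy=lambda x: x, rng=rng)
```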
On CIFAR-10, the operations in the policies found by ARS-Aug do not differ substantially from those of AutoAugment. However, the probability of each operation has been optimized, in that no probability is close to zero. For example, the “Invert” operation does not appear in the concatenated policy, unlike in AutoAugment. This makes room for more meaningful operations and increases the diversity of the whole policy. In addition, the magnitude is more precise (two decimal places) than that of AutoAugment (one decimal place), which gives a finer-grained measure of the influence of each operation.
The importance of diversity in augmentation policies has been demonstrated in AutoAugment. The hypothesis that more sub-policies improve generalization accuracy has been validated [4]: the validation accuracy improves with more sub-policies, up to about 25 sub-policies. Therefore, we concatenate 25 sub-policies into a single policy to train on the full datasets.
We now show the advantage of the policies found by ARS-Aug on CIFAR-10. We choose six neural network architectures for a quantitative comparison with AutoAugment. To guarantee a fair comparison, we first find the weight decay and learning rate that give the best validation set accuracy with baseline augmentation. All other implementation details are the same as reported in the papers that introduce the corresponding models [17, 18, 19]. As shown in Table 1, ARS-Aug achieves a lower test set error than AutoAugment on all the tested models. In particular, we achieve an error rate of 1.26 with the ShakeDrop [19] model, which is 0.22 better than AutoAugment.
Table 1: Test set error rates on CIFAR-10 (lower is better).
|Model|AutoAugment|ARS-Aug|
|---|---|---|
|Shake-Shake (26 2x32d)|2.47|2.14|
|Shake-Shake (26 2x96d)|1.99|1.68|
|Shake-Shake (26 2x112d)|1.89|1.59|
|PyramidNet + ShakeDrop|1.48|1.26|
We also train three models on CIFAR-100 with the same policy found by ARS-Aug on reduced CIFAR-10; the results are shown in Table 2. Taking advantage of the sample efficiency of ARS, ARS-Aug achieves a lower error rate than AutoAugment.
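Table 2: Test set error rates on CIFAR-100 (lower is better).
|Model|AutoAugment|ARS-Aug|
|---|---|---|
|Shake-Shake (26 2x96d)|14.28|13.86|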
4.2 ImageNet Results
As in the above experiments, we apply the same method on ImageNet to find the best augmentation policies. All implementation details follow those of AutoAugment. The best policies found on ImageNet mainly consist of color-based and rotation transformations, similar to those found on CIFAR-10. We then concatenate the best five sub-policies for ImageNet training, using the same settings as AutoAugment. Table 3 shows that the accuracy improves on all models. To the best of our knowledge, the only better Top-1 error rate is that of [20], which takes advantage of a large amount of weakly labeled extra data. An illustrative example of selected policies from ARS-Aug is visualized in Fig. 1.
We have proposed ARS-Aug, an augmented random search method for finding better data augmentation policies than AutoAugment. The discrete search space of AutoAugment is replaced with a continuous one to improve search performance. By taking full advantage of the higher sample efficiency of ARS, ARS-Aug finds better data augmentation policies and achieves state-of-the-art accuracy on CIFAR-10, CIFAR-100, and ImageNet (without additional data). Our work still has some limitations. For example, the datasets we select are limited to the vision domain. In future work, we plan to apply our automatic augmentation approach to the audio/speech domain and to try other search strategies to improve performance and efficiency.
This work is partially supported by the National Natural Science Foundation of China (no. 61751208). The authors would like to acknowledge useful discussions and helpful suggestions from Zhifeng Gao (Microsoft Asia).
- [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
- [2] P. Y. Simard, D. Steinkraus, and J. C. Platt, “Best practices for convolutional neural networks applied to visual document analysis,” in International Conference on Document Analysis and Recognition, 2003, p. 958.
- [3] H. S. Baird, H. Bunke, and K. Yamamoto, Structured document image analysis. Springer Science & Business Media, 2012.
- [4] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, “Autoaugment: Learning augmentation policies from data,” arXiv preprint arXiv:1805.09501, 2018.
- [5] H. Mania, A. Guy, and B. Recht, “Simple random search provides a competitive approach to reinforcement learning,” arXiv preprint arXiv:1803.07055, 2018.
- [6] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016.
- [7] E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for model-based control,” in Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE, 2012, pp. 5026–5033.
- [8] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever, “Evolution strategies as a scalable alternative to reinforcement learning,” arXiv preprint arXiv:1703.03864, 2017.
- [9] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, “Random erasing data augmentation,” arXiv preprint arXiv:1708.04896, 2017.
- [10] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” arXiv preprint arXiv:1710.09412, 2017.
- [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
- [12] T. DeVries and G. W. Taylor, “Dataset augmentation in feature space,” arXiv preprint arXiv:1702.05538, 2017.
- [13] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, W. Paul, M. I. Jordan, and I. Stoica, “Ray: A distributed framework for emerging ai applications,” arXiv preprint arXiv:1712.05889, 2017.
- [14] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
- [15] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.
- [16] I. Loshchilov and F. Hutter, “Sgdr: Stochastic gradient descent with warm restarts,” arXiv preprint arXiv:1608.03983, 2016.
- [17] S. Zagoruyko and N. Komodakis, “Wide residual networks,” arXiv preprint arXiv:1605.07146, 2016.
- [18] X. Gastaldi, “Shake-shake regularization,” arXiv preprint arXiv:1705.07485, 2017.
- [19] Y. Yamada, M. Iwamura, and K. Kise, “Shakedrop regularization,” arXiv preprint arXiv:1802.02375, 2018.
- [20] D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten, “Exploring the limits of weakly supervised pretraining,” arXiv preprint arXiv:1805.00932, 2018.