1 Introduction
Neural networks are prone to overfitting when labeled data is limited. Regularization is one of the key components for preventing overfitting when training deep neural networks. Data augmentation serves as a form of regularization: by generating artificial training data via label-preserving transformations of existing training samples, it increases both the amount and the diversity of data and thereby greatly reduces the chance of overfitting [1, 2, 3]. Recently, AutoAugment [4]
has been proposed to automatically search for data augmentation policies that incorporate invariances and generalize well across different models and datasets. AutoAugment enriches the diversity of each policy by introducing a probability and a magnitude for each operation. It treats the problem of finding the best augmentation policy as a discrete search task and achieves state-of-the-art accuracy on CIFAR-10, CIFAR-100, SVHN, and ImageNet (without additional data). However, the probability and magnitude are divided in a discrete space, which may lead to a suboptimal solution. The discrete search task is a common problem in reinforcement learning, and the search algorithm chosen by AutoAugment (an RNN controller trained with Proximal Policy Optimization) is only one of many available algorithms for finding promising policies. Moreover, the final policy found by AutoAugment contains some sub-policies that are rarely applied in practice because the probability of an operation is too small. We therefore suppose that these rarely-used sub-policies could be substituted to obtain better performance if a better search algorithm were deployed.
Recently, the Augmented Random Search method (ARS) [5] has shown immense potential for continuous control problems. In particular, ARS has been shown to match or exceed state-of-the-art sample efficiency on the MuJoCo locomotion benchmarks [6, 7]. Concretely, it is 15 times more computationally efficient than evolution strategies (ES) [8], the fastest competing method. Taking advantage of the high computational efficiency of ARS, we can explore the large policy space more adequately over many random seeds and choices of hyperparameters. Naturally, with the aim of finding better augmentation policies, we explore applying ARS to the policy search problem.
In more detail, we aim to substitute the discrete search space with a continuous one while maintaining the efficiency of the search procedure. To achieve this, we first apply a sigmoid function to normalize the search output. The normalized output is then divided into three categories: operation, probability, and magnitude. Because of the continuous policy space, each policy expresses more precise states than those generated by AutoAugment. With the proposed search approach, state-of-the-art accuracies are achieved on CIFAR-10, CIFAR-100, and ImageNet (without additional data). On CIFAR-10, we achieve an error rate of 1.26%, which is 0.22% better than the state of the art [4]. On CIFAR-100, we improve the error rate of AutoAugment from 10.67% to 10.24%. On ImageNet, we achieve a Top-1 accuracy of 83.88%. The remainder of this paper is organized as follows: Section 2 describes the relationship between our work and previous related work. Section 3 presents the proposed approach, while Section 4 presents the quantitative comparison results. The conclusion is drawn in Section 5.
2 Related work
Data augmentation is widely used for visual recognition tasks, yet the augmentation policies are typically designed manually. Popular data augmentation approaches include: (1) geometric transformations, such as scaling, shifting, rotation, flipping, and affine transformations (e.g., elastic distortions [2]); (2) sample cropping and interpolation, such as Random Erasing [9] and Mixup [10]; (3) sample synthesis using Generative Adversarial Networks [11]. In [12], the authors propose to apply augmentation in feature space. All of these manual techniques are forms of label-preserving data augmentation and rely on heavy interaction with domain experts. AutoAugment [4] instead finds the policies from the data automatically. This paper aims to improve on AutoAugment by substituting its discrete search space with a continuous one.
3 Method
We follow the policy definition of AutoAugment: each sub-policy consists of two image operations applied in sequence, and each operation is associated with two hyperparameters: 1) the probability of applying the operation, and 2) the magnitude of the operation [4]. The problem of augmentation policy search can then be formulated as a continuous search problem. Notably, the operations are applied in the specified order and the search space for the two hyperparameters is continuous, which guarantees a diverse sampling process during the search.
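As a concrete illustration of this continuous parameterization, the sketch below decodes a raw parameter vector into sub-policies. This is a minimal sketch, not the paper's exact implementation: the operation list and all identifiers are ours, assuming the 16-operation pool and the sigmoid normalization described later in this section.

```python
import numpy as np

# Hypothetical 16-operation pool (the paper follows AutoAugment's operations).
OPS = ["ShearX", "ShearY", "TranslateX", "TranslateY", "Rotate", "Color",
       "Posterize", "Solarize", "Contrast", "Sharpness", "Brightness",
       "AutoContrast", "Invert", "Equalize", "Cutout", "SamplePairing"]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_policy(theta):
    """Map a raw 30-dim parameter vector to 5 sub-policies of 2 operations each."""
    z = sigmoid(np.asarray(theta, dtype=float))       # normalize to (0, 1)
    triples = z.reshape(10, 3)                        # (op type, prob, magnitude)
    ops = []
    for t, p, m in triples:
        op_id = min(int(t * len(OPS)), len(OPS) - 1)  # 16 equal sub-intervals
        ops.append((OPS[op_id], round(p, 2), round(m, 2)))
    # group consecutive pairs of operations into 5 sub-policies
    return [ops[2 * i : 2 * i + 2] for i in range(5)]

policy = decode_policy(np.random.randn(30))
```

Because the probabilities and magnitudes are read off directly from the sigmoid-normalized coordinates, they can take any value in (0, 1) rather than being restricted to a discrete grid.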
The hyperparameters of ARS are: the step size α, the number of directions sampled per iteration N, the standard deviation of the exploration noise ν, and the number of top-performing directions to use b. At each iteration j, the sampled directions δ_k are ranked by the larger of the two rewards collected along them,

max{ r(θ_j + ν δ_k), r(θ_j − ν δ_k) },    (1)

and the policy parameters are updated using only the b best directions:

θ_{j+1} = θ_j + (α / (b σ_R)) Σ_{k=1}^{b} [ r(θ_j + ν δ_{(k)}) − r(θ_j − ν δ_{(k)}) ] δ_{(k)},    (2)

where δ_{(k)} denotes the k-th best direction and σ_R is the standard deviation of the 2b rewards used in the update [5].
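One ARS update step can be sketched as follows. This is an illustrative sketch of the ARS V1-t style update from [5], assuming a caller-supplied `reward(theta)` function; the hyperparameter defaults and all names are ours.

```python
import numpy as np

def ars_step(theta, reward, alpha=0.02, n_dirs=8, nu=0.03, b=4, rng=None):
    """One Augmented Random Search update (top-b directions, scaled by reward std)."""
    rng = rng or np.random.default_rng()
    deltas = rng.standard_normal((n_dirs, theta.size))
    # collect rewards along the positive and negative perturbations
    r_plus = np.array([reward(theta + nu * d) for d in deltas])
    r_minus = np.array([reward(theta - nu * d) for d in deltas])
    # keep only the b directions whose best reward (either sign) is highest
    top = np.argsort(np.maximum(r_plus, r_minus))[-b:]
    sigma_r = np.concatenate([r_plus[top], r_minus[top]]).std() + 1e-8
    # weight each kept direction by its reward difference, then average
    step = (r_plus[top] - r_minus[top]) @ deltas[top]
    return theta + alpha / (b * sigma_r) * step
```

Discarding the lowest-performing directions (keeping only `top`) implements the mechanism described below: the update is averaged only over directions that obtained high rewards.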
We denote our method ARS-Aug; it builds on successful heuristics employed in deep reinforcement learning. The idea of ARS-Aug is to search for the best policy directly on a sphere in parameter space: it collects reward signals along a series of directions in parameter space and then optimizes the step along each direction to form the best policy. The reward signal is the generalization accuracy of a "child model" (a small neural network that decreases the training time). Data-parallel training across multiple GPUs using Ray [13] is exploited to speed up the child-model training and collect more reward signals. To optimize the steps along each direction, ARS-Aug weights each perturbation direction by the difference of the rewards collected along its positive and negative perturbations; this difference quantifies how far to move in that direction. In addition, we improve the updating process by discarding the update steps along the directions that yield the least improvement in reward, which guarantees that the update is an average over directions that obtained high rewards. In order to collect the reward signals, we need to translate the output of a policy into a data augmentation policy. The output is generated by perturbing the policy parameters and normalizing with a sigmoid function. The output
is a 30-dimensional vector, which is translated into five sub-policies, each with two operations, and each operation requiring an operation type, a probability, and a magnitude. In detail, we first split the 30-dimensional vector into ten 3-dimensional vectors. For each 3-dimensional vector, the three dimensions stand for the operation type, the probability, and the magnitude, respectively. For the operation type, we discretize the output space by dividing the interval [0, 1] into 16 equal parts and mapping the value to the identifier of its sub-interval. The probability of each operation is represented directly by the second dimension, and the magnitude of the operation is obtained from the third dimension in the same way.
4 Experimental Results
In this section, we evaluate the performance of ARS-Aug on the CIFAR-10 [14], CIFAR-100 [14], and ImageNet [15] datasets. In our experiments, ARS-Aug is implemented in a parallel version using the Python library Ray [13]. A shared noise table storing independent standard normal entries is used to avoid the computational bottleneck of communicating perturbations: workers communicate only indices into the shared noise table. We also set distinct random seeds for the generators of all workers to obtain diverse samples. We repeat the training process 100 times with different random seeds and hyperparameters for a thorough search over the policy space. The random seeds are sampled uniformly from the interval [0, 1000) and then fixed.
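The shared-noise-table trick can be sketched as below. This is a simplified single-process illustration with names of our choosing; in the actual setup the table would be shared across Ray workers (e.g. via the object store), so that a worker sends a single integer index instead of a full perturbation vector.

```python
import numpy as np

class SharedNoiseTable:
    """A fixed block of Gaussian noise; workers exchange (index, size) pairs
    instead of whole perturbation vectors, since every worker holds the same
    seeded table."""
    def __init__(self, size=25_000_000, seed=123):
        self.noise = np.random.default_rng(seed).standard_normal(
            size, dtype=np.float32)

    def get(self, index, dim):
        # reconstruct a perturbation vector from its start index
        return self.noise[index : index + dim]

    def sample_index(self, dim, rng):
        # any valid start index identifies a dim-length perturbation
        return int(rng.integers(0, len(self.noise) - dim + 1))

table = SharedNoiseTable(size=1_000_000)
rng = np.random.default_rng(0)
idx = table.sample_index(30, rng)   # a worker communicates only this integer
delta = table.get(idx, 30)          # any other worker can reconstruct delta
```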
4.1 CIFAR-10 and CIFAR-100 Results
As CIFAR-10 and CIFAR-100 have similar data distributions, we aim to find an augmentation policy that suits both datasets. Because the search process takes the child model's accuracy as the reward signal, we establish a reduced CIFAR-10 dataset of 4,000 randomly selected examples to decrease the training time. The validation process is stochastic, however, due to the random choice of sub-policies and each operation's applied probability. To find a suitable number of epochs per sub-policy for ARS-Aug to be effective, we conduct a series of experiments and find that 120 epochs are suitable for training the child model with five sub-policies; this training time lets the model fully benefit from all of the sub-policies. We also fix the training epochs for the full datasets (e.g., 1800 epochs for Shake-Shake on CIFAR-10 and 270 epochs for ResNet-50 on ImageNet).
We now introduce the details of how ARS-Aug finds the best augmentation policy. For the child model architecture we use a small Wide-ResNet-40-2 model, trained for 120 epochs; the small Wide-ResNet is chosen for computational efficiency. We follow the experimental settings of [16], including the weight decay, the learning rate, and a cosine learning-rate decay with one annealing cycle.
It is worth noting that, in order to make full use of the augmented policies, the learned policy is applied in addition to the standard baseline preprocessing: for each image, we first apply the baseline augmentation provided by the existing baseline methods, then apply the learned policy, and then apply cutout.
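The preprocessing order above can be sketched as follows. This is a minimal sketch under our own naming: `baseline_fn`, `cutout_fn`, and the `op_fns` mapping from operation names to image transforms are hypothetical placeholders for the actual implementations.

```python
import random

def apply_subpolicy(img, subpolicy, op_fns):
    """Apply one sub-policy: each (op, prob, magnitude) fires independently
    with its own probability."""
    for name, prob, mag in subpolicy:
        if random.random() < prob:
            img = op_fns[name](img, mag)
    return img

def augment(img, policy, baseline_fn, cutout_fn, op_fns):
    """Order used in the paper: baseline preprocessing, then one randomly
    chosen learned sub-policy, then cutout."""
    img = baseline_fn(img)                            # e.g. pad-and-crop + flip
    img = apply_subpolicy(img, random.choice(policy), op_fns)
    return cutout_fn(img)
```

Sampling one sub-policy per image (`random.choice`) is what makes the concatenated policy stochastic at training time.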
On CIFAR-10, the operations of the policies found by ARS-Aug show no major difference from those of AutoAugment. However, the probability of each operation has been better optimized: no probabilities close to zero remain. For example, the "Invert" operation does not appear in the concatenated policy, in contrast to AutoAugment; this makes room for more meaningful operations and increases the diversity of the whole policy. In addition, the magnitudes are more precise (two decimal places) than those of AutoAugment (one decimal place), giving a more precise measure of the influence of each operation.
The importance of diversity in augmentation policies has been demonstrated in AutoAugment: the hypothesis that more sub-policies improve generalization accuracy has been validated [4], with validation accuracy improving up to about 25 sub-policies. We therefore concatenate 25 sub-policies into a single policy for training on the full datasets.
We now show the advantage of the policies found by ARS-Aug on CIFAR-10. We choose six neural network architectures for a quantitative comparison with AutoAugment. To guarantee a fair comparison, we first find the weight decay and learning rate that give the best validation set accuracy with baseline augmentation. All other implementation details are the same as reported in the papers introducing the corresponding models [17, 18, 19]. As shown in Table 1, the test set accuracies of ARS-Aug beat AutoAugment on all tested models. In particular, we achieve an error rate of 1.26% with the ShakeDrop [19] model, which is 0.22% better than AutoAugment.
Table 1: CIFAR-10 test set error rates (%).
Model  AutoAugment  ARS-Aug
Wide-ResNet-28-10  2.68  2.33
Shake-Shake (26 2x32d)  2.47  2.14
Shake-Shake (26 2x96d)  1.99  1.68
Shake-Shake (26 2x112d)  1.89  1.59
AmoebaNet-B (6,128)  1.75  1.49
PyramidNet+ShakeDrop  1.48  1.26
We also train three models on CIFAR-100 with the same policy found by ARS-Aug on reduced CIFAR-10; the results are shown in Table 2. Taking advantage of its sample efficiency, ARS-Aug outperforms AutoAugment on error rate across all three models.
Table 2: CIFAR-100 test set error rates (%).
Model  AutoAugment  ARS-Aug
Wide-ResNet-28-10  17.09  16.64
Shake-Shake (26 2x96d)  14.28  13.86
PyramidNet+ShakeDrop  10.67  10.24
4.2 ImageNet Results
Similar to the above experiments, we apply the same method on ImageNet to find the best augmentation policies. All implementation details follow those of AutoAugment. The best policies found on ImageNet mainly focus on color-based and rotation transformations, similar to those found on CIFAR-10. We then concatenate the best five sub-policies for ImageNet training, again following the details of AutoAugment. From Table 3, we can see that the accuracies of all models improve. To the best of our knowledge, the only better Top-1 error rate is reported in [20], which takes advantage of a large amount of weakly labeled extra data. An illustrative example of policies selected by ARS-Aug is visualized in Fig. 1.
Table 3: ImageNet validation error rates (%), Top-1/Top-5.
Model  AutoAugment  ARS-Aug
ResNet-50  22.37/6.18  22.09/5.98
ResNet-200  20.00/4.99  19.76/4.67
AmoebaNet-B (6,190)  17.25/3.78  16.88/3.47
AmoebaNet-C (6,228)  16.46/3.52  16.12/3.28
5 Conclusion
We have proposed ARS-Aug, an augmented-random-search method for finding better data augmentation policies than AutoAugment. The discrete search space of AutoAugment is replaced with a continuous one to improve search performance. By fully exploiting the higher sample efficiency of ARS, ARS-Aug finds better data augmentation policies and achieves state-of-the-art accuracy on CIFAR-10, CIFAR-100, and ImageNet (without additional data). Our work still has limitations; for example, the datasets we consider are limited to the vision domain. In future work we plan to apply our automatic augmentation approach to the audio/speech domain and to try other search strategies to further improve performance and efficiency.
6 Acknowledgment
This work is partially supported by the National Natural Science Foundation of China (No. 61751208). The authors would like to acknowledge useful discussions and helpful suggestions from Zhifeng Gao (Microsoft Asia).
References

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[2] P. Y. Simard, D. Steinkraus, and J. C. Platt, "Best practices for convolutional neural networks applied to visual document analysis," in International Conference on Document Analysis and Recognition, 2003, p. 958.
[3] H. S. Baird, H. Bunke, and K. Yamamoto, Structured Document Image Analysis. Springer Science & Business Media, 2012.
[4] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, "AutoAugment: Learning augmentation policies from data," arXiv preprint arXiv:1805.09501, 2018.
[5] H. Mania, A. Guy, and B. Recht, "Simple random search provides a competitive approach to reinforcement learning," arXiv preprint arXiv:1803.07055, 2018.
[6] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI Gym," arXiv preprint arXiv:1606.01540, 2016.
[7] E. Todorov, T. Erez, and Y. Tassa, "MuJoCo: A physics engine for model-based control," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012, pp. 5026–5033.
[8] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever, "Evolution strategies as a scalable alternative to reinforcement learning," arXiv preprint arXiv:1703.03864, 2017.
[9] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, "Random erasing data augmentation," arXiv preprint arXiv:1708.04896, 2017.
[10] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization," arXiv preprint arXiv:1710.09412, 2017.
[11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[12] T. DeVries and G. W. Taylor, "Dataset augmentation in feature space," arXiv preprint arXiv:1702.05538, 2017.
[13] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, W. Paul, M. I. Jordan, and I. Stoica, "Ray: A distributed framework for emerging AI applications," arXiv preprint arXiv:1712.05889, 2017.
[14] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Tech. Rep., 2009.
[15] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
[16] I. Loshchilov and F. Hutter, "SGDR: Stochastic gradient descent with warm restarts," arXiv preprint arXiv:1608.03983, 2016.
[17] S. Zagoruyko and N. Komodakis, "Wide residual networks," arXiv preprint arXiv:1605.07146, 2016.
[18] X. Gastaldi, "Shake-shake regularization," arXiv preprint arXiv:1705.07485, 2017.
[19] Y. Yamada, M. Iwamura, and K. Kise, "ShakeDrop regularization," arXiv preprint arXiv:1802.02375, 2018.
[20] D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten, "Exploring the limits of weakly supervised pretraining," arXiv preprint arXiv:1805.00932, 2018.