Data Augmentation Revisited: Rethinking the Distribution Gap between Clean and Augmented Data

09/19/2019 ∙ by Zhuoxun He, et al. ∙ HUAWEI Technologies Co., Ltd. ∙ Shanghai Jiao Tong University

Data augmentation has been widely applied as an effective methodology to prevent over-fitting, in particular when training very deep neural networks. The essential benefit comes from introducing additional priors on visual invariance, thereby generating images that differ in appearance but contain the same semantics. Recently, researchers proposed a few powerful data augmentation techniques which indeed improved accuracy, yet we notice that these methods have also caused a considerable gap between clean and augmented data. This paper revisits this problem from an analytical perspective, for which we estimate the upper bound of the testing loss using two terms, named empirical risk and generalization error, respectively. Data augmentation significantly reduces the generalization error, but meanwhile leads to a larger empirical risk, which can be alleviated by a simple algorithm, i.e., using less-augmented data to refine the model trained on fully-augmented data. We validate our approach on a few popular image classification datasets including CIFAR and ImageNet, and demonstrate consistent accuracy gains. We also conjecture that this simple strategy implies a generalized approach to circumvent local minima, which is of value to future research on model optimization.





Introduction

Recent years have witnessed a rapid development of deep learning approaches. With the availability of powerful computational resources such as GPUs, it is possible to train very deep neural networks that have been shown to achieve good performance in a wide range of computer vision tasks including image classification, object detection, semantic segmentation, etc. On the other hand, complicated models often require a larger amount of training data, whereas collecting and annotating a large dataset is always expensive, in particular for fine-level recognition problems. Nowadays, the most popular dataset for image classification, ILSVRC2012 [Russakovsky et al.2015], contains only about 1.3 million training images, but it is widely used for training networks with tens of millions of parameters. This makes the training process ill-posed, and researchers have to use a set of tricks to prevent over-fitting.

As a useful and popular training trick, data augmentation is designed to increase the diversity of training data without actually collecting new data. Essentially, it is driven by the fact that a slight modification of an image, e.g., horizontal flip, cropping, affine transform, slight contrast adjustment, etc., will not change its high-level semantics. Under this assumption, people generate images using a combination of these modifications and feed them as ‘individual’ training data to prevent the model from sticking to a single view of each training sample. Standard data augmentation methods used in deep network training involve random flip and fixed-size cropping, and several sophisticated variants have been proposed, such as Mixup [Zhang et al.2017b], Cutout [DeVries and Taylor2017], and their combination, CutMix [Yun et al.2019]. Recently, researchers even designed an automated framework for learning an optimal set of augmentation parameters [Cubuk et al.2019], and achieved state-of-the-art performance on a few image classification benchmarks.

As more and more complicated data augmentation methods are being proposed, two questions remain mostly unexplored: is there a gap between the distributions of clean and augmented data, and how does such a gap impact performance in the testing scenario? This paper delves into these issues by bounding the testing loss from above with two terms, i.e., an empirical risk term and a generalization error term, with the former reflecting how well the trained model performs on training data, and the latter measuring the gap between training and testing data distributions. From this perspective, data augmentation is a double-edged sword: it reduces the generalization error through a larger coverage of the data space, but increases the empirical risk, in particular when the generated samples are not likely to occur in the testing scenario. This phenomenon is particularly significant now that recent augmentation methods have become more and more complicated, e.g., in Mixup [Zhang et al.2017b], each training image is a weighted overlay of two individual samples.

Motivated by this observation, we propose an effective training strategy to reduce the empirical risk. Our approach is simple and practical, and can be implemented in a few lines of code. The idea is to add a refinement stage with clean training data after the conventional training process in which augmented data are used. This ensures that no augmentation is used at the end of the training stage, which we believe makes it safer to transfer the trained model to the testing scenario. As shown in Figure 1, by plotting the curves during the refinement process, we observe that refining the models without intensive data augmentation decreases the empirical risk further, and meanwhile a lower testing loss is achieved.

Figure 1: Cross entropy loss curves in training PreAct-ResNet-18 on CIFAR100, with two popular data augmentation methods, namely, Mixup and AutoAugment. Mind the gap between clean and augmented training data, which reflects a high empirical risk brought by data augmentation. In the final 50 epochs (after the dashed line), data augmentation is removed (red curves are terminated) and lower testing losses (higher accuracy) are achieved.

We evaluate our training strategy on a few popular image classification benchmarks including CIFAR10, CIFAR100, Tiny-ImageNet, and ImageNet. On all these datasets, the refined model achieves consistent, sometimes significant, accuracy gains, with computational costs in training and testing remaining unchanged. Beyond image classification experiments, this research sheds light on a new optimization method, which involves using augmented data followed by clean data to optimize global and local properties, and is worth further study in the future.

Related Work

Statistical learning theory [Vapnik1998] proves that the generalization of a learning model is characterized by the capacity of the model and the amount of training data. Data augmentation methods have been applied widely in deep learning by creating artificial data to improve model robustness, especially in image classification [Krizhevsky, Sutskever, and Hinton2012, Szegedy et al.2015]. Basic image augmentation methods mainly include elastic distortions, scaling, translation, and rotation [Ciregan, Meier, and Schmidhuber2012, Sato, Nishimura, and Yokoi2015]. For natural image datasets, random cropping and horizontal flipping are common. In addition, scaling hue, saturation, and brightness also improves performance [He et al.2019]. These methods are usually designed to keep semantic information unchanged while generating varied images.

Recently, some methods to synthesize training data were proposed. Mixup [Zhang et al.2017b] combines two samples linearly at the pixel level, where the target of the synthetic image is the linear combination of the one-hot label encodings. To address the “manifold intrusion” caused by Mixup, additional neural networks are used to generate more sensible interpolation coefficients [Guo, Mao, and Zhang2019]. Manifold Mixup [Verma et al.2019] randomly performs the linear combination not only in the input layer but also in some hidden layers. In contrast, CutMix [Yun et al.2019] combines two samples by cutting and pasting patches. Compared to moderate data augmentation, augmented data generated by Mixup and its variants differs much more from the original data and looks unrealistic to human perception, even though those methods try to decrease the distribution gap between clean and synthetic data.

For different datasets, the best augmentation strategies can be different. Learning more sensible data augmentations for specific datasets has been explored [Lemley, Bazrafkan, and Corcoran2017, Tran et al.2017]. AutoAugment [Cubuk et al.2019] uses a search algorithm to find the best data augmentation strategies from a discrete search space that is composed of augmentation operations.

Data Augmentation Revisited

In this section, we revisit data augmentation by identifying the loss terms that compose the upper-bound of the testing loss. Then, we provide an explanation on how data augmentation works and how it magnifies the empirical risk, followed by a simple and practical approach to improve the performance of models trained by data augmentation.

Statistical Learning with Data Augmentation

The strategy of data augmentation can be conservative, in which only a small number of ‘safe’ operations such as horizontal flip and cropping are considered, or aggressive, in which ‘dangerous’ operations, or a series of operations, can be used to cause significant changes in image appearance. Here we briefly introduce several aggressive data augmentation methods, all of which were proposed recently and shown to be effective in image classification tasks.

Let $\mathcal{X}$ and $\mathcal{Y}$ be the data and label spaces, respectively. Each sample is denoted by $(x, y) \sim P$ with $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, where $P$ is the joint distribution of data and label.

Mixup, Manifold Mixup, and CutMix

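For two samples $(x_i, y_i)$ and $(x_j, y_j)$, the data generated by Mixup [Zhang et al.2017b] is obtained as follows:

$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j,$$

where $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ is the combination ratio and $\alpha$ is a hyperparameter. Manifold Mixup [Verma et al.2019] randomly performs the linear combination at an eligible layer, which can be the input layer or a hidden layer.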
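The pixel-level interpolation used by Mixup can be illustrated with a minimal sketch. It assumes flat Python lists for the image and the one-hot label, and that the combination ratio is drawn from a Beta distribution as in the Mixup paper; the function name `mixup` and the toy inputs are hypothetical and not taken from the authors' code.

```python
import random

def mixup(x_i, y_i, x_j, y_j, alpha=1.0, rng=random):
    """Blend two samples and their one-hot labels with a Beta-distributed ratio."""
    # lam ~ Beta(alpha, alpha); alpha is the Mixup hyperparameter (assumed > 0).
    lam = rng.betavariate(alpha, alpha)
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, lam

# Toy example: two 3-pixel "images" from two different classes.
rng = random.Random(0)
x_mix, y_mix, lam = mixup([0.0, 1.0, 0.5], [1.0, 0.0],
                          [1.0, 0.0, 0.5], [0.0, 1.0],
                          alpha=1.0, rng=rng)
print(lam, x_mix, y_mix)
```

The mixed label is a soft target whose entries sum to one, which is why such samples generally do not resemble any clean training image.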

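CutMix [Yun et al.2019] combines Cutout [DeVries and Taylor2017] and Mixup. For two samples $(x_i, y_i)$ and $(x_j, y_j)$, it provides a patch-wise, weighted overlay by:

$$\tilde{x} = M \odot x_i + (1 - M) \odot x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j,$$

where $M$ is a binary mask indicating the positions of drop-out and fill-in, $\odot$ denotes element-wise multiplication and $\lambda$ is the combination ratio.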
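A minimal sketch of the patch-wise overlay follows, assuming single-channel images stored as nested lists. The patch-size rule (area roughly equal to one minus the combination ratio) follows the CutMix paper; the function name `cutmix` and the toy images are hypothetical.

```python
import random

def cutmix(x_i, y_i, x_j, y_j, lam, rng=random):
    """Paste a rectangular patch of x_j into x_i; mix labels by the kept-area ratio."""
    h, w = len(x_i), len(x_i[0])
    # Patch size chosen so that its area is roughly (1 - lam) of the image.
    cut_h = int(h * (1.0 - lam) ** 0.5)
    cut_w = int(w * (1.0 - lam) ** 0.5)
    top = rng.randint(0, h - cut_h)
    left = rng.randint(0, w - cut_w)
    # Binary mask: 1 keeps the pixel of x_i, 0 takes the pixel of x_j.
    mask = [[0 if top <= r < top + cut_h and left <= c < left + cut_w else 1
             for c in range(w)] for r in range(h)]
    x_mix = [[mask[r][c] * x_i[r][c] + (1 - mask[r][c]) * x_j[r][c]
              for c in range(w)] for r in range(h)]
    # Recompute the ratio from the actual mask so the label matches the pixels.
    lam_eff = sum(map(sum, mask)) / (h * w)
    y_mix = [lam_eff * a + (1.0 - lam_eff) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, mask

# Toy example: a 4x4 image of ones receives a 2x2 patch from an image of zeros.
ones = [[1.0] * 4 for _ in range(4)]
zeros = [[0.0] * 4 for _ in range(4)]
x_mix, y_mix, mask = cutmix(ones, [1.0, 0.0], zeros, [0.0, 1.0],
                            lam=0.75, rng=random.Random(0))
print(y_mix)
```

Unlike Mixup, every pixel of the result comes from exactly one of the two source images, so only the pasted region departs from the clean sample.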


Instead of manually designing data augmentation tricks, AutoAugment [Cubuk et al.2019] applied an automated way of learning parameters for augmentation. A large space with different kinds of operations was pre-defined, and the policy of augmentation was optimized with reinforcement learning. That is to say, the function applied to each sample for augmentation can be even more complicated than those in conventional approaches.

Consider a regular statistical learning process. The goal of learning is to find a function $f$ which minimizes the expected value of a pre-defined loss term, $\ell(f(x), y)$, over the joint distribution $P$ of data $x$ and label $y$. This is known as the expected risk:

$$R(f) = \int \ell(f(x), y) \, \mathrm{d}P(x, y).$$

However, the data distribution is unknown in practical situations. Therefore, a common solution is the empirical risk minimization (ERM) principle [Vapnik1998], which optimizes an empirical risk on a training dataset $\{(x_i, y_i)\}_{i=1}^{n}$ that mimics the data distribution:

$$R_{\mathrm{emp}}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i).$$

The accuracy of estimating the expected risk goes up with $n$, the amount of training data. In practice, especially when there is limited data, increasing $n$ with data augmentation is a popular and effective solution. It defines a function $g$, which generates ‘new’ data $\tilde{x} = g(x)$ with a combination of operations on $x$. Since these operations do not change the semantics, $\tilde{x}$ naturally shares the same label with $x$, i.e., $\tilde{y} = y$. Note that data augmentation actually changes the distribution $P$, and we denote the new distribution by $P_{\mathrm{aug}}$, which is to say, the goal has been changed from minimizing $R_{\mathrm{emp}}(f)$ to minimizing $R_{\mathrm{aug}}(f)$:

$$R_{\mathrm{aug}}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(\tilde{x}_i), \tilde{y}_i).$$
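The difference between the empirical risk on clean data and on augmented data can be made concrete with a toy sketch. Everything here is hypothetical and chosen only for illustration: a fixed one-dimensional two-class model, a cross-entropy loss, and an aggressive augmentation `shrink` that keeps the label but moves every sample toward the decision boundary.

```python
import math

def cross_entropy(probs, label):
    """Negative log-probability assigned to the true label."""
    return -math.log(probs[label])

def empirical_risk(model, samples):
    """Average loss of `model` over a list of (x, y) pairs."""
    return sum(cross_entropy(model(x), y) for x, y in samples) / len(samples)

def toy_model(x):
    # Fixed 1-D classifier (hypothetical): class 1 becomes likely for x > 0.
    p1 = 1.0 / (1.0 + math.exp(-4.0 * x))
    return [1.0 - p1, p1]

def shrink(x):
    # Hypothetical aggressive augmentation: label is kept, input moves to the boundary.
    return 0.2 * x

clean = [(-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1)]
augmented = [(shrink(x), y) for x, y in clean]

risk_clean = empirical_risk(toy_model, clean)
risk_aug = empirical_risk(toy_model, augmented)
print(risk_clean, risk_aug)
```

The same function yields two different averages because the samples are drawn from different distributions, which is the gap the following analysis is concerned with.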

Rethinking the Mechanism of Data Augmentation

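According to VC theory [Vapnik1998], the consistency and generalization of the ERM principle have been justified theoretically. Consider a binary classifier $f \in \mathcal{F}$, which has finite VC-dimension $|\mathcal{F}|_{\mathrm{VC}}$. With probability $1 - \delta$, an upper bound of the expected risk is formulated by

$$R(f) \le R_{\mathrm{emp}}(f) + O\left(\left(\frac{|\mathcal{F}|_{\mathrm{VC}} - \log \delta}{n}\right)^{\alpha}\right), \qquad (3)$$

where $R(f)$ is the expected risk, $R_{\mathrm{emp}}(f)$ is the empirical risk on $n$ training samples, and $\frac{1}{2} \le \alpha \le 1$. In the simple (separable) case, $\alpha = 1$, and in the complex (non-separable) case, $\alpha = \frac{1}{2}$.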

According to Equation (3), sufficiently large training data helps to decrease the expected risk upper bound. Based on this theory, data augmentation creates sensible artificial data to increase the training data size.
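To see how the generalization term of the VC bound shrinks with the amount of training data, the following sketch evaluates it numerically. The constant hidden in the O(.) notation is ignored, and the VC-dimension, confidence level and sample sizes are hypothetical values picked only to show the trend.

```python
import math

def generalization_term(vc_dim, n, delta=0.05, alpha=0.5):
    """((|F|_VC - log(delta)) / n) ** alpha, up to the constant hidden in O(.)."""
    return ((vc_dim - math.log(delta)) / n) ** alpha

# Hypothetical setting: 50,000 clean samples versus ten times as many augmented ones.
small = generalization_term(vc_dim=1000, n=50_000)
large = generalization_term(vc_dim=1000, n=500_000)
print(small, large, small / large)
```

In the non-separable case (exponent 1/2), a tenfold increase in data reduces this term by a factor of the square root of ten, so the benefit grows slowly and can be outweighed by other terms of the bound.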

Assume that and are the same sample, if , where is a small value. Then there is a finite number of augmented training data. For the model trained over the augmented data distribution , we have


where is a finite constant. Here we suppose that the coefficient retains unchanged.

The expected risk upper bound is determined by the empirical risk and the generalization error. Although the generalization error in Equation (4) is smaller than that in Equation (3), there is a difference between the other risk terms. Assume that


Thus, the inequality is reformulated by


where is the approximation error caused by the distribution gap between and .

Equation (6) highlights that the benefits of learning with data augmentation arise due to three factors:

  1. the amount of augmented data being large,

  2. the convergences of the empirical risks on clean and augmented data being consistent, and

  3. the distribution gap between clean and augmented data being small.

Factor 1) is easy to understand. Factor 2) is related to optimization methods. Note that the model is directly optimized on the augmented empirical risk, not on the empirical risk over clean data. That means the best function found by the optimization procedure is not guaranteed to be the one that minimizes the clean empirical risk. Factor 3) is based on the assumption that the approximation error is positively correlated with the distribution gap between clean and augmented data.

Obviously, Factors 2) and 3) are related but not equivalent. When the clean and augmented data follow the same distribution, the approximation error vanishes. However, there is a non-negligible distribution gap for many data augmentation operations, such as the Invert transformation, Mixup, CutMix, and so on. Previous empirical results show their effectiveness on various datasets without regard to the distribution gap between clean and augmented data.

We conjecture that there is a trade-off among these factors, although they are not meant to contradict each other. All the data augmentation methods introduced above focus on Factors 1) and 2) but ignore Factor 3). The three mixup-based methods mostly explore additional data within a neighborhood of each training sample; therefore, the convergence of both the clean and the augmented empirical risk is mostly guaranteed. Similarly, the goal of AutoAugment is to optimize validation accuracy, which is closely related to the convergence of the empirical risk on clean data. However, none of these approaches addresses the difference between clean and augmented training data: mixup-based methods generate samples significantly different from a clean image via pixel-wise image overlay, and AutoAugment performs several operations, one by one, so that the generated image can be heavily edited. Consequently, all these methods can lead to a large distribution gap, which causes a large approximation error and impacts the convergence of the empirical risk on clean data.

Convergence of Empirical Risk

The effectiveness of intensive data augmentation raises the following question: how can one guarantee that the empirical risks on clean and augmented data converge consistently when a large distribution gap between clean and augmented data exists?

Let and

respectively denote latent space (feature space) and the feature vector of

. Suppose a perfect classifier is given by the true conditional distribution is . Consider the marginal conditional distribution , where represents some feature of sample, and .

For a given category, different features differ in their importance for accurate classification. If the marginal conditional distribution of a feature is uniform, that feature is uninformative for classification. We say that a feature is a major feature if the probability density of its conditional distribution concentrates on some point mass. Conversely, if the probability density is relatively uniform, the feature is a minor feature.

For a sample and its augmented counterpart, keeping the convergences of the clean and augmented empirical risks relatively consistent requires that their conditional distributions be similar:


Consider an intensive data augmentation method which brings a large distribution gap between clean and augmented data. Then, we have


where is a relative large value.

In this situation, features of the sample have changed a lot, but the label keeps the same. That means


where is a small value.

According to Equations (7)–(9), we give a general reasonable data augmentation method which satisfies:

  1. for major features, ,

  2. for minor features, ,

where is a small value, and .

These two inequalities highlight that the major features, which are important for classification, should be preserved as much as possible after data augmentation, whereas the minor features can be changed substantially. This is consistent with previous empirical results [Cubuk et al.2019].

For numeral recognition, the Invert transformation is used successfully, even though it changes the color of the numeral to one that does not appear in the original dataset. On the other hand, horizontal flipping, which is commonly used for natural images, is never used in numeral recognition. This is consistent with the prior knowledge that the relative color of the numeral and its background and the orientation of the numeral are major features, whereas the specific color of the numeral is a minor feature.

Refined Data Augmentation

Here we propose a simple approach called Refined Data Augmentation, which incorporates existing data augmentation methods, as described in Algorithm 1.

The motivation is that the large distribution gap caused by intensive data augmentation methods impacts the convergence of the empirical risk on clean data. On the assumption that data augmentation helps the DNN converge to a region close to the global minimum, the DNN refined without intensive data augmentation in the ending stage of training generalizes better on the test set.

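Input: A training set $D$; an intensive set of data augmentation operations $G$; augmentation and refinement epochs $T_{\mathrm{aug}}$ and $T_{\mathrm{ref}}$;
  for epoch in $1, \dots, T_{\mathrm{aug}}$ do
     Sampling an augmented training set: $\tilde{D} = \{g(x, y)\}$ with $g \in G$, $(x, y) \in D$;
     Training on the augmented training set;
  end for
  for epoch in $1, \dots, T_{\mathrm{ref}}$ do
     Refining on the original training set;
  end for
Output: A trained model $f$.
Algorithm 1 Training a Deep Neural Network with Refined Data Augmentation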
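The two-stage schedule of Algorithm 1 can be sketched in a framework-agnostic way. This is an illustration under stated assumptions, not the authors' implementation: `refined_training`, `run_epoch`, `make_jitter` and the toy regression task are all hypothetical, and a scalar weight fitted with plain SGD stands in for a deep network.

```python
import random

def refined_training(params, train_set, augment, run_epoch,
                     aug_epochs, refine_epochs, lr, refine_lr):
    """Two-stage schedule: augmented training, then refinement on clean data."""
    # Stage 1: re-sample an augmented copy of the training set every epoch.
    for _ in range(aug_epochs):
        augmented = [augment(x, y) for x, y in train_set]
        params = run_epoch(params, augmented, lr)
    # Stage 2: refine on the original samples with a small learning rate.
    for _ in range(refine_epochs):
        params = run_epoch(params, train_set, refine_lr)
    return params

# --- Toy instantiation (hypothetical): fit y = w * x with squared loss ---
def run_epoch(w, samples, lr):
    for x, y in samples:
        w -= lr * 2.0 * (w * x - y) * x  # SGD step on (w * x - y) ** 2
    return w

def clean_loss(w, samples):
    return sum((w * x - y) ** 2 for x, y in samples) / len(samples)

def make_jitter(seed, sigma=0.3):
    rng = random.Random(seed)
    return lambda x, y: (x + rng.gauss(0.0, sigma), y)  # perturbs x, keeps y

train_set = [(x, 2.0 * x) for x in (-1.0, -0.5, 0.5, 1.0)]
# Same seed for both runs, so the augmented stage is identical in each.
w_aug = refined_training(0.0, train_set, make_jitter(0), run_epoch, 50, 0, 0.1, 0.01)
w_ref = refined_training(0.0, train_set, make_jitter(0), run_epoch, 50, 20, 0.1, 0.01)
print(clean_loss(w_aug, train_set), clean_loss(w_ref, train_set))
```

In this toy setting the input jitter biases the fitted weight, and the refinement epochs on clean data pull it back, lowering the loss on the clean training set. In a deep learning framework, `run_epoch` would be replaced by an ordinary training epoch and `augment` by the intensive augmentation being studied.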

On one hand, at the beginning of training, a DNN that focuses on minor features such as texture, color, and background easily over-fits. Intensive data augmentation makes the DNN focus on the major features and thus avoid converging to a bad local minimum. On the other hand, the distribution gap between clean and augmented data can obstruct the convergence of the empirical risk on clean data, which in general remains relatively large when the model is trained with intensive data augmentation. Refining the DNN without intensive data augmentation guarantees the convergence of the clean empirical risk to a minimum, and reduces the impact of the distribution gap. Since the model has already crossed many bad local minima and reached a region close to a better minimum by optimizing the augmented empirical risk, refinement with a small learning rate will not bring the DNN back to these bad minima.

An Alternative Viewpoint: Regularization

The generalization ability of DNNs is greatly affected by optimization methods [Hardt, Recht, and Singer2016, Zhang et al.2017a]. Here we associate data augmentation with the optimization of DNNs.

Gradient-based optimization methods are used to minimize the empirical risk of a DNN, which has many local minima. Even with some regularization methods [Srivastava et al.2014, Ioffe and Szegedy2015], the function found by the gradient-based procedure is not guaranteed to be the best one. With gradient-based optimization, the DNN tends to first learn low-level features that are easier for machines to learn, such as size, color, and texture [Gatys, Ecker, and Bethge2015, Gatys, Ecker, and Bethge2017, Geirhos et al.2019]. More recently, Brendel and Bethge [Brendel and Bethge2019] found that the decision-making behavior of current neural network architectures is mainly based on relatively weak and local features, and that high-level features that can better improve model robustness are not sufficiently learned.

As discussed earlier, effective data augmentation methods try to make the major features of data invariant, while losing some minor features. Therefore, data augmentation highlights the major features. From this perspective, data augmentation as a regularization scheme imposes some constraints on the function space by prior knowledge, which forces the model to focus on the major features.

Figure 2: The contour maps of the empirical risks on the function space . Left: . Right:, where the “” denotes the global minimum of .

Figure 2 depicts the effect of data augmentation on the empirical risk over the function space. Data augmentation helps the neural network learn major features, which reduces the number of local minima and keeps the direction of convergence relatively consistent. Trained with clean data, the DNN easily converges to a local minimum that is far away from the global minimum. With effective data augmentation, the DNN converges to a region close to the global minimum.

This understanding of data augmentation coincides with the empirical results. In general, the function found by minimizing the augmented empirical risk generalizes better on the test dataset than the function found by minimizing the clean empirical risk, although its empirical risk on clean data is higher. Note that the latter function is a local minimum, but not the global minimum. It is reasonable to believe that the former is closer to the global minimum than the latter.


Experiments

We evaluate refined data augmentation on top of several augmentation methods, including Mixup [Zhang et al.2017b], Manifold Mixup [Verma et al.2019], CutMix [Yun et al.2019], and AutoAugment [Cubuk et al.2019], on common classification datasets: CIFAR10, CIFAR100, Tiny-ImageNet and the ImageNet-2012 classification dataset [Krizhevsky, Hinton, and others2009, Russakovsky et al.2015]. In addition, we study the distribution gap between clean and augmented data caused by various data augmentation methods. Analytical experiments are conducted to show the effectiveness of our method and the over-fitting of data augmentation methods.

CIFAR10 and CIFAR100

CIFAR10 consists of 60,000 color images in 10 classes, with 5,000 training images and 1,000 testing images per class. CIFAR100 is just like CIFAR10 but has 100 classes, each containing 500 training images and 100 testing images. On CIFAR10 and CIFAR100, we train two variants of residual networks [He et al.2016a]: PreActResNet-18 [He et al.2016b] and WideResNet-28-10 [Zagoruyko and Komodakis2016]. In addition, a stronger backbone, Shake-Shake [Gastaldi2017], is evaluated on CIFAR100. We partition the training procedure into two stages: training with intensive data augmentation and refinement.

Training with Intensive Data Augmentation

We train PreActResNet-18 and WideResNet-28-10 on a single GPU using PyTorch for 400 epochs on the training set with a mini-batch size of 128. For PreActResNet-18, the learning rate starts at 0.1 and is divided by 10 after 150 and 275 epochs, and weight decay is set to be . For WideResNet-28-10, the learning rate starts at 0.2 and is divided by 5 after 120, 240, and 320 epochs, except for AutoAugment, where a cosine learning rate decay [Keskar and Socher2017] is used, and weight decay is set to be . Following the settings of Zhang et al. [Zhang et al.2017b] and Cubuk et al. [Cubuk et al.2019], we do not use dropout in the experiments of Mixup, Manifold Mixup, and CutMix, and the drop rate is set to 0.3 in the experiments of AutoAugment for WideResNet-28-10. For the Shake-Shake model, we follow the training procedure of Gastaldi [Gastaldi2017] and Cubuk et al. [Cubuk et al.2019] to train the model on 2 GPUs for 1800 epochs with a mini-batch size of 128. The learning rate starts at 0.01 and is annealed using a cosine function without restart, and weight decay is set to be .

All intensive data augmentation methods are combined with standard data augmentation: random crops and horizontal flips with a probability of 50%. For the coefficient in Mixup and Manifold Mixup, . Following [Yun et al.2019], CutMix is applied with 50% probability during training. For AutoAugment, we first apply the standard data augmentation methods, then apply the AutoAugment policy, then apply Cutout with pixels [DeVries and Taylor2017] following Cubuk et al. [Cubuk et al.2019]. Note that we directly use the AutoAugment policies that are reported in the previous paper.
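The standard (moderate) augmentation mentioned here, a random crop followed by a horizontal flip with probability 0.5, can be sketched as below. The zero-padding of 4 pixels before cropping is a common CIFAR convention and an assumption of this sketch, since the padding is not stated in the text; the function name `random_crop_flip` is hypothetical.

```python
import random

def random_crop_flip(img, pad=4, p_flip=0.5, rng=random):
    """Zero-pad, crop back to the original size at a random offset, maybe flip."""
    h, w = len(img), len(img[0])
    blank = [[0.0] * (w + 2 * pad) for _ in range(pad)]
    middle = [[0.0] * pad + list(row) + [0.0] * pad for row in img]
    padded = blank + middle + [row[:] for row in blank]
    # Random top-left corner of the crop inside the padded image.
    top = rng.randint(0, 2 * pad)
    left = rng.randint(0, 2 * pad)
    crop = [row[left:left + w] for row in padded[top:top + h]]
    # Horizontal flip: reverse every row with probability p_flip.
    if rng.random() < p_flip:
        crop = [row[::-1] for row in crop]
    return crop

img = [[1.0, 2.0], [3.0, 4.0]]
print(random_crop_flip(img, pad=0, p_flip=1.0))
```

Both operations only shift or mirror the content, which is why they are treated as introducing a very small distribution gap compared with Mixup-style overlays.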


We refine the models without these intensive data augmentation methods. Since the standard data augmentation methods are moderate and bring only a very small distribution gap between clean and augmented data, we preserve the standard data augmentation when refining. For PreActResNet-18 and WideResNet-28-10, the models are refined for 50 epochs, and for Shake-Shake, the models are refined for 200 epochs. During refinement, the learning rate is kept at a small value. For the models trained with the step-wise learning rate decay, the learning rate is set to be the same as that in the final epoch of the last stage, and for the models trained with the cosine learning rate decay, the learning rate is adjusted to a reasonably small value.
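For the step-wise schedules, the refinement learning rate follows directly from the stated settings. The helper below is a hypothetical sketch that computes the learning rate in the final training epoch from the initial rate, the milestones and the division factor reported for PreActResNet-18 and WideResNet-28-10.

```python
def step_lr(epoch, base_lr, milestones, factor):
    """Learning rate at `epoch` (0-indexed) when divided by `factor` after each milestone."""
    passed = sum(epoch >= m for m in milestones)
    return base_lr / factor ** passed

# PreActResNet-18: starts at 0.1, divided by 10 after 150 and 275 epochs.
refine_lr_preact = step_lr(399, 0.1, (150, 275), 10)
# WideResNet-28-10: starts at 0.2, divided by 5 after 120, 240 and 320 epochs.
refine_lr_wrn = step_lr(399, 0.2, (120, 240, 320), 5)
print(refine_lr_preact, refine_lr_wrn)
```

Under these settings the refinement stage would run at roughly 0.001 for PreActResNet-18 and 0.0016 for WideResNet-28-10. The value used with the cosine schedule is not derivable this way, since the text only describes it as reasonably small.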

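| Models | Augmentation C10 | Augmentation C100 | w/ Refinement C10 | w/ Refinement C100 |
|---|---|---|---|---|
| PreActResNet-18 | | | | |
| Moderate | 94.6 | 75.4 | 94.5 | 76.0 |
| Mixup | 95.9 | 78.8 | 96.4 | 80.8 |
| Manifold Mixup | 95.8 | 80.4 | 96.0 | 81.5 |
| Cutmix | 96.3 | 79.8 | 96.4 | 80.6 |
| AutoAugment | 96.0 | 79.3 | 96.2 | 80.0 |
| WideResNet-28-10 | | | | |
| Moderate | 96.2 | 81.8 | - | - |
| Mixup | 97.2 | 82.5 | 97.5 | 84.6 |
| Manifold Mixup | 97.2 | 82.7 | 97.4 | 84.6 |
| Cutmix | 97.1 | 82.9 | 97.3 | 83.5 |
| AutoAugment | 97.3 | 83.8 | 97.5 | 84.7 |
| Shake-Shake (26 2x96d) | | | | |
| Moderate | - | 82.9 | - | - |
| Mixup | - | 84.2 | - | 85.2 |
| Cutmix | - | 84.5 | - | 84.7 |
| AutoAugment | - | 85.7 | - | 86.1 |

Table 1: Classification accuracy (%) on the test set of CIFAR10 and CIFAR100.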

In Table 1, our method shows a consistent improvement with different data augmentation methods on various backbones. Especially for Mixup, refining the models with moderate data augmentation improves the accuracy significantly on CIFAR100. On CIFAR10, using a network searched by P-DARTS [Chen et al.2019], our method achieves a 97.97% test accuracy with 600 training epochs (550 epochs with AutoAugment and 50 epochs for refinement), whereas the same network achieves 97.81% when trained with AutoAugment for all 600 epochs. We also conduct experiments in which the PreActResNet-18 models trained with the standard data augmentation are refined without any data augmentation. There is a 0.6% accuracy gain on CIFAR100, but a 0.1% drop on CIFAR10.

In Table 2, cross entropy (CE) losses on clean and augmented data are calculated to quantify, to some extent, the distribution gap between clean and augmented data. Compared with other methods, Mixup brings the most significant difference between original and augmented data, which can explain the significant improvement with refinement for Mixup. Interestingly, AutoAugment achieves a lower CE loss on clean training data than moderate data augmentation. But refinement for AutoAugment still works well, and the CE loss on clean data decreases further after refinement. These results suggest that data augmentation indeed helps the model converge to a better region, and that the gap between clean and augmented data obstructs further convergence.

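| Methods | Augmentation: augmented | Augmentation: clean | w/ Refinement: augmented | w/ Refinement: clean |
|---|---|---|---|---|
| Moderate | 3.3 | 1.1 | - | 1.0 |
| Mixup | 1356 | 98 | 4.7 | 2.4 |
| Manifold Mixup | 1253 | 67 | 4.2 | 2.2 |
| Cutmix | 785 | 14 | 1.9 | 0.8 |
| AutoAugment | 245 | 0.9 | 1.5 | 0.8 |

Table 2: Cross entropy losses () of augmented and clean training data on CIFAR100 for PreActResNet-18. and respectively refer to and .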
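The link between the loss gap and the benefit of refinement can be checked with simple arithmetic on the reported numbers. The sketch below copies the PreActResNet-18 CIFAR100 accuracies from Table 1 and the pre-refinement losses from Table 2, reading the first loss column as augmented data and the second as clean data, which is an interpretation based on the accompanying discussion.

```python
# (accuracy before refinement, accuracy after refinement), PreActResNet-18, CIFAR100.
accuracy = {
    "Moderate": (75.4, 76.0),
    "Mixup": (78.8, 80.8),
    "Manifold Mixup": (80.4, 81.5),
    "Cutmix": (79.8, 80.6),
    "AutoAugment": (79.3, 80.0),
}
# (CE loss on augmented data, CE loss on clean data) before refinement.
ce_loss = {
    "Moderate": (3.3, 1.1),
    "Mixup": (1356, 98),
    "Manifold Mixup": (1253, 67),
    "Cutmix": (785, 14),
    "AutoAugment": (245, 0.9),
}

for name in accuracy:
    gain = accuracy[name][1] - accuracy[name][0]
    gap = ce_loss[name][0] - ce_loss[name][1]
    print(f"{name:15s} loss gap = {gap:8.1f}   accuracy gain = {gain:.1f}")

largest_gap = max(ce_loss, key=lambda m: ce_loss[m][0] - ce_loss[m][1])
largest_gain = max(accuracy, key=lambda m: accuracy[m][1] - accuracy[m][0])
```

Mixup shows both the largest loss gap and the largest accuracy gain, in line with the explanation given above; the ordering of the remaining methods is similar, although five data points are too few to claim more than a trend.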

Ablation Studies

Training Epochs

In the previous experiments, we train models for a few more epochs for refinement. Here we keep the total number of training epochs constant to verify the effectiveness of our method. We train PreActResNet-18 on CIFAR100 with Mixup, and the learning rate is divided by 10 at epochs 150 and 275. Figure 3 shows the test error curve for different numbers of refining epochs with a total of 400 training epochs. In addition, if we train models on CIFAR with intensive data augmentation for a fixed number of epochs, the number of refining epochs does not influence the results much once the model has converged.

Figure 3: Test error (averaged over 5 runs) of PreActResNet-18 trained on CIFAR100 as a function of number of refining epochs (total of 400 training epochs). Bars represent the range of test errors for each number.

We also find that increasing the number of training epochs with intensive data augmentation benefits the performance after refinement, even though AutoAugment causes obvious over-fitting on augmented data when trained for many epochs. Table 3 shows that refinement improves accuracy significantly and suppresses the over-fitting on augmented data.

Models AutoAugment w/ Refinement
, 79.3 80.0
, 78.5 80.5
, 83.7 84.2
, 83.8 84.7
Table 3: Classification accuracy () for different training epochs with AutoAugment on CIFAR100. and refer to the number of epochs with AutoAugment and the number of refinement epochs, respectively.

Learning Rate while refining

In the previous theoretical analysis, we conjecture that refinement with a small learning rate will not bring the DNN back to these bad minima. We conduct experiments in which refinement starts with a large learning rate that decays gradually. The performance is worse than refining with a small learning rate.

From Augmented Data to Clean Data Gradually

We also try weakening the intensity of data augmentation gradually, so that the models are refined from augmented data to clean data gradually. For Mixup, the Mixup coefficient is decreased to 0 gradually. For CutMix and AutoAugment, we gradually decrease the probability of applying data augmentation. The results show no significant difference from the earlier experiments.
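One way to weaken the augmentation gradually is to anneal its strength over the refinement epochs. The shape of the schedule is not specified in the text, so the linear decay below is purely an assumption for illustration, and `linear_decay` is a hypothetical helper.

```python
def linear_decay(start_value, epoch, total_epochs):
    """Linearly anneal a value from `start_value` at epoch 0 to 0 at the last epoch."""
    if total_epochs <= 1:
        return 0.0
    return start_value * (1.0 - epoch / (total_epochs - 1))

# Example: probability of applying an augmentation, annealed from 0.5 over 50 epochs.
schedule = [linear_decay(0.5, epoch, 50) for epoch in range(50)]
print(schedule[0], schedule[25], schedule[-1])
```

The same helper could be applied to the Mixup coefficient instead of an application probability; in both cases the final refinement epoch sees clean data only.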


Tiny-ImageNet

Tiny-ImageNet consists of 200 image classes with 500 training and 50 validation images per class. The samples are 64×64 color images. We train models on the training set and test on the validation set. We train PreActResNet-18 for 400 epochs with intensive data augmentation, and refine the models for 50 epochs. Other hyper-parameters for model training are the same as the settings in the previous experiments.

For the coefficient in Mixup and Manifold Mixup, we set to be and . We also show the results of different probabilities of applying CutMix. The data augmentation policies found by AutoAugment were searched on CIFAR10 and ImageNet. Here we apply both the CIFAR-policy and the ImageNet-policy on Tiny-ImageNet. Following the setting in [Cubuk et al.2019], we apply Cutout with pixels after the CIFAR-policy. The results are recorded in Table 4.

| PreActResNet-18 | Augmentation Best | Augmentation Last | w/ Refinement Best | w/ Refinement Last |
|---|---|---|---|---|
| Moderate | 60.84 | 60.68 | - | - |
| | 63.18 | 63.08 | 64.54 | 64.20 |
| | 63.95 | 63.34 | 65.45 | 65.08 |
| Manifold Mixup | | | | |
| | 63.66 | 63.28 | 64.54 | 64.33 |
| | 64.88 | 64.43 | 65.98 | 65.80 |
| | 64.90 | 64.61 | 65.84 | 65.59 |
| | 65.97 | 65.23 | 66.29 | 65.87 |
| CIFAR-Policy | 65.08 | 64.29 | 65.31 | 65.12 |
| ImageNet-Policy | 61.06 | 60.65 | 61.82 | 61.75 |

Table 4: Classification accuracy (%) on the validation set of Tiny-ImageNet. Both best and last results are reported.

ImageNet (ILSVRC2012)

The ILSVRC2012 classification dataset [Russakovsky et al.2015] consists of 1,000 classes of images, including 1.3 million training images and 50,000 validation images. We train models with an initial learning rate of 0.1 and a mini-batch size of 256 on 8 GPUs and follow the standard data augmentation: scale and aspect ratio distortions, random crops, and horizontal flips. For Mixup, the models are trained for 200 epochs, and the learning rate is divided by 10 at epochs 60, 120, and 180. Following Zhang et al. [Zhang et al.2017b], we set the Mixup coefficient to 0.2 for ResNet-50 and 0.5 for ResNet-101. For AutoAugment, ResNet-50 is trained for 300 epochs, and the learning rate is divided by 10 at epochs 75, 150, and 225, while ResNet-101 is trained for 200 epochs. We refine all models with standard data augmentation for 20 epochs by learning rate of . In Table 5, ResNet-50 for Mixup of performs worse than Mixup of , however, they achieve similar accuracy after being refined.

Models               Augmentation        w/ Refinement
                     Top-1    Top-5      Top-1    Top-5
ResNet-50
Standard             76.39    93.19      -        -
Mixup ()             77.47    93.75      77.69    93.83
Mixup ()             77.26    93.78      77.65    93.93
AutoAugment          77.83    93.70      77.98    93.86
ResNet-101
Standard             78.13    93.71      -        -
Mixup ()             79.41    94.70      79.61    94.73
AutoAugment          79.20    94.45      79.33    94.46
Table 5: Classification accuracy (%) on ImageNet.
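As a quick arithmetic check on Table 5, the top-1 gain from refinement can be computed row by row. The snippet below is only a convenience sketch: the variable names are illustrative, and the rows are listed in the order in which they appear in the table.

```python
# (top-1 before refinement, top-1 after refinement), in Table 5 row order.
# The two "Standard" baselines are skipped because they are not refined.
top1 = [
    (77.47, 77.69),  # Mixup, first row
    (77.26, 77.65),  # Mixup, second row
    (77.83, 77.98),  # AutoAugment, first occurrence
    (79.41, 79.61),  # Mixup, third row
    (79.20, 79.33),  # AutoAugment, second occurrence
]

# Gain in percentage points, rounded to the precision reported in the table.
gains = [round(after - before, 2) for before, after in top1]
print(gains)
```

Every refined model improves, with top-1 gains between 0.13 and 0.39 percentage points.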


This paper presents a simple but effective algorithm for network optimization, which applies (mostly complicated) augmentation to generate abundant training data, but uses clean data to refine the model in the last training epochs. In this way, it is possible to arrive at a reduced testing loss, with the generalization error and empirical risk balanced. We also show intuitively that augmented training enables the model to traverse a large range in the feature space, while refinement helps it get close to a local minimum. Consequently, models trained in this manner achieve higher accuracy in a wide range of image classification tasks, including on the CIFAR and ImageNet datasets.

Our work sheds light on another direction of data augmentation, complementary to the currently popular trend of designing ever more complicated methods for data generation. It would also be interesting to combine refined augmentation with other algorithms, e.g., a cosine-annealing schedule for refinement, or to add this option to the large space explored in automated machine learning.
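For reference, a cosine-annealing schedule of the kind mentioned above could look like the following sketch. It uses the standard cosine-annealing formula; the function name `cosine_annealed_lr`, the starting rate of 0.01, and the 20-epoch horizon are illustrative assumptions rather than settings evaluated in this paper.

```python
import math


def cosine_annealed_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine-annealed learning rate, decaying from lr_max to lr_min."""
    if not 0 <= step <= total_steps:
        raise ValueError("step outside the annealing horizon")
    cos_term = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cos_term


# Hypothetical refinement phase of 20 epochs starting from a rate of 0.01.
for epoch in (0, 5, 10, 15, 20):
    print(epoch, cosine_annealed_lr(epoch, total_steps=20, lr_max=0.01))
```

Compared with a constant refinement rate, this decays smoothly to the minimum rate by the end of the refinement phase.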


  • [Brendel and Bethge2019] Brendel, W., and Bethge, M. 2019. Approximating CNNs with bag-of-local-features models works surprisingly well on imagenet. In International Conference on Learning Representations.
  • [Chen et al.2019] Chen, X.; Xie, L.; Wu, J.; and Tian, Q. 2019. Progressive differentiable architecture search: Bridging the depth gap between search and evaluation. In International Conference on Computer Vision.
  • [Ciregan, Meier, and Schmidhuber2012] Ciregan, D.; Meier, U.; and Schmidhuber, J. 2012. Multi-column deep neural networks for image classification. In The IEEE Conference on Computer Vision and Pattern Recognition.
  • [Cubuk et al.2019] Cubuk, E. D.; Zoph, B.; Mane, D.; Vasudevan, V.; and Le, Q. V. 2019. Autoaugment: Learning augmentation strategies from data. In The IEEE Conference on Computer Vision and Pattern Recognition.
  • [DeVries and Taylor2017] DeVries, T., and Taylor, G. W. 2017. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.
  • [Gastaldi2017] Gastaldi, X. 2017. Shake-shake regularization. arXiv preprint arXiv:1705.07485.
  • [Gatys, Ecker, and Bethge2015] Gatys, L.; Ecker, A. S.; and Bethge, M. 2015. Texture synthesis using convolutional neural networks. In Advances in neural information processing systems.
  • [Gatys, Ecker, and Bethge2017] Gatys, L. A.; Ecker, A. S.; and Bethge, M. 2017. Texture and art with deep neural networks. Current Opinion in Neurobiology 46:178 – 186. Computational Neuroscience.
  • [Geirhos et al.2019] Geirhos, R.; Rubisch, P.; Michaelis, C.; Bethge, M.; Wichmann, F. A.; and Brendel, W. 2019. Imagenet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations.
  • [Guo, Mao, and Zhang2019] Guo, H.; Mao, Y.; and Zhang, R. 2019. Mixup as locally linear out-of-manifold regularization. In The AAAI Conference on Artificial Intelligence.
  • [Hardt, Recht, and Singer2016] Hardt, M.; Recht, B.; and Singer, Y. 2016. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning.
  • [He et al.2016a] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016a. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition.
  • [He et al.2016b] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016b. Identity mappings in deep residual networks. In European conference on computer vision. Springer.
  • [He et al.2019] He, T.; Zhang, Z.; Zhang, H.; Zhang, Z.; Xie, J.; and Li, M. 2019. Bag of tricks for image classification with convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition.
  • [Ioffe and Szegedy2015] Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning.
  • [Keskar and Socher2017] Keskar, N. S., and Socher, R. 2017. Improving generalization performance by switching from adam to sgd. In International Conference on Learning Representations.
  • [Krizhevsky, Hinton, and others2009] Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.
  • [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems.
  • [Lemley, Bazrafkan, and Corcoran2017] Lemley, J.; Bazrafkan, S.; and Corcoran, P. 2017. Smart augmentation learning an optimal data augmentation strategy. IEEE Access 5:5858–5869.
  • [Russakovsky et al.2015] Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. Imagenet large scale visual recognition challenge. International journal of computer vision 115(3):211–252.
  • [Sato, Nishimura, and Yokoi2015] Sato, I.; Nishimura, H.; and Yokoi, K. 2015. Apac: Augmented pattern classification with neural networks. arXiv preprint arXiv:1505.03229.
  • [Srivastava et al.2014] Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958.
  • [Szegedy et al.2015] Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition.
  • [Tran et al.2017] Tran, T.; Pham, T.; Carneiro, G.; Palmer, L.; and Reid, I. 2017. A bayesian data augmentation approach for learning deep models. In Advances in neural information processing systems.
  • [Vapnik1998] Vapnik, V. 1998. Statistical learning theory. Wiley New York.
  • [Verma et al.2019] Verma, V.; Lamb, A.; Beckham, C.; Najafi, A.; Mitliagkas, I.; Lopez-Paz, D.; and Bengio, Y. 2019. Manifold mixup: Better representations by interpolating hidden states. In International Conference on Machine Learning.
  • [Yun et al.2019] Yun, S.; Han, D.; Oh, S. J.; Chun, S.; Choe, J.; and Yoo, Y. 2019. Cutmix: Regularization strategy to train strong classifiers with localizable features. In International Conference on Computer Vision.
  • [Zagoruyko and Komodakis2016] Zagoruyko, S., and Komodakis, N. 2016. Wide residual networks. In Proceedings of the British Machine Vision Conference.
  • [Zhang et al.2017a] Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; and Vinyals, O. 2017a. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations.
  • [Zhang et al.2017b] Zhang, H.; Cisse, M.; Dauphin, Y. N.; and Lopez-Paz, D. 2017b. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations.