Data augmentation is an effective way to improve the performance of CNN models for image recognition tasks, particularly when its policy is optimized for the target dataset. Conventional data augmentation for images consists of image transformation operations, such as random cropping and flipping, and color enhancing including modification of color intensities Krizhevsky2012; He2016b. However, designing good data augmentation strategies requires profound understanding of the target data and operations. For example, CutOut DeVries2017 randomly erases a patch region of each image and is known to improve performance on the CIFAR-10 dataset, but is also reported to degrade the performance on other datasets, e.g., ImageNet Lopes2019.
Therefore, automatically designing effective augmentation strategies according to target data and tasks is desirable to improve the performance of image recognition models. One approach to augment existing data is to generate new data samples by interpolating several training images Hauberg2015; Devries2017; Zhang2018 or by using conditional generative models Antoniou2018; Baluja2018. However, this approach requires a large amount of labeled data Rather2017 and sometimes fails to improve the performance Ravuri2019, even if powerful conditional generative models are used. On the other hand, some methods improve the performance by efficiently selecting effective combinations of image transformation operations from exponentially large candidate pools Rather2017; Cubuk2018a; Zhang2020. In particular, AutoAugment Cubuk2018a and its family Ho2019; Lim2019; Hataya2019a; Lin2020; Cubuk2019; Berthelot2019a optimized combinations of operations to improve validation performance and achieved state-of-the-art results.
These methods involve a bi-level optimization: the inner process optimizes parameters of a CNN on training data using a given combination of operations, and the outer process optimizes the combination of operations to maximize the validation performance. Particularly, the inner loop, i.e., training of a CNN, is usually expensive. Therefore, prior works use proxy tasks that adopt small subsets of training datasets or small models Cubuk2018a; Ho2019; Lim2019; Hataya2019a or reduce the search space Cubuk2019; Berthelot2019a; Lin2020 to keep the entire training feasible. Such approaches may result in sub-optimal solutions. We will further review and formalize this problem in Section 2.
In this paper, we tackle the original bi-level optimization problem directly without using proxy tasks or reducing the search space. We propose Meta Approach to Data Augmentation Optimization (MADAO), which optimizes CNNs and augmentation policies simultaneously by using gradient-based optimization. Here, policies are updated so that they directly increase CNNs’ performance. Naïvely applying gradient-based optimization to this bi-level optimization requires differentiation through the inner optimization process Finn2017b or computation of the inverse Hessian matrix Bengio2000, both of which suffer from large space complexity. These problems are fatal because data augmentation optimization needs to handle large networks, e.g., CNNs for ImageNet. We bypass these issues by using the implicit gradient method with Neumann series approximation. Thanks to these approximations, MADAO is simple with little overhead as shown in Section 1. Notably, this simplicity allows MADAO to scale to problems of ImageNet size, which has been nearly impossible for existing bi-level optimization methods hutter19.
We empirically demonstrate that MADAO learns effective data augmentation policies and achieves performance comparable or even superior to existing methods on benchmark datasets for image classification: CIFAR-10, CIFAR-100, SVHN and ImageNet, as well as four fine-grained datasets. In addition, we show that MADAO improves the performance of self-supervised learning on ImageNet. All of the reported results have been achieved without using dataset-specific configurations.
    cnn.initialize()     # initialize the CNN parameters
    policy.initialize()  # initialize the policy parameters

    for epoch in range(num_epochs):
        for train_data, val_data in data_loader:
            for i in range(num_inner_iters):
                input = policy(train_data[i])  # augment data by policy
                criterion = cnn.train(input)   # training criterion referred to in the text
                cnn.update(criterion)
            vcriterion = cnn.val(val_data)     # validation criterion referred to in the text
            policy.update(vcriterion)
2 Generalizing Data Augmentation Optimization
2.1 Designing Data Augmentation Space
Let us define a set of input images and a set of operations consisting of data augmentation operations such as rotation and color inversion. In the AutoAugment family, each image is augmented by an operation that is applied with a given probability and a given magnitude, as illustrated in Figure 1. The magnitude parameter can correspond, for example, to the degree of rotation, while some operations, such as inversion, have no magnitude parameter. By applying consecutive augmentation operations, each image results in possible images, that is, the number of images virtually increases. This formulation makes the size of the search space , where is the size of the operation set.
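As a minimal sketch of this sampling scheme, the snippet below applies a chain of operations, each executed with its own probability and magnitude. The names `apply_policy`, `brightness` and `invert`, the toy "image" represented as a list of floats in [0, 1], and the parameter values are illustrative assumptions rather than the implementation used in the paper.

```python
import random


def brightness(image, magnitude):
    # Toy operation (assumed): shift every pixel by `magnitude`, clipped to [0, 1].
    return [min(1.0, max(0.0, px + magnitude)) for px in image]


def invert(image, magnitude=None):
    # Toy operation without a magnitude parameter.
    return [1.0 - px for px in image]


def apply_policy(image, stages, rng):
    """Apply one (operation, probability, magnitude) triple per stage."""
    for op, prob, magnitude in stages:
        if rng.random() < prob:  # the operation fires with probability `prob`
            image = op(image, magnitude)
    return image


rng = random.Random(0)
stages = [(brightness, 0.5, 0.2), (invert, 0.5, None)]
image = [0.1, 0.5, 0.9]
# The same source image can yield several different augmented images.
augmented = [apply_policy(image, stages, rng) for _ in range(4)]
```

Because each stage may or may not fire, repeated calls on the same image produce different outputs, which is the sense in which the number of images virtually increases.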
Operations and accompanied parameters need to be selected so that they minimize the validation criterion, such as the error rate. Usually, this selection is performed heuristically Krizhevsky2012; He2016b. However, Cubuk2018a showed that data-driven optimization surpasses handcrafted selection.
2.2 Generalizing AutoAugment Family
Let denote parameters of a CNN model and denote parameters of a policy for augmentation, i.e., selection of operations from the operation set and their accompanied parameters . Let the empirical risk be and the validation criterion be . and are training and validation datasets. In the case of classification task, , where is the number of classes.
Optimization of data augmentation policy in AutoAugment family methods can be generalized as
that is, optimizing CNNs on training data with policies that minimize validation criteria on validation data.
Naïvely solving this bi-level minimization problem takes a long time because CNN training is costly and the number of possible combinations of augmentation operations and their parameters is infeasibly large. Therefore, prior works tried to alleviate this problem in several ways. One direction is to reduce the search space over augmentation policies, i.e., dimension and range of the parameter Cubuk2019; Berthelot2019a; Lin2020. For example, RandAugment Cubuk2019 randomly samples operations from the operation set and shares among all operations. This reduction changes the outer problem in Equation 1 from to , which makes it possible to use a simple searching process, such as grid search. OHL-AutoAug Lin2020 enables online searching using policy gradient Williams1992 by restricting the search space only to a limited range.
On the other hand, some methods use proxy tasks that approximate Equation 1 to obtain (sub-) optimal policy to reduce the searching time of the inner optimization. The obtained policy is then used to train a CNN as Cubuk2018a; Ho2019; Lim2019; Hataya2019a in an “offline” manner. For instance, AutoAugment employs a proxy task that approximates the original inner problem as
with a smaller dataset and a smaller network for efficiency. The outer problem
is optimized by black-box optimization techniques, such as reinforcement learning Cubuk2018a. Fast AutoAugment Lim2019 and Faster AutoAugment Hataya2019a approximate Equation 1 as minimizing the distance between the distributions of augmented images and original images without directly minimizing the validation criterion, which allows faster searching.
|Method||Direct Inner Problem||Direct Outer Problem|
|Fast AutoAugment Lim2019||✓|
In this paper, we propose to directly optimize the bi-level optimization problem in Equation 1. In other words, we optimize CNNs and augmentation policies simultaneously, i.e., in an online manner, without reducing the search space or using proxy tasks. With this simultaneous optimization, policies are expected to augment images according to the learning state of CNN models. Section 1 is a simple depiction of our approach, which we call Meta Approach to Data Augmentation Optimization, or in short, MADAO.
3.1 Optimizing Policies by Gradient Descent
MADAO directly optimizes the bi-level problem Equation 1 using gradient descent. To this end, we assume that and are differentiable w.r.t. . Taking -category classification as an example, cross entropy can be used as and , but error rate cannot. Here, is the indicator function, and is a CNN with a softmax output layer.
Gradient-based optimization of Equation 1 requires for iterative updating. Since the data augmentation implicitly affects the validation criterion, in other words, data augmentation is not used for validation, we obtain
Because of the requirement of , can be obtained. On the other hand, the exact computation of has a large space complexity, as we will describe in Equation 12. Yet, if this gradient was available, the policies could be optimized by gradient descent.
3.2 Approximating Gradients of Policy and Inverse Hessian
To obtain without suffering from a large space complexity, we can use the Implicit Function Theorem. If there exists a fixed point that satisfies , then there exists a function around such that . If this condition holds, we also obtain
We can obtain an approximated gradient using this property. Unfortunately, the number of CNN parameters is usually large; therefore, computing the inverse Hessian matrix is prohibitively expensive, as it usually requires a number of computations that is cubic in the number of parameters. To avoid computing the inverse Hessian matrix, we use iterative methods based on the Neumann series, which boast better scalability than conjugate gradients in various problems Koh2017; Liao2018; Lorraine2019.
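For orientation, the implicit-gradient computation described here can be written out as follows. The symbols are assumed notation introduced only for this sketch: $\theta$ for the CNN parameters, $\phi$ for the policy parameters, $f$ for the training loss and $g$ for the validation criterion. The expression is the standard form obtained from the Implicit Function Theorem, using the fact that the validation criterion does not depend on the policy directly.

```latex
% Assumed notation: \theta (CNN parameters), \phi (policy parameters),
% f (training loss), g (validation criterion).
\begin{align}
\left.\frac{\partial f}{\partial \theta}\right|_{\theta^{*}(\phi)} = 0
\quad &\Longrightarrow\quad
\frac{\partial \theta^{*}}{\partial \phi}
  = -\left(\frac{\partial^{2} f}{\partial \theta\,\partial \theta^{\top}}\right)^{-1}
     \frac{\partial^{2} f}{\partial \theta\,\partial \phi^{\top}}, \\
\frac{\mathrm{d} g}{\mathrm{d} \phi}
  &= \frac{\partial g}{\partial \theta}\,\frac{\partial \theta^{*}}{\partial \phi}
  = -\frac{\partial g}{\partial \theta}
     \left(\frac{\partial^{2} f}{\partial \theta\,\partial \theta^{\top}}\right)^{-1}
     \frac{\partial^{2} f}{\partial \theta\,\partial \phi^{\top}}.
\end{align}
```

The only expensive factor is the inverse Hessian, which is why the rest of this section focuses on approximating its product with a vector.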
The Neumann series holds with a given squared matrix if . Using this property, Equation 3 can be approximated with a positive integer as
We regularize the norm by simply introducing a scalar, as in Lorraine2019. We use with in the experiments. This Neumann series approximation can also help us avoid explicitly storing the Hessian matrix, whose space complexity is quadratic in the number of parameters, by using Hessian-vector products derived from the following identity:
This right-hand side can be used instead of storing the Hessian matrix and only has space complexity linear in the number of parameters.
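The truncated Neumann series can be illustrated with a small, self-contained sketch. It approximates the product of an inverse matrix with a vector using only matrix-vector products, which is the role Hessian-vector products play in the method. The function names, the scaling scalar `alpha`, the number of terms and the explicit 2x2 matrix standing in for a Hessian are assumptions made for illustration; in practice the `hvp` callable would be supplied by automatic differentiation.

```python
def matvec(matrix, vector):
    # Dense matrix-vector product, standing in for a Hessian-vector product.
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def neumann_inverse_vp(hvp, vector, num_terms, alpha):
    """Approximate H^{-1} v by alpha * sum_{j=0}^{K} (I - alpha * H)^j v.

    Valid when the norm of (I - alpha * H) is below one; `hvp` maps x to H x,
    so H itself is never stored or inverted.
    """
    term = list(vector)   # (I - alpha * H)^0 v
    total = list(vector)  # running sum of the series
    for _ in range(num_terms):
        hv = hvp(term)
        term = [t - alpha * h for t, h in zip(term, hv)]  # multiply by (I - alpha * H)
        total = [s + t for s, t in zip(total, term)]
    return [alpha * s for s in total]


# Small positive-definite matrix used purely to check the approximation.
H = [[2.0, 0.5], [0.5, 1.0]]
v = [1.0, -1.0]
approx = neumann_inverse_vp(lambda x: matvec(H, x), v, num_terms=200, alpha=0.3)
exact = [1.5 / 1.75, -2.5 / 1.75]  # H^{-1} v computed by hand
```

Each loop iteration costs one matrix-vector product and keeps only two vectors in memory, so truncating the series after a few terms trades accuracy for speed.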
3.3 Differentiable Data Augmentation
To differentiate through , MADAO adopts the differentiable data augmentation pipeline following Hataya2019a. As described in Section 2.1, each image is augmented with an operation with a magnitude of and a probability of , which can be written as
This right-hand side can be rewritten using a binary stochastic variable. Although the original Bernoulli distribution is not differentiable w.r.t. its probability parameter, the Gumbel trick Jang2016 relaxes this restriction to enable backpropagation to update it. Similarly, some color-enhancing operations are non-differentiable w.r.t. the magnitude, so the straight-through estimator Bengio2013 is used for such operations. We clamp the probability and magnitude parameters by a sigmoid function to limit their range to [0, 1].
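A relaxed Bernoulli gate of this kind can be sketched as follows. The sample is a sigmoid of the logit plus logistic noise, divided by a temperature, so it stays in [0, 1] and approaches a hard 0/1 draw as the temperature shrinks. The function names, the temperature value and the convex mixing in `soft_apply` are assumptions for illustration and are not taken from the elided equations; gradients would require an autodiff framework, which is omitted here.

```python
import math
import random


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relaxed_bernoulli(prob, temperature, rng):
    """Draw a continuous relaxation of a Bernoulli(prob) sample in [0, 1]."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # avoid log(0)
    logistic_noise = math.log(u) - math.log(1.0 - u)
    logit = math.log(prob) - math.log(1.0 - prob)
    return sigmoid((logit + logistic_noise) / temperature)


def soft_apply(image, transformed, gate):
    # Assumed mixing: gate near 1 keeps the transformed image, near 0 the original.
    return [gate * t + (1.0 - gate) * x for x, t in zip(image, transformed)]


rng = random.Random(0)
gate = relaxed_bernoulli(prob=0.8, temperature=0.05, rng=rng)
mixed = soft_apply([0.2, 0.4], [0.8, 0.6], gate)
```

With a low temperature, most gates land very close to 0 or 1, so the mixed output is almost always either the original or the transformed image.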
MADAO uses a different method to select operations compared to Faster AutoAugment in order to accelerate training. MADAO selects operations using a categorical distribution parameterized by a weight parameter , where . Since the original categorical distribution is non-differentiable, like the Bernoulli distribution, we use Gumbel-softmax with a temperature of . This distribution, referred to as , samples one-hot-like vectors as . Using this distribution, an operation is selected and applied as
Here, SG is the stop gradient operation, and thus, so that the transformation Equation 8 keeps the range in . Different from this approach, Faster AutoAugment applies all operations and takes the weighted sum of the outputs to approximate this selection.
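The sampling side of this selection can be sketched with the standard library alone. The snippet draws a Gumbel-softmax sample over operation weights and picks the operation with the largest entry. The function names and the example logits are assumptions; the temperature of 0.05 follows the value listed for operation selection in Appendix B. The stop-gradient step needs an autodiff framework and is therefore only described in a comment.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Return a one-hot-like probability vector sampled with Gumbel noise."""
    gumbels = []
    for _ in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # avoid log(0)
        gumbels.append(-math.log(-math.log(u)))
    scaled = [(l + g) / temperature for l, g in zip(logits, gumbels)]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def select_operation(weights):
    # Index of the largest entry. In an autodiff framework, the selected
    # operation's output would be rescaled with a stop-gradient term so that
    # the forward value is unchanged while gradients reach the weights.
    return max(range(len(weights)), key=weights.__getitem__)


rng = random.Random(0)
logits = [math.log(0.7), math.log(0.2), math.log(0.1)]
weights = gumbel_softmax(logits, temperature=0.05, rng=rng)
chosen = select_operation(weights)
```

Only the chosen operation has to be executed, in contrast to evaluating every operation and averaging the outputs.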
3.4 Connection to Gradient-based Hyperparameter Optimization
As can be seen, Equation 1 is a hyperparameter optimization (HO) problem for neural networks. Traditional HO methods, such as grid search, random search Bergstra2012 and Bayesian optimization Snoek2012, scale poorly with the dimensionality of hyperparameters hutter19. For that reason, gradient-based HO has attracted attention. From the HO viewpoint, policy parameters are hyperparameters.
after SGD steps with learning rate of . One approach to obtain is to unroll the steps in Equation 11 as Maclaurin2015; Franceschi2018; Shaban2018:
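This unrolling requires caching the intermediate parameters of every step, and thus, the space complexity grows with the number of unrolled steps, which might be prohibitive for large neural networks, while the memory requirement of MADAO does not grow in this way.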
Alternatively, the implicit gradient yields as explained in Equation 4. This approach needs the inverse Hessian matrix Bengio2000, but this computation is infeasible for modern neural networks with millions of parameters. Iterative methods, such as conjugate gradient Do2009; Domke2012; Pedregosa2016a; Rajeswaran or Neumann series approximation Lorraine2019, effectively compensate for this issue by approximating this inverse matrix in gradient-based hyperparameter optimization. Such iterative approximation methods using the Neumann series are also used in approximating influence functions Koh2017 and enabling RNNs to handle long sequences Liao2018. We exploit the knowledge from these prior works and adopt the implicit gradient method with Neumann series approximation to efficiently handle large-scale datasets and CNNs.
4 Experiments and Results
This section describes the empirical results of our proposed method in supervised learning for image classification and self-supervised learning tasks. For classification tasks, we use CIFAR-10, CIFAR-100 Krizhevsky2009, SVHN Netzer2011 and ImageNet (ILSVRC-2012) Russakovsky2015. In addition, we use fine-grained classification datasets: Oxford 102 Flowers Nilsback08, Oxford-IIIT Pets parkhi12a, FGVC Aircraft maji13 and Stanford Cars Krause2013. For the self-supervised learning task, we use ImageNet. In all experiments except those on ImageNet, we set 10 % of the original training data aside as validation data and report error rates on test data. For ImageNet, we use 1 % of the training data as validation data. Note that this data split means that we use less training data than prior works, which changes the performance of baseline models. More details about experiment configurations are in Appendix B.
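The hold-out split described above can be sketched as follows. The helper name `split_indices`, the fixed seed and the use of shuffled index lists are assumptions for illustration; the figure of 50,000 images is the size of the standard CIFAR-10 training set.

```python
import random


def split_indices(num_samples, val_fraction, seed=0):
    """Shuffle sample indices and hold out a fraction for validation."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    num_val = int(round(num_samples * val_fraction))
    return indices[num_val:], indices[:num_val]  # (train, validation)


# 10 % of the CIFAR-10 training set is held out as validation data.
train_idx, val_idx = split_indices(50_000, 0.10)
```

The policy is updated on the held-out part only, so the CNN is trained on fewer images than in settings that use the full training set.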
We used 14 operations for augmentation: ShearX, ShearY, TranslateX, TranslateY, Rotate, Invert, AutoContrast, Equalize, Solarize, Color, Posterize, Contrast, Brightness and Sharpness (see Appendix A for more details). To make these operations differentiable, we implemented them using PyTorch NIPS2019_9015 and kornia eriba2019kornia. We scale the magnitude of operations from 0 to 1 so that a magnitude of 0 means no change and a magnitude of 1 corresponds to the strongest level of the particular augmentation operation. We initialized the parameters of magnitude and probability with and the parameters for operation selection to be equal. That means that MADAO in its initial state is nearly equivalent to RandAugment with the magnitude of . We set the number of augmentation stages to two in all experiments below (see Figure 1).
For training, we used SGD with momentum for CNN optimization and the RMSprop optimizer with a learning rate of for policy optimization, following Lorraine2019. Equation 4 requires the existence of fixed points that satisfy . Following Lorraine2019, we assume that a fixed number of inner iterations, which corresponds to num_inner_iters in Section 1, makes the parameters satisfy the condition. To further encourage parameters to satisfy the condition, we also used warm-up, i.e., training CNNs without updating the augmentation policy for the first epochs. By default, we set the number of inner iterations to 30 and the number of warm-up epochs to 20 (see Appendix B). For simplicity, the validation criterion is the same loss function that is used as the training loss function, e.g., cross-entropy.
Training of WideResNet 28-2 on CIFAR-10 for 200 epochs with MADAO takes 1.7 hours, while training without MADAO takes 1.0 hours in our environment. As regards memory consumption, training with MADAO requires 3.38 GB of GPU memory, while training without MADAO requires 1.74 GB in this setting. We tune the magnitude parameter of RandAugment using three runs of random search and report the test error rate for the run with the lowest validation error rate.
CIFAR-10, CIFAR-100 and SVHN
Table 2 presents test error rates on CIFAR-10, CIFAR-100 and SVHN with various CNN models: WideResNet-28-2, WideResNet-40-2, WideResNet-28-10 Zagoruyko and ResNet-18 He2016b. We show the average scores of three runs. For comparison, we present the results of RandAugment and the default augmentation: Cutout DeVries2017, random cropping, and random horizontal flipping (except SVHN), following Cubuk2019, which we refer to as Baseline. For CIFAR-10 and CIFAR-100, we also present the results of AutoAugment, which are trained on the split training set using the policy provided by the authors (https://github.com/tensorflow/models/tree/master/research/autoaugment). RandAugment and AutoAugment are selected here as representative methods that limit the search space and use proxy tasks, respectively, as discussed in Section 2.2.
As can be seen, MADAO achieves performance superior to the baseline methods, which demonstrates the effectiveness of directly solving the bi-level problem rather than using proxy tasks as AutoAugment does or a reduced search space as RandAugment does.
Table 3 shows the results on ImageNet with ResNet-50 He2016b trained in supervised and self-supervised learning manners. In the supervised learning setting, we trained models for 180 epochs. In addition, to showcase the effectiveness of MADAO in other tasks than supervised learning, we apply MADAO to contrastive self-supervised learning he2019moco for 100 epochs of training. We report the results of the linear classification protocol.
Most importantly, MADAO can scale to an ImageNet-size problem without using proxy tasks or reducing search space. We hypothesize that the slightly inferior performance of MADAO compared to RandAugment could be attributed to the fact that ImageNet has 1,000 categories and a single policy might be insufficient for such diverse data. We also believe that class- or instance-conditional data augmentation might be required to further improve performance, which we leave as an open problem.
To showcase the ability of MADAO to augment images according to target datasets, we conducted experiments on four fine-grained datasets with ResNet-18 He2016b. Table 4 shows the average test error rates over three runs. These datasets are nearly ten to twenty times smaller than CIFARs and SVHN, yet MADAO drastically improves the performance without using dataset-specific hyperparameters. This improvement emphasizes the importance of searching for good policies according to full target datasets from the full search space. MADAO can capture the characteristics of each fine-grained dataset and generate tailored policies for each dataset from a large search space.
|Supervised Learning||ResNet-50||23.7/ 6.9||22.5/ 6.4||23.0/ 6.6|
|Dataset||Model||# Classes||# Training Set||Standard||RandAugment||MADAO (ours)|
5.1 How Policies Develop during Training
In Figure 2, we present the development of policies during training on fine-grained datasets. As can be seen, each dataset has its own set of frequently selected operations, which can be thought of as reflecting the characteristics of that dataset. Besides, the selection changes according to the learning phase. Note that the first 20 epochs are used for warm-up, during which the augmentation policy parameters are not updated. We show further observations in Appendix C.
5.2 How Inner Steps Affect the Performance
Figure 3 presents test error rates with various numbers of inner update steps using WideResNet-28-2 on CIFAR-10. CNNs yield the best performance when and the results indicate that there exists a trade-off between “exploration and exploitation” of obtained policies: a small number of inner steps might not correctly evaluate the current policies, while running a large number of inner steps might fail to explore better strategies. Importantly, unrolling-based implementations would require storing model caches, which is infeasible for with modern CNNs. On the other hand, MADAO can efficiently handle a large number of inner steps.
In this paper, we have proposed MADAO, a novel approach to optimize an image recognition model and its data augmentation policy simultaneously. To efficiently achieve this goal, we use the implicit gradient method with Neumann series approximation. As a result, the overhead of MADAO over standard CNN training, w.r.t. time and memory, is marginal, which enables ImageNet-scale training. Empirically, we demonstrate on various tasks that MADAO achieves superior performance to prior works without restricting the search space or using sub-optimal proxy tasks.
Data augmentation is known to boost the performance in various visual representation learning settings, such as semi-supervised learning Berthelot2019a; Suzuki2020, domain generalization Volpi2018, and self-supervised learning Chen2020. We believe that our method can be introduced into these representation learning methods and efficiently enhance their performance, as we showcased with self-supervised learning in this paper.
Potential Impacts of Our Work
MADAO boosts performance of image recognition models with little overhead and minimal hyperparameter tuning. These advantages help to reduce energy consumption, which is a known issue of large-scale hyperparameter optimization, such as neural architecture search strubell2019. MADAO showcases superior performance, especially on fine-grained datasets. This property enables further application of image recognition models to various domains, but it may also help in misuse of image recognition.
Appendix A Operations Used in Policies
We introduce the operations used in MADAO in Table 5. Internally, magnitudes are restricted to the range [0, 1] with a sigmoid function and rescaled to the appropriate range. For example, we multiply the internal magnitude for ShearX operation by . As can be seen, there are three operations that have no magnitude parameters. Therefore, each policy has 11 + 14 + 14 = 39 learnable parameters per augmentation stage, where 11 corresponds to the number of magnitude parameters, the first 14 corresponds to the number of probability parameters, and the second 14 corresponds to the number of operation selection parameters. In our experiments, we use two augmentation stages, and thus, the total number of learnable parameters is 78. Note that the original implementation of RandAugment does not include Invert in the operation set, but we perform experiments with RandAugment using the same operation set as we use for our proposed method, that is, including Invert.
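The parameter count can be checked with a few lines of arithmetic. The constant names are assumptions, and the value of two augmentation stages is the one consistent with the stated total of 78.

```python
NUM_OPERATIONS = 14         # operations in the operation set
NUM_WITHOUT_MAGNITUDE = 3   # operations that take no magnitude parameter
NUM_STAGES = 2              # assumed: the value consistent with the total of 78

magnitude_params = NUM_OPERATIONS - NUM_WITHOUT_MAGNITUDE  # 11
probability_params = NUM_OPERATIONS                        # 14
selection_params = NUM_OPERATIONS                          # 14

per_stage = magnitude_params + probability_params + selection_params
total = per_stage * NUM_STAGES
```

The policy is therefore tiny compared with the CNN it is trained alongside, which has millions of parameters.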
|Operation||Original Magnitude Range|
|Color Enhancing Operations||Invert||none|
Appendix B Experimental Details
|Number of inner steps||corresponds to num_inner_iters in Section 1||30|
|Warm-up epochs||Initial th epochs that policy is not updated||20|
|Temperature||for operation selection||0.05|
B.1 CIFAR-10, CIFAR-100 and SVHN
On CIFAR-10 and CIFAR-100, we trained WideResNets and ResNet-18 for 200 epochs. We used SGD with an initial learning rate of 0.1, a momentum of 0.9 and a weight decay of . The learning rates were scheduled with cosine annealing with warm restarts Loshchilov2016. On SVHN, we trained WideResNet 28-2 for 160 epochs. We used SGD with an initial learning rate of , a momentum of 0.9 and a weight decay of . The learning rate is divided by 10 at the 80th and 120th epochs. On CIFAR-10, CIFAR-100 and SVHN, we set the batch size to 128.
On ImageNet, we trained ResNet-50 for 180 epochs using SGD with a base initial learning rate of 0.1, a momentum of 0.9 and a weight decay of . The learning rate is divided by 10 at the 60th, 120th and 160th epochs. We set the batch size to 1,024 and accordingly scale the initial learning rate to 0.4. As the standard data augmentation, we randomly cropped images into pixels and randomly flipped them horizontally.
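The learning-rate settings above can be reproduced with a short sketch. The linear scaling against a reference batch size of 256 is an assumption that happens to match the stated numbers (0.1 scaled to 0.4 at a batch size of 1,024); the function names are likewise illustrative.

```python
def scaled_lr(base_lr, batch_size, reference_batch_size=256):
    # Linear scaling rule; the reference batch size of 256 is an assumption.
    return base_lr * batch_size / reference_batch_size


def step_lr(initial_lr, epoch, milestones, gamma=0.1):
    """Divide the learning rate by 10 at every milestone epoch already reached."""
    num_decays = sum(1 for m in milestones if epoch >= m)
    return initial_lr * gamma ** num_decays


initial_lr = scaled_lr(0.1, 1024)
milestones = (60, 120, 160)
schedule = [step_lr(initial_lr, epoch, milestones) for epoch in (0, 60, 120, 160)]
```

Under these assumptions the learning rate stays at 0.4 for the first 60 epochs and is then reduced by a factor of 10 at each milestone.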
For self-supervised learning, we followed the experimental settings of MoCo he2019moco and trained for 100 epochs with a batch size of 512 and a queue size of 65,536. We applied augmentation to images for both key and query networks. For linear classification, we used SGD with a momentum of 0.9 and set the initial learning rate to 30, which is divided by 10 at the 60th and 80th epochs.
B.3 Fine-grained classification
On fine-grained datasets, we trained ResNet-18 for 200 epochs and set the batch size to 64. As the standard data augmentation, we used the same strategy as for ImageNet, including random cropping into pixels.
Appendix C Additional Results
C.1 How Policies Develop during Training
Figure 4 shows how the selection probabilities for each operation develop during training on CIFAR-10 and SVHN. Similar to the fine-grained datasets shown in Figure 2, the policies for CIFAR-10 and SVHN differ clearly from each other. As can be observed, the first and second stages for each dataset evolve differently, which indicates that the stages develop complementarily to each other.
We also present the development of probability parameters and magnitude parameters in Figure 5. Interestingly, magnitude parameters diverge as training proceeds, while probability parameters remain in the range around the initial value. This observation partially agrees with the design of RandAugment, which removes the probability parameter from its hyperparameters. At the same time, these results imply that the optimal magnitudes might be non-global, which disagrees with RandAugment.
C.2 How Warm-up Affects the Performance
Figure 6 shows the relationship between the number of warm-up epochs and the final test error rates. There is no significant difference among settings with between 0 and 30 warm-up epochs.
C.3 Comparison to Policies without Training
On CIFAR-10 with WideResNet-28-2, the initialized policies yield a test error rate of 4.7 %, which is on par with RandAugment, as intended. Therefore, policy training yields a performance gain of 0.1 %.