Online Hyper-parameter Learning for Auto-Augmentation Strategy

05/17/2019 · Chen Lin et al. · SenseTime Corporation, The University of Sydney, The Chinese University of Hong Kong

Data augmentation is critical to the success of modern deep learning techniques. In this paper, we propose Online Hyper-parameter Learning for Auto-Augmentation (OHL-Auto-Aug), an economical solution that learns the augmentation policy distribution along with network training. Unlike previous methods on auto-augmentation that search augmentation strategies in an offline manner, our method formulates the augmentation policy as a parameterized probability distribution, thus allowing its parameters to be optimized jointly with the network parameters. Our proposed OHL-Auto-Aug eliminates the need for re-training and dramatically reduces the cost of the overall search process, while establishing significant accuracy improvements over baseline models. On both CIFAR-10 and ImageNet, our method achieves remarkable search efficiency, 60x faster on CIFAR-10 and 24x faster on ImageNet, while maintaining competitive accuracies.

1 Introduction

Recent years have seen remarkable success of deep neural networks in various computer vision applications. Driven by large amounts of labeled data, the performance of these networks has reached an amazing level. A common issue in learning deep networks is that the training process is prone to overfitting, given the sharp contrast between the huge model capacity and the limited dataset. Data augmentation, which applies randomized modifications to the training samples, has been shown to be an effective way to mitigate this problem. Simple image transforms, including random crops, horizontal flipping, rotation and translation, are utilized to create label-preserving new training data that promote network generalization and accuracy. Recent efforts such as [8, 34] further develop the transform strategies and boost models to state-of-the-art accuracy on the CIFAR-10 and ImageNet datasets.

Figure 1: The search cost of Auto-Augment [6] and our OHL-Auto-Aug. ‘#Iterations’ denotes the total training iterations converted to a common batch size of 1024 (more details in Section 4). Our proposed OHL-Auto-Aug achieves remarkable auto-augmentation search efficiency: it is 60× faster on CIFAR-10 and 24× faster on ImageNet than Auto-Augment [6].

However, it is nontrivial to find a good augmentation policy for a given task, as good policies vary significantly across datasets and tasks. Devising a reasonable policy often requires extensive experience combined with tremendous effort. Naturally, automatically finding a data augmentation strategy for a target dataset becomes an appealing alternative. Cubuk et al. [6] described a procedure to automatically search for augmentation strategies from data. The search process involves sampling thousands of policies, each trained on a child model to measure performance and further update the augmentation policy distribution represented by a controller. Despite its promising empirical performance, the whole process is computationally expensive and time consuming: it takes 15,000 policy samples, each trained for 120 epochs. This immense amount of computation limits its applicability in practical environments; thus [6] subsampled the datasets to 4,000 images for CIFAR-10 and 6,000 images for ImageNet respectively. An inherent cause of this computational bottleneck is the fact that the augmentation strategy is searched in an offline manner.

In this paper, we present a new approach that casts auto-augmentation as a hyper-parameter optimization problem and drastically improves the efficiency of the auto-augmentation search. Specifically, we formulate the augmentation policy as a probability distribution, whose parameters are regarded as hyper-parameters. We further propose a bilevel framework that allows the distribution parameters to be optimized along with network training. In this bilevel setting, the inner objective is the minimization of the vanilla training loss w.r.t. the network parameters, while the outer objective is the maximization of the validation accuracy w.r.t. the augmentation policy distribution parameters using REINFORCE [30]. These two objectives are optimized simultaneously, so that the augmentation distribution parameters are tuned alongside the network parameters in an online manner. As the whole process eliminates the need for re-training, the computational cost is drastically reduced. Our proposed OHL-Auto-Aug remarkably improves the search efficiency while achieving significant accuracy improvements over baseline models. This ensures our method can be easily performed on large-scale datasets, including CIFAR-10 and ImageNet, without any data reduction.

Our main contributions can be summarized as follows: 1) We propose an online hyper-parameter learning approach for auto-augmentation that treats the augmentation policy as a parameterized probability distribution. 2) We introduce a bilevel framework to train the distribution parameters along with the network parameters, thus eliminating the need for re-training during the search. 3) Our proposed OHL-Auto-Aug dramatically improves efficiency while achieving significant performance improvements: compared to the previous state-of-the-art auto-augmentation approach [6], it is 60× faster on CIFAR-10 and 24× faster on ImageNet with comparable accuracies.

2 Related Work

Figure 2: The framework of our OHL-Auto-Aug. We formulate the augmentation policy as a parameterized probability distribution, whose parameters are regarded as hyper-parameters, and propose a bilevel framework that allows the distribution parameters to be optimized simultaneously with the network parameters. In the inner loop, the network parameters are trained using standard SGD with augmentation sampling. In the outer loop, the augmentation policy distribution parameters are trained using REINFORCE gradients over the sampled trajectories. At each outer step, the network parameters with the highest validation accuracy are broadcast to the other trajectory samples.

2.1 Data Augmentation

Data augmentation lies at the heart of all successful applications of deep neural networks. In order to improve network generalization, substantial domain knowledge is leveraged to design suitable data transformations. [16] applied various affine transforms, including horizontal and vertical translation, squeezing, scaling, and horizontal shearing, to improve their model’s accuracy. [15] exploited principal component analysis on ImageNet to randomly adjust colour and intensity values. [31] further utilized a wide range of colour casting, vignetting, rotation, and lens distortion to train Deep Image on ImageNet.

There also exist attempts to learn data augmentation strategies, including Smart Augmentation [17], which proposed a network that automatically generates augmented data by merging two or more samples from the same class, and [29], which used a Bayesian approach to generate data based on the distribution learned from the training set. Generative adversarial networks have also been used for the purpose of generating additional data [22, 35, 1]. The work most closely related to ours is [6], which formulated the auto-augmentation search as a discrete search problem and exploited a reinforcement learning framework to search for policies composed of the possible augmentation operations. Our proposed method optimizes a distribution over the discrete augmentation operations, but is much more computationally economical, benefiting from the proposed hyper-parameter optimization formulation.

2.2 Hyper-parameter Optimization

Hyper-parameter optimization was historically accomplished by selecting several values for each hyper-parameter, computing the Cartesian product of all these values, and then running a full training for each combination. [2] showed that performing a random search is more efficient than a grid search, as it avoids excessive training with hyper-parameters set to poor values. [3] later refined this random search process by using quasi-random search. [27] further utilized Gaussian processes to model the validation error as a function of the hyper-parameters, where each training run further refines this function to minimize the number of hyper-parameter settings to try. All these methods are “black-box” methods, since they assume no knowledge about the internal training procedure when optimizing hyper-parameters; in particular, they do not have access to the gradient of the validation loss w.r.t. the hyper-parameters. To address this issue, [21, 9] explicitly utilized the parameter learning process to obtain such a gradient by proposing a Lagrangian formulation associated with the parameter optimization dynamics. [21] used a reverse-mode differentiation approach, where the dynamics correspond to stochastic gradient descent with momentum. The theoretical basis for the hyper-parameter optimization in our auto-augmentation is [9], where the hyper-parameters are changed after a certain number of parameter updates in a forward manner. This forward-mode procedure is suitable for the auto-augmentation search process and drastically reduces its cost.

2.3 Auto Machine Learning and Neural Architecture Search

Auto Machine Learning (AutoML) aims to free human practitioners and researchers from menial, trial-and-error design tasks. Many recent advances focus on automatically searching for neural network architectures. One of the first attempts [37, 36] utilized reinforcement learning to train a controller representing a policy that generates a sequence of symbols describing the network architecture; the generation of a neural architecture is formulated as the controller’s action, whose space is identical to the architecture search space. An alternative to reinforcement learning is evolutionary algorithms, which evolve the topology of architectures by mutating the best architectures found so far [24, 32, 25]. There also exist surrogate-model-based search methods [18] that utilize sequential model-based optimization for parameter optimization. However, all the above methods require massive computation during the search, typically thousands of GPU days. Recent efforts such as [19, 20, 23] utilize several techniques to reduce the search cost: [19] introduced a real-valued architecture parameter jointly trained with the weight parameters, [20] embedded architectures into a latent space and performed optimization before decoding, and [23] utilized architecture sharing among sampled models instead of training each of them individually. Several methods attempt to automatically search for architectures with fast inference speed, either explicitly taking the inference latency as a constraint [28] or implicitly encoding the topology information [11]. Note that [6] utilized a controller similar to that of [37], whose training is time consuming, to guide the augmentation policy search; our auto-augmentation strategy is much more efficient and economical in comparison.

3 Approach

In this section, we first present the formulation of hyper-parameter optimization for the auto-augmentation strategy. We then introduce the framework that solves this optimization in an online manner. Finally, we detail the search space and the training pipeline. The overall pipeline of our OHL-Auto-Aug is shown in Figure 2.

3.1 Problem Formulation

The purpose of an auto-augmentation strategy is to automatically find a set of augmentation operations performed on the training data that improves model generalization on the target dataset. In this paper, we formulate the augmentation strategy as $p_{\theta}$, a probability distribution over augmentation operations parameterized by $\theta$. Suppose we have $K$ candidate data augmentation operations $\{o_1, \dots, o_K\}$, each of which is selected with probability $p(o_k; \theta)$. Given a network model parameterized by $w$, a train dataset $\mathcal{D}_{train}$, and a validation dataset $\mathcal{D}_{val}$, the purpose of data augmentation is to maximize the validation accuracy with respect to $\theta$, where the weights associated with the model are obtained by minimizing the training loss:

$\max_{\theta}\; \mathrm{Acc}\left(w^{*}(\theta), \mathcal{D}_{val}\right) \quad \text{s.t.} \quad w^{*}(\theta) = \arg\min_{w}\; \mathbb{E}_{o \sim p_{\theta}}\left[\mathcal{L}\left(o, w, \mathcal{D}_{train}\right)\right]$   (1)

where $\mathcal{L}$ denotes the loss function and the input training data are augmented by the sampled candidate operations.

This is a bilevel optimization problem [5], where the augmentation distribution parameters $\theta$ are regarded as hyper-parameters. At the outer level we look for an augmentation distribution parameter $\theta^{*}$ under which we obtain the best-performing model $w^{*}(\theta^{*})$, where $w^{*}(\theta)$ is the solution of the inner-level problem. Previous state-of-the-art methods resort to sampling augmentation strategies with a surrogate model and then solving the inner optimization problem exactly for each sampled strategy, which raises the time-consumption issue. To tackle this problem, we propose to train the augmentation distribution parameters alongside the network weights, getting rid of training thousands of networks from scratch.

3.2 Online Optimization framework

Inspired by the forward hyper-gradient proposed in [9], we propose an online optimization framework that trains the hyper-parameters and the network weights simultaneously. To clarify notation, we use $t$ to denote the update steps of the outer loop and $i$ to denote the update iterations of the inner loop; between two adjacent outer optimization updates, the inner loop updates for $I$ steps, each training the network weights on a batch of $B$ images. In this section, we focus on the process of the $t$-th period of the outer loop.

For each image, an augmentation operation is sampled according to the current augmentation policy distribution $p_{\theta_t}$. We use the trajectory $\mathcal{T}_t$ to denote all the augmentation operations sampled in the $t$-th period. For the $I$ iterations of the inner loop, we have

$w_t^{i+1} = w_t^{i} - \eta_w \nabla_{w} \mathcal{L}\left(o_t^{i}, w_t^{i}, \mathcal{B}_t^{i}\right), \quad i = 0, \dots, I-1$   (2)

where $\eta_w$ is the learning rate for the network parameters and $\mathcal{L}$ denotes the batched loss, which takes three inputs: the augmentation operations $o_t^{i}$ applied to the $i$-th batch, the current inner-loop parameters $w_t^{i}$, and a mini-batch of data $\mathcal{B}_t^{i}$, and returns the average loss on the $i$-th batch.
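To make the inner loop concrete, the following is a minimal PyTorch-style sketch of one inner period of Equation 2; the helper names (`ops`, `policy_probs`, the loader interface) are illustrative assumptions rather than the paper's released code, and `ops` is assumed to be a list of callables acting on image tensors.

```python
import torch

def inner_loop(model, optimizer, criterion, loader, policy_probs, ops, I):
    """Run I SGD steps (Eq. 2), sampling one augmentation operation per
    image from the current policy distribution (policy_probs: a 1-D tensor
    of probabilities over operations). Returns the sampled trajectory."""
    trajectory = []
    data_iter = iter(loader)
    for _ in range(I):
        images, labels = next(data_iter)          # one mini-batch B_t^i
        # Sample an operation index for every image in the batch.
        idx = torch.multinomial(policy_probs, images.size(0), replacement=True)
        augmented = torch.stack(
            [ops[k](img) for k, img in zip(idx.tolist(), images)])
        trajectory.extend(idx.tolist())           # record sampled operations
        loss = criterion(model(augmented), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return trajectory
```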

As illustrated in Equation 2, $w_t^{I}$ is affected by the trajectory $\mathcal{T}_t$. Since $w_t^{I}$ is only influenced by the operations in the trajectory $\mathcal{T}_t$, it should be regarded as a random variable with probability $p(\mathcal{T}_t; \theta_t)$, where $p(\mathcal{T}_t; \theta_t)$ denotes the probability of $\mathcal{T}_t$ being sampled under $\theta_t$, calculated as $p(\mathcal{T}_t; \theta_t) = \prod_{o \in \mathcal{T}_t} p(o; \theta_t)$. We update the augmentation distribution parameters by taking one step of optimization for the outer objective,

$\theta_{t+1} = \theta_t + \eta_{\theta} \nabla_{\theta}\, \mathbb{E}_{\mathcal{T}_t \sim p(\cdot;\,\theta_t)}\left[\mathrm{Acc}\left(w_t^{I}(\mathcal{T}_t), \mathcal{D}_{val}\right)\right]$   (3)

where $\eta_{\theta}$ is the learning rate for the augmentation distribution parameters. As the outer objective is a maximization, the update is ‘gradient ascent’ instead of ‘gradient descent’. The above process continues iteratively until the network model converges.

The motivation of this online framework is that we approximate $w^{*}(\theta)$ using only $I$ steps of training, instead of completely solving the inner optimization in Equation 1 until convergence. At the outer level, we look for the augmentation distribution that can train a network to a high validation accuracy given the $I$-step-optimized network weights. A similar approach has been exploited in meta-learning for neural architecture search [19]. Although convergence is not guaranteed, in practice the optimization is able to reach a fixed point, as our experiments demonstrate, as well as those in [19].

Calculating the gradient of the validation accuracy with respect to $\theta$ in Equation 3 is a tricky problem, mainly due to two facts: 1) the validation accuracy is non-differentiable with respect to $\theta$; 2) analytically calculating the expectation over trajectories is intractable. To address these two issues, we utilize REINFORCE [30] to approximate the gradient in Equation 3 by Monte-Carlo sampling. This is achieved by training $N$ networks in parallel in the inner loop, which are treated as $N$ sampled trajectories. We calculate their average gradient based on the REINFORCE algorithm. We have

$\nabla_{\theta}\, \mathbb{E}\left[\mathrm{Acc}\right] \approx \frac{1}{N} \sum_{n=1}^{N} \mathrm{Acc}\left(w_t^{I}(\mathcal{T}_t^{n}), \mathcal{D}_{val}\right) \nabla_{\theta} \log p\left(\mathcal{T}_t^{n}; \theta_t\right)$   (4)

where $\mathcal{T}_t^{n}$ is the $n$-th trajectory. By substituting $p(\mathcal{T}_t^{n}; \theta_t) = \prod_{o \in \mathcal{T}_t^{n}} p(o; \theta_t)$ into Equation 4,

$\nabla_{\theta}\, \mathbb{E}\left[\mathrm{Acc}\right] \approx \frac{1}{N} \sum_{n=1}^{N} \mathrm{Acc}\left(w_t^{I}(\mathcal{T}_t^{n}), \mathcal{D}_{val}\right) \sum_{o \in \mathcal{T}_t^{n}} \nabla_{\theta} \log p\left(o; \theta_t\right)$   (5)

In practice, we also utilize the baseline trick [26] to reduce the variance of the gradient estimation,

$\nabla_{\theta}\, \mathbb{E}\left[\mathrm{Acc}\right] \approx \frac{1}{N} \sum_{n=1}^{N} \mathrm{norm}\left(\mathrm{Acc}\left(w_t^{I}(\mathcal{T}_t^{n}), \mathcal{D}_{val}\right)\right) \sum_{o \in \mathcal{T}_t^{n}} \nabla_{\theta} \log p\left(o; \theta_t\right)$   (6)

where $\mathrm{norm}(\cdot)$ denotes the function that normalizes the accuracies returned by the sampled trajectories to zero mean and unit variance.
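As an illustration, here is a minimal sketch of the outer REINFORCE update of Equations 4–6; `theta` is assumed to be a 1-D tensor of logits over all operations (e.g. the flattened matrix of Equation 7) with `requires_grad=True`, and the variable names are our own.

```python
import torch

def outer_update(theta, theta_opt, trajectories, accuracies):
    """One REINFORCE step (Eq. 6). trajectories: N lists of sampled
    operation indices; accuracies: validation accuracy of each of the
    N trained copies; theta_opt: e.g. torch.optim.Adam([theta])."""
    acc = torch.tensor(accuracies)
    # Baseline trick: normalize rewards to zero mean and unit variance.
    advantage = (acc - acc.mean()) / (acc.std() + 1e-8)

    log_probs = torch.log_softmax(theta, dim=-1)
    loss = torch.zeros(())
    for adv, traj in zip(advantage, trajectories):
        # Negative sign: gradient *descent* on this loss performs
        # gradient *ascent* on the expected validation accuracy (Eq. 3).
        loss = loss - adv * log_probs[torch.tensor(traj)].sum()
    loss = loss / len(trajectories)

    theta_opt.zero_grad()
    loss.backward()
    theta_opt.step()
```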

Since we have $N$ trajectories, each of which outputs a $w_t^{I}$ from Equation 2, we need to synchronize these to obtain the same starting point for the next training period. We simply select the $w_t^{I}$ with the best validation accuracy as the final output of Equation 2.

3.3 Search Space and Training Pipeline

In this paper, we formulate the auto-augmentation strategy as distribution optimization. For each input training image, we sample an augmentation operation from the search space and apply it to the image. Each augmentation operation consists of two augmentation elements. The candidate augmentation elements are listed in Table 1.

Table 1: List of Candidate Augmentation Elements. (Each row of the table gives an element’s name and its range of magnitude; elements with no tunable magnitude have their range listed as ‘None’.)

The total number of elements is $K$. A single operation is defined as the combination of two elements, so the search space contains about $K^2$ operations, some of which may have the same effect on the data. In our work, the distribution over augmentation operations is formulated as a joint probability of the two elements,

$p\left(o_{(j,k)}; \theta\right) = \dfrac{\exp(\theta_{j,k})}{\sum_{j',k'} \exp(\theta_{j',k'})}$   (7)

where $o_{(j,k)}$ denotes the operation combining elements $e_j$ and $e_k$, and $\theta \in \mathbb{R}^{K \times K}$.
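Under this parameterization, sampling an operation amounts to drawing one entry of the softmax-normalized $K \times K$ matrix. A minimal sketch follows, with K = 36 as an illustrative value rather than necessarily the paper's exact element count:

```python
import torch

K = 36                                           # illustrative element count
theta = torch.zeros(K, K, requires_grad=True)    # uniform initial distribution

def sample_operation(theta):
    """Sample one operation, i.e. a pair of elements (j, k), from Eq. 7."""
    probs = torch.softmax(theta.detach().flatten(), dim=0)
    flat = torch.multinomial(probs, num_samples=1).item()
    return divmod(flat, theta.size(1))           # (first element, second element)
```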

The whole training pipeline is described in Algorithm 1.

4 Experiments

4.1 Implementation Details

  Initialize θ_0; initialize the same w_0 for all N models;
  while not converged do
     for all n such that 1 ≤ n ≤ N do
        for all i such that 0 ≤ i < I do
           Compute w_t^{i+1, n} by Equation 2;
        end for
     end for
     Fix {w_t^{I, n}}, calculate the gradient of θ_t by Equation 6;
     Update θ_{t+1} according to Equation 3;
     Select w_t^{I, n*} from {w_t^{I, n}} with the best validation accuracy;
     Broadcast w_t^{I, n*} to all the N models;
  end while
  return θ, w;
Algorithm 1 Online Optimization for Auto-Augmentation Strategy
Model | Baseline | Cutout [8] | Auto-Augment [6] | OHL-Auto-Aug | Error Reduction (Baseline/Cutout)
ResNet-18 [12] | 4.66 | 3.62 | 3.46 | 3.29 | 1.37/0.33
WideResNet-28-10 [33] | 3.87 | 3.08 | 2.68 | 2.61 | 1.26/0.47
DualPathNet-92 [4] | 4.55 | 3.71 | 3.16 | 2.75 | 1.8/0.96
AmoebaNet-B(6, 128) [6] | 2.98 (3.4) | 2.13 (2.9) | 1.75 | 1.89 | 1.09/0.24 (1.51/1.01)
Table 2: Test error rates (%) on CIFAR-10. The numbers in brackets refer to the results of our implementation. We compare our OHL-Auto-Aug with standard augmentation (Baseline), standard augmentation with Cutout (Cutout), and the augmentation strategy discovered by [6] (Auto-Augment). Compared to Baseline, our OHL-Auto-Aug achieves about a 30% reduction in error rate.

We perform our OHL-Auto-Aug on two classification datasets: CIFAR-10 and ImageNet. Here we describe the experimental details that we used to search the augmentation policies.

CIFAR-10 The CIFAR-10 dataset [14] consists of natural images with resolution 32×32. There are 60,000 images in 10 classes, with 6,000 images per class. The train and test sets contain 50,000 and 10,000 images respectively. For our OHL-Auto-Aug on CIFAR-10, we use a validation set of 5,000 images, randomly split from the training set, to calculate the validation accuracy during the training of the augmentation distribution parameters.

In the training phase, the basic pre-processing follows the convention for state-of-the-art CIFAR-10 models: standardizing the data, random horizontal flips with 50% probability, zero-padding and random crops, and finally Cutout [8] with 16×16 pixels. Our OHL-Auto-Aug strategy is applied on top of the basic pre-processing: for each training input, we first perform the basic pre-processing, then our OHL-Auto-Aug strategy, and finally Cutout.
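For illustration, a torchvision-style sketch of this ordering is given below. The `SampledAugment` wrapper stands in for our learned policy sampler, `Cutout` is a common open-source variant, and the placeholder operation list is ours; all are assumptions rather than the paper's released code (note that in torchvision the standardization must follow `ToTensor`).

```python
import torch
import torchvision.transforms as T

class Cutout:
    """Zero out one random size x size patch of a CxHxW tensor."""
    def __init__(self, size=16):
        self.size = size
    def __call__(self, img):
        _, h, w = img.shape
        y = torch.randint(h, (1,)).item()
        x = torch.randint(w, (1,)).item()
        img[:, max(0, y - self.size // 2):y + self.size // 2,
               max(0, x - self.size // 2):x + self.size // 2] = 0.0
        return img

class SampledAugment:
    """Stand-in for the learned policy: sample one operation per image."""
    def __init__(self, probs, ops):
        self.probs, self.ops = probs, ops
    def __call__(self, img):
        k = torch.multinomial(self.probs, 1).item()
        return self.ops[k](img)

ops = [lambda im: im]                            # placeholder operation list
probs = torch.ones(len(ops)) / len(ops)

train_transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),               # 50% horizontal flips
    T.RandomCrop(32, padding=4),                 # zero-padding + random crop
    SampledAugment(probs, ops),                  # sampled OHL-Auto-Aug operation
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465),        # standardize the data
                (0.2470, 0.2435, 0.2616)),
    Cutout(16),                                  # finally Cutout, 16x16 pixels
])
```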

All the networks are trained from scratch on CIFAR-10. For the training of the network parameters, we use a mini-batch size of 256 and a standard SGD optimizer. The momentum rate is set to 0.9 and the weight decay to 0.0005. A cosine learning rate scheme is utilized with an initial learning rate of 0.2, and the total number of training epochs is set to 300. For AmoebaNet-B [24], there are several hyper-parameter modifications: the initial learning rate is set to 0.024, and we use an additional learning rate warmup stage [10] before training, in which the learning rate is linearly increased from 0.005 to the initial learning rate of 0.024 over 40 epochs.

ImageNet The ImageNet dataset [7] contains 1.28 million training images and 50,000 validation images from 1,000 classes. As in our experiments on CIFAR-10, we set aside an additional validation set of 50,000 images split from the training dataset.

For basic data pre-processing, we follow the standard practice and perform random-size crops to 224×224 and random horizontal flipping. The usual mean channel subtraction is adopted to normalize the input images for both training and testing. Our OHL-Auto-Aug operations are performed after this basic data pre-processing.

For the training on ImageNet, we use synchronous SGD with a Nesterov momentum of 0.9 and a weight decay of 1e-4. The mini-batch size is set to 2048 and the base learning rate to 0.8. A cosine learning rate scheme is utilized with a warmup stage of 2 epochs, and the total number of training epochs is 150.

Augmentation Distribution Parameters For the training of the augmentation policy distribution parameters, we use the Adam optimizer. On the CIFAR-10 dataset, the number of trajectory samples $N$ is set to 8; this number is reduced to 4 for ImageNet due to the large computation cost. Having finished the outer update of the distribution parameters, we broadcast the network parameters with the best validation accuracy to the other sampled trajectories using multiprocessing utilities.

4.2 Results

CIFAR-10 Results: On CIFAR-10, we perform our OHL-Auto-Aug on four popular network architectures: ResNet [12], WideResNet [33], DualPathNet [4] and AmoebaNet-B [24]. For ResNet, an 18-layer ResNet with 11.17M parameters is used. For WideResNet, we use a 28-layer WideResNet with a widening factor of 10, which has 36.48M parameters. For Dual Path Network, we choose DualPathNet-92 with 34.23M parameters for CIFAR-10 classification. AmoebaNet-B is an architecture found by the regularized evolution approach proposed in [24]; we use the AmoebaNet-B (6, 128) setting, which has 33.4M parameters, for a fair comparison with [6]. WideResNet-28-10, DualPathNet-92 and AmoebaNet-B are relatively heavy architectures for CIFAR-10 classification.

We report the test set error rates of our OHL-Auto-Aug for the different network architectures in Table 2. All the experiments follow the same settings described in Section 4.1. We compare the empirical results with standard augmentation (Baseline), standard augmentation with Cutout (Cutout), and the augmentation strategy proposed in [6] (Auto-Augment). For ResNet-18 and DualPathNet-92, our OHL-Auto-Aug achieves a significant improvement over both Cutout and Auto-Augment. Compared to the Baseline model, the error rate is reduced by about 1.37% for ResNet-18, about 1.8% for DualPathNet-92 and about 1.26% for WideResNet-28-10. For AmoebaNet-B, although we have not reproduced the accuracy reported in [6], our OHL-Auto-Aug still achieves a 1.09% error rate reduction compared to the Baseline model; compared to the baseline we implemented, our OHL-Auto-Aug achieves an error rate of 1.89%, which is about a 1.51% error rate reduction. For all the models, our OHL-Auto-Aug exhibits a significant error rate drop compared with Cutout.

We also find that our OHL-Auto-Aug outperforms Auto-Augment on all the networks except AmoebaNet-B. We believe the main reason is that we could not train the Baseline AmoebaNet-B to an accuracy comparable with the original paper. By applying our OHL-Auto-Aug, the accuracy gap can be narrowed from 0.42% (comparing the Baseline of [6] with our implementation) to 0.14% (comparing our OHL-Auto-Aug with Auto-Augment), which demonstrates the effectiveness of our approach.

ImageNet Results: On ImageNet dataset, we perform our method on two networks: ResNet-50 [12] and SE-ResNeXt101 [13] to show the effectiveness of our augmentation strategy search. ResNet-50 contains 25.58M parameters and SE-ResNeXt101 contains 48.96M parameters. We regard ResNet-50 as a medium architecture and SE-ResNeXt101 as a heavy architecture to illustrate the generalization of our OHL-Auto-Aug.

Method | ResNet-50 [12] | SE-ResNeXt-101 [13]
Baseline | 24.70/7.8 | 20.70/5.01
mixup [34] | 23.3/6.6 | -
Auto-Augment [6] | 22.37/6.18 | 20.03/5.23 (our impl.)
OHL-Auto-Aug | 21.07/5.68 | 19.30/4.57
Table 3: Top-1/Top-5 error rates (%) on ImageNet. We compare our OHL-Auto-Aug with standard augmentation (Baseline), standard augmentation with mixup [34] (mixup), and the augmentation strategy discovered by [6] (Auto-Augment). For both ResNet-50 and SE-ResNeXt-101, our OHL-Auto-Aug improves the performance significantly.
Dataset | Auto-Augment [6]: #Iterations / Usage of Dataset (%) | OHL-Auto-Aug: #Iterations / Usage of Dataset (%)
CIFAR-10 | ~7.0×10^6 / 8 | ~1.2×10^5 / 100
ImageNet | ~1.8×10^7 / ~0.5 | 7.5×10^5 / 100
Table 4: Search cost comparison between Auto-Augment [6] and our OHL-Auto-Aug on CIFAR-10 and ImageNet. ‘#Iterations’ denotes the total training iterations converted to a common batch size of 1024 (see the calculation in the text); ‘Usage of Dataset’ is the fraction of the full training set used during the search. Our OHL-Auto-Aug searches on the whole dataset and needs no retraining.

In Table 3, we show the Top-1 and Top-5 error rates on the validation set. We notice that there exists another augmentation strategy, mixup [34], which trains the network on convex combinations of pairs of examples and their labels and achieves performance improvements on ImageNet; we report the results of mixup together with Auto-Augment for comparison. All the experiments are conducted following the same settings described in Section 4.1. Since the original paper [6] does not perform experiments on SE-ResNeXt101, we train it with the searched augmentation policy provided in [6]. As can be seen from the table, OHL-Auto-Aug achieves state-of-the-art accuracy on both models. For the medium model ResNet-50, our OHL-Auto-Aug improves the Top-1 accuracy by 3.63 points over the Baseline and 1.30 points over Auto-Augment, which is a significant improvement. For the heavy model SE-ResNeXt101, our OHL-Auto-Aug still boosts the performance by 1.4% even over a very strong baseline. Both experiments demonstrate the effectiveness of our OHL-Auto-Aug.

Search Cost: The principal motivation of our OHL-Auto-Aug is to improve the efficiency of the auto-augmentation search. To demonstrate this, we report the search cost of our method together with that of Auto-Augment in Table 4. For a fair comparison, we compute the total training iterations converted to a common batch size of 1024, denoted as ‘#Iterations’. For example, Auto-Augment performs its search on CIFAR-10 by sampling 15,000 child models, each trained on a ‘reduced CIFAR-10’ of 4,000 randomly chosen images for 120 epochs, so its #Iterations on CIFAR-10 can be calculated as 15,000 × 120 × 4,000/1024 ≈ 7.0×10^6. Similarly, the child models Auto-Augment samples on ImageNet are each trained on a ‘reduced ImageNet’ of 6,000 images for 200 epochs, which corresponds to about 1.8×10^7 iterations. Our OHL-Auto-Aug trains on the whole CIFAR-10 for 300 epochs with 8 trajectory samples, which equals 300 × 8 × 50,000/1024 ≈ 1.2×10^5 iterations; for ImageNet, the #Iterations is 150 × 4 × 1.28M/1024 ≈ 7.5×10^5.
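These conversions are simple arithmetic; the snippet below reproduces the figures quoted above from the quantities stated in the text.

```python
def iterations(num_models: int, epochs: int, num_images: int,
               batch: int = 1024) -> float:
    """Total training iterations across all sampled models or trajectories,
    converted to a common batch size."""
    return num_models * epochs * num_images / batch

auto_aug_cifar = iterations(15_000, 120, 4_000)   # ~7.0e6
ohl_cifar      = iterations(8, 300, 50_000)       # ~1.2e5
ohl_imagenet   = iterations(4, 150, 1_280_000)    # ~7.5e5

print(f"speed-up on CIFAR-10: {auto_aug_cifar / ohl_cifar:.0f}x")  # ~60x
```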

As illustrated in Table 4, even though it trains on the whole dataset, our OHL-Auto-Aug is 60× faster on CIFAR-10 and 24× faster on ImageNet than Auto-Augment. Instead of training on a reduced dataset, this high efficiency enables our OHL-Auto-Aug to search on the whole dataset, which helps guarantee better performance, and ensures that our algorithm can easily be applied to large datasets such as ImageNet without any bias on the input data. Our OHL-Auto-Aug also eliminates the need for retraining from scratch, further saving computational resources.

4.3 Analysis

In this section, we present several analyses to illustrate the effectiveness of our proposed OHL-Auto-Aug.

Figure 3: Comparison of different augmentation distribution learning rates $\eta_\theta$. All the experiments are conducted with ResNet-18 on CIFAR-10 following the same settings described in Section 4.1, except for $\eta_\theta$. As can be observed, too large an $\eta_\theta$ makes the distribution parameters difficult to converge, while too small an $\eta_\theta$ slows down the convergence process; both harm the final performance. The choice of $\eta_\theta$ is a trade-off between convergence speed and performance.

Comparisons with different augmentation distribution learning rates: We run our OHL-Auto-Aug with different augmentation distribution learning rates $\eta_\theta$. All the experiments use ResNet-18 on CIFAR-10 following the same settings described in Section 4.1, except for $\eta_\theta$. The final accuracies are illustrated in Figure 3. Note that the $\eta_\theta$ used in all our other experiments is among the values compared.

Comparing the result of $\eta_\theta = 0$ (i.e., keeping the distribution fixed and uniformly sampling from the whole search space) with the other results, we find that updating the augmentation distribution produces better-performing networks than uniform sampling. As can be observed, too large an $\eta_\theta$ makes the distribution parameters difficult to converge, while too small an $\eta_\theta$ slows down the convergence process. The choice of $\eta_\theta$ is a trade-off between convergence speed and performance.

Figure 4: Comparison of different numbers of trajectory samples $N$. All the experiments use ResNet-18 on CIFAR-10 following the same settings described in Section 4.1, except for $N$. As can be observed, increasing the number of trajectories from 1 to 8 steadily improves the performance of the network, while increasing it further yields only a minor accuracy improvement. We select $N = 8$ on CIFAR-10 as a trade-off between computation cost and performance.

Comparisons with different numbers of trajectory samples: We further analyze the sensitivity to the number of trajectory samples $N$. We run experiments with ResNet-18 on CIFAR-10 with different numbers of trajectories. All the experiments follow the same settings described in Section 4.1, except for $N$. The final accuracies are illustrated in Figure 4.

In principle, a larger number of trajectories benefits the training of the augmentation distribution parameters, due to the more accurate gradient calculated by Equation 4. Meanwhile, a larger number of trajectories increases the memory requirement and the computation cost. As observed in Figure 4, increasing the number of trajectories from 1 to 8 steadily improves the performance of the network, while increasing it further yields only a minor accuracy improvement. We select $N = 8$ on CIFAR-10 as a trade-off between computation cost and performance.

Analysis of augmentation distribution parameters:

We visualize the augmentation distribution parameters over the whole training stage. We compute the marginal distribution of the first element in each augmentation operation, obtained by summing up each row vector of $\theta$ followed by normalization. The results are illustrated in Figure 5(a) and Figure 5(b) for CIFAR-10 and ImageNet respectively. On CIFAR-10, we choose ResNet-18 and show the results every 5 epochs; on ImageNet we choose ResNet-50 and show the results every 2 epochs. For both settings, we find that the augmentation distribution parameters converge during training: some augmentation operations are discarded, while the probabilities of some others increase. We also notice that the distributions of augmentation operations differ across datasets. For instance, on CIFAR-10 the 36th operation, ‘Color Invert’, is discarded, while on ImageNet this operation is preserved with high probability. This further demonstrates that our OHL-Auto-Aug can easily be applied to different datasets to search for different augmentation policies.
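A small sketch of that marginal computation, under the $K \times K$ softmax parameterization assumed in Equation 7:

```python
import torch

def first_element_marginal(theta):
    """Marginal probability of the first element: softmax-normalize the
    K x K logit matrix, then sum each row (the 'summing up each row
    vector ... following normalization' step described in the text)."""
    joint = torch.softmax(theta.flatten(), dim=0).view_as(theta)
    return joint.sum(dim=1)                      # shape (K,)
```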

(a) Visualization of augmentation distribution parameters of ResNet-18 training on CIFAR-10.
(b) Visualization of augmentation distribution parameters of ResNet-50 training on ImageNet.
Figure 5: Analysis of augmentation distribution parameters. For both settings, we find that the augmentation distribution parameters converge during training: some augmentation operations are discarded, while the probabilities of some others increase. We also notice that the distributions of augmentation operations differ across datasets.

5 Conclusion

In this paper, we have proposed an online hyper-parameter learning method for auto-augmentation, which formulates the augmentation policy as a parameterized probability distribution. Benefiting from the proposed bilevel framework, our OHL-Auto-Aug is able to optimize the distribution parameters jointly with the network parameters in an online manner. Experimental results illustrate that our proposed OHL-Auto-Aug achieves remarkable auto-augmentation search efficiency while establishing significant accuracy improvements over baseline models on large-scale datasets, including CIFAR-10 and ImageNet, without any data reduction.

References