Data augmentation is a technique for creating synthetic data from existing data with controlled perturbation. For example, in the context of image recognition, data augmentation refers to applying image operations, e.g., cropping and flipping, to input images to generate augmented images, which have the same labels as their originals. In practice, data augmentation has been widely used to improve the generalization of deep learning models and is thought to encourage model insensitivity towards data perturbation [krizhevsky2012imagenet, he2016deep, huang2017densely]. Although data augmentation works well in practice, designing data augmentation strategies requires human expertise, and a strategy customized for one dataset often works poorly for another. Recent efforts have been dedicated to automating the design of augmentation strategies. It has been shown that training models with a learned data augmentation policy may significantly improve test accuracy [fastautoaug, advaa, randaug, pba, fasterautoaug].
However, we do not yet have a good theory to explain how data augmentation improves model generalization. Currently, the most well-known hypothesis is that data augmentation improves generalization by imposing a regularization effect: it regularizes models to give consistent outputs within the vicinity of the original data, where the vicinity of the original data is defined as the space that contains all augmented data obtained by applying operations that do not drastically alter image features [mixup, kernelda, wu2020generalization]. Meanwhile, previous automated data augmentation works claim that the performance gain from applying learned augmentation policies arises from the increase in diversity [autoaug, randaug, pba]. However, the “diversity” in these claims remains a hand-waving concept: it is evaluated by the number of distinct sub-policies utilized during training or judged visually from a human perspective. Without formally defining diversity and its relation to regularization, augmentation strategies can only be evaluated indirectly by evaluating the models trained on the augmented data, which may cost thousands of GPU hours [autoaug]. This motivates us to explore the possibility of using an explicit diversity measure to quantify the regularization effect that the augmented data may have on the model. With such a measure, we can directly maximize the diversity of the augmented data to strengthen the regularization effect and thereby improve the generalization of the model.
To bridge the gap, in this paper we propose a new diversity measure, called Variance Diversity, to quantify the diversity of augmented data. We show that the regularization effect of data augmentation is guaranteed by Variance Diversity. Our measure is motivated by the recent theoretical result that, after training the model on augmented data, the loss implicitly contains a data-driven regularization term that is proportional to the variance of probability vectors, where the probability vectors are the outputs of models trained with the augmented data [kernelda]. Specifically, we measure the diversity of a set of augmented data by the variance of their corresponding probability vectors. Based on the measure, we propose a plug-in automated data augmentation framework named DivAug, which can be plugged into the standard training process without requiring a separate search process. As illustrated in Figure 1, the framework has two stages: the expanding stage, where we randomly generate several augmented data for each original input, and the selection stage, where we sub-sample a subset of the augmented data and feed them to the model for training. Specifically, at the selection stage, for each image, we sub-sample a subset of augmented images with high diversity by applying the k-means++ seeding algorithm [kppseeding], where augmented data whose probability vector is far away from that of the original data is sampled with high probability. Following the mathematical derivation, the regularization effect increases with the diversity of the augmented data. Consequently, the stronger regularization effect can lead to better model generalization, which is observed as improved model performance. Our main contributions can be summarized as follows:
We propose a new measure for quantifying the diversity of augmented data. We validate in our experiments that the relative gain in the accuracy of a model after applying data augmentation is highly correlated to our proposed measure.
Based on the proposed measure, we design a sampling-based framework to explicitly maximize diversity. Without requiring a separate search process, the performance gain from DivAug is comparable to the state-of-the-art method with better efficiency.
Our method is unsupervised and can be plugged into the standard training process. We show that our method can further boost the performance of a semi-supervised learning algorithm, making it highly applicable to real-world problems where labeled data is scarce.
2 Related Work
Recently, AutoAugment (AA) [autoaug] has been proposed to automatically search for augmentation policies from a dataset. Specifically, AutoAugment utilizes a recurrent neural network (RNN) as the controller to find the best policy in a separate search process on a small proxy task (smaller model size and dataset size). Once the search process is over, the learned policies are transferred to the target task and fixed during the whole training process. These learned augmentation policies significantly improve the generalization of deep models [autoaug]. However, its search time is huge: it costs roughly 5,000 GPU hours to search for the best policies on a smaller dataset they call “reduced CIFAR-10”, which consists of 4,000 randomly chosen images.
|Method||non-fixed||without the separate search process||unsupervised||without proxy tasks|
|Fast AA [fastautoaug]||✗||✗||✗||✓|
|Adv. AA [advaa]||✓||✗||✗||✓|
|DivAug (this paper)||✓||✓||✓||✓|
Most of the following works adopted the AutoAugment search space and formulation with improved optimization algorithms [advaa, fastautoaug, pba, fasterautoaug]. Population-based augmentation (PBA) [pba] replaces the fixed policy with a dynamic schedule of policies evolving along with the training process. Fast AutoAugment (Fast AA) [fastautoaug] proposes a “density match” method to accelerate the search process and treats the augmented data as missing points in the training set. RandAugment (RA) [randaug] eliminates the separate search process by randomly applying augmentation sub-policies, which best resembles our work. Adversarial AutoAugment (Adv. AA) [advaa] achieves state-of-the-art results by utilizing an RNN controller to learn policies that could generate augmented data with higher loss. As shown in Table 1, we outline a general taxonomy of automated data augmentation methods, characterized by four core properties. Non-fixed: augmentation policies are dynamically changed along with the training process; without the separate search process: methods do not require a separate search process; unsupervised: methods do not require label information to find the best policy; and without proxy tasks: methods perform the search directly on target tasks.
In this section, we introduce the design and implementation of DivAug. First, we describe our search space in Section 3.1. Then we mathematically show that after employing augmented data, the training loss implicitly contains a data-driven regularization term that is proportional to the variance of probability vectors (Section 3.2). Subsequently, we propose to measure the diversity of a set of augmented data by the variance of their corresponding probability vectors. Based on the measure, we derive a sampling-based automated data augmentation method to explicitly maximize the diversity of augmented data (Section 3.3).
3.1 Search Space
We adopt the basic structure of the well-designed search space introduced in AutoAugment [autoaug]. There are 16 image operations in total in our search space: Sharpness, ShearX/Y, TranslateX/Y, Rotate, AutoContrast, Invert, Equalize, Solarize, Posterize, Color, Brightness, Cutout [cutout], Sample Pairing [samplepairing], and Contrast. Let $\mathbb{O}$ be the set of all available operations. Each operation $O \in \mathbb{O}$ has two parameters: $p$, the probability of applying the operation; and $\lambda$, the magnitude of the operation. To avoid confusion in notation, we use $\bar{O}(\cdot; \lambda)$ to represent the image transformation specified by $O$ with magnitude $\lambda$. Given an image $x$, the operation $O(x; p, \lambda)$ is defined as:
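Written out, the definition amounts to a two-case rule: the transformation is applied with the operation's probability, and the image is left unchanged otherwise. The block below is a sketch under assumed notation ($O$ for an operation, $p$ for its application probability, $\lambda$ for its magnitude, $\bar{O}(\cdot;\lambda)$ for the underlying deterministic transformation, $x$ for the image); the symbols are chosen here for illustration and may differ from the original typesetting.

```latex
O(x;\, p,\, \lambda) =
\begin{cases}
  \bar{O}(x;\, \lambda) & \text{with probability } p, \\
  x                     & \text{with probability } 1 - p.
\end{cases}
```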
Each operation comes with a maximum range of magnitudes to avoid extreme image transformations. For example, the Rotate operation is only allowed to rotate images by at most 30 degrees. The maximum range of magnitude for each operation is set to be the same as that reported in AutoAugment. Meanwhile, we normalize the magnitude parameter to within $[0, 1]$, where $1$ stands for the maximum acceptable magnitude. One example illustrating the operation is shown in Figure 2.
In general, previous automated data augmentation methods search for the top augmentation policy, which is a set of five sub-policies, with each sub-policy consisting of two operations to be applied to the original images in sequence. Let $\tau$ be a sub-policy that consists of two consecutive operations, namely, $\tau(x) = O_2(O_1(x; p_1, \lambda_1); p_2, \lambda_2)$. For convenience, we simplify the notation to $\tau(x)$. Given the search space, previous automated data augmentation methods explore and rank the possible policy candidates in a separate search process. Once the search process is over, the top five policies are collected to form a single final policy, which is a set containing 25 distinct sub-policies. The final policy is fixed throughout the training process. For each image in a mini-batch, only one sub-policy is randomly selected and applied [autoaug].
However, the fixed policy may be sub-optimal due to the following two factors. First, no sub-policy is universally better than all other sub-policies throughout the training process [timematters, pba, advaa]. For example, a sub-policy that reduces generalization error at the end of training is not necessarily a good sub-policy in the initial phase [affinity_diversity]. Second, the choices (hence the diversity) of the augmented data are limited by the fixed set of unique sub-policies. Based on the above analysis, we design our search space to be similar to AutoAugment’s search space, with two differences. First, inspired by Fast AutoAugment [fastautoaug], to introduce more stochasticity, we relax both the probability and the magnitude to continuous parameters with value range $[0, 1]$. Second, the final policy in our search space is defined as the universal set that contains all possible sub-policies. In contrast, the final policy in other works’ search spaces is a fixed set of 25 unique sub-policies. We note that RandAugment [randaug] samples sub-policies uniformly over a search space similar to ours. The major distinctions in RandAugment are that 1) the magnitude parameter is a fixed discrete integer value, and 2) the probability parameter is fixed to 1. That means RandAugment always applies operations to the original data. (We note that although RA always applies operations, it may still leave the original image unchanged, since its search space contains an identity operation.)
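The relaxed search space can be illustrated with a short sketch of how one sub-policy might be drawn and applied. This is not the authors' implementation: the function names `sample_sub_policy` and `apply_sub_policy`, the uniform sampling of probability and magnitude from $[0, 1]$, and the `transforms` lookup table are all assumptions made for illustration; only the operation names and the two-operations-per-sub-policy structure come from the text.

```python
import random

# Operation names as listed in the search-space description (16 in total).
OPERATIONS = [
    "Sharpness", "ShearX", "ShearY", "TranslateX", "TranslateY", "Rotate",
    "AutoContrast", "Invert", "Equalize", "Solarize", "Posterize", "Color",
    "Brightness", "Cutout", "SamplePairing", "Contrast",
]


def sample_sub_policy(rng, num_ops=2):
    """Draw one sub-policy: `num_ops` (name, probability, magnitude) triples.

    Assumption: probability and normalized magnitude are sampled uniformly
    from the continuous range [0, 1].
    """
    return [
        (rng.choice(OPERATIONS), rng.random(), rng.random())
        for _ in range(num_ops)
    ]


def apply_sub_policy(x, sub_policy, transforms, rng):
    """Apply the operations in sequence; each one fires with probability p.

    `transforms` is a hypothetical dict mapping an operation name to a
    callable `(image, magnitude) -> image`.
    """
    for name, p, magnitude in sub_policy:
        if rng.random() < p:
            x = transforms[name](x, magnitude)
    return x
```

Setting every probability to 1 and every magnitude to one fixed value would recover the RandAugment-style behaviour described above, in which operations are always applied.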
3.2 Regularization Effects of Data Augmentation
We derive the regularization effect of data augmentation following the theoretical analysis in [kernelda]. We start by introducing the setting and notation of representation learning. Consider a neural network $f_\theta$ parameterized by $\theta$, which maps an input $x$ into a vector representation with $K$ output dimensions. We aim to minimize a loss function over a dataset $\{(x_i, y_i)\}_{i=1}^{N}$. Let $\phi_\theta(x) = \mathrm{Softmax}(f_\theta(x))$ be the probability vector, where the Softmax function normalizes $f_\theta(x)$ into a probability distribution. We denote the loss function to be minimized as $\frac{1}{N}\sum_{i=1}^{N} l(\phi_\theta(x_i), y_i)$. We denote the gradient of $l$ with respect to its first argument as $\nabla l$. Similarly, we use $H$ to represent the Hessian matrix of $l$ with respect to its first argument. We use $\tau$ to represent a sub-policy, and $\mathcal{T}$ is the set of all available sub-policies. $\tau(x)$ is the augmented data in the vicinity of $x$ obtained by applying $\tau$ to $x$. We use $\langle \cdot, \cdot \rangle$ to denote the inner product. For a set $S$, we use $|S|$ to represent its cardinality. With these notations, after applying data augmentation, the new loss function (Equation (1)) becomes the average of $l(\phi_\theta(\tau(x_i)), y_i)$ over all samples $x_i$ and all sub-policies $\tau \in \mathcal{T}$.
Suppose data augmentation does not significantly modify the feature map. Using the first-order Taylor approximation, we can expand Equation (1) around a point $\psi_i$ for each sample, which gives Equation (2).
The second term in Equation (2) can be cancelled by picking $\psi_i = \bar{\phi}_i$, where $\bar{\phi}_i$ is the averaged probability vector of all samples within the vicinity of $x_i$. If we further expand Equation (1) around $\bar{\phi}_i$ by considering the second-order term, we obtain Equation (3).
In Equation (3), $\Delta_{i,\tau} = \phi_\theta(\tau(x_i)) - \bar{\phi}_i$ is the difference between the probability vector of the augmented data $\tau(x_i)$ and the averaged probability vector $\bar{\phi}_i$. The second term in Equation (3) is the so-called “data-driven regularization term”, which is exactly the variance of the probability vector $\phi_\theta(\tau(x_i))$, weighted by $H$. That means employing augmented data imposes a regularization effect by implicitly controlling the variance of the model’s outputs.
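The three steps of this derivation can be sketched as follows. The notation is an assumption of this sketch rather than a quotation of the original equations: $\phi_\theta(\cdot)$ denotes the Softmax probability vector, $l$ the per-sample loss, $\mathcal{T}$ the set of sub-policies, $\psi_i$ the expansion point, $\bar{\phi}_i$ the averaged probability vector over the vicinity of $x_i$, and $H$ the Hessian of $l$ with respect to its first argument. The factor $\tfrac{1}{2}$ is the one produced by a standard second-order Taylor expansion.

```latex
\begin{align}
% (1) loss after applying data augmentation
\hat{L}(\theta)
  &= \frac{1}{N}\sum_{i=1}^{N}\frac{1}{|\mathcal{T}|}
     \sum_{\tau\in\mathcal{T}} l\big(\phi_\theta(\tau(x_i)),\, y_i\big) \\
% (2) first-order expansion around \psi_i
\hat{L}(\theta)
  &\approx \frac{1}{N}\sum_{i=1}^{N}\Big[\, l(\psi_i, y_i)
     + \frac{1}{|\mathcal{T}|}\sum_{\tau\in\mathcal{T}}
       \big\langle \nabla l(\psi_i, y_i),\;
       \phi_\theta(\tau(x_i)) - \psi_i \big\rangle \Big] \\
% choice of expansion point that cancels the inner-product term
\bar{\phi}_i
  &= \frac{1}{|\mathcal{T}|}\sum_{\tau\in\mathcal{T}} \phi_\theta(\tau(x_i)),
  \qquad
  \Delta_{i,\tau} = \phi_\theta(\tau(x_i)) - \bar{\phi}_i \nonumber \\
% (3) second-order expansion around \bar{\phi}_i
\hat{L}(\theta)
  &\approx \frac{1}{N}\sum_{i=1}^{N} l(\bar{\phi}_i, y_i)
     + \frac{1}{2N}\sum_{i=1}^{N}\frac{1}{|\mathcal{T}|}
       \sum_{\tau\in\mathcal{T}}
       \Delta_{i,\tau}^{\top}\, H(\bar{\phi}_i, y_i)\, \Delta_{i,\tau}
\end{align}
```

The first-order term vanishes at $\psi_i = \bar{\phi}_i$ because the deviations $\Delta_{i,\tau}$ average to zero over $\tau \in \mathcal{T}$. With $H$ replaced by the identity, the second term is, up to a constant factor, the mean squared deviation of the probability vectors from their average.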
3.3 The DivAug Framework
To establish the relationship between the diversity of augmented data and their regularization effect, we propose a new diversity measure, called Variance Diversity, for the augmented data whose regularization effect can be quantified. Based on this, we derive a sampling-based framework that explicitly maximizes the Variance Diversity of the augmented data.
3.3.1 Diversity Measure of Augmented Data
We start by proposing a new diversity measure for augmented data, whose regularization effect can be quantified.
From Equation (3), after training models on augmented data, a data-driven regularization term can be decomposed from the loss function. Accordingly, we quantify the diversity of a set of augmented data by the variance of their corresponding probability vectors. Formally, given a model $f_\theta$ and a set of augmented data $S$ whose elements are all generated from the same original data $x$ by applying different sub-policies, we define the diversity of $S$ in Equation (4) as the variance of the probability vectors $\mathrm{Softmax}(f_\theta(\tilde{x}))$ corresponding to the augmented data $\tilde{x} \in S$, measured around their average.
If CrossEntropy is used as the loss function, then the Hessian matrix is a diagonal matrix whose diagonal elements are all zero, except for the one corresponding to the true label. This implies that under the supervised setting, only the variance of the probability associated with the true label is penalized. We can extend this penalty effect to the unsupervised domain by setting the Hessian matrix in Equation (3) to the identity matrix. In this way, Equation (3) penalizes the variance of the probability associated with any class. We note that this is essentially consistency regularization, one of the key techniques in semi-supervised and self-supervised learning, which encourages the model to produce similar probability vectors when the input data is perturbed by noise [uda, mixmatch]. Moreover, if the Hessian matrix in Equation (3) is set to the identity matrix, the diversity of augmented data is exactly the data-driven regularization term in Equation (3).
According to Equation (4), we name our diversity measure “Variance Diversity”. We note that this is an unsupervised, model-specific measure, which depends only on the model prediction without involving any label information. Intuitively, as illustrated in Figure 3, if a set of augmented data has large Variance Diversity, their corresponding probability vectors are far away from each other. Therefore, it is harder for models to give consistent predictions for diversely augmented data. This forces the models to generalize over the vicinity of the original data.
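As a concrete illustration, the measure can be computed from model outputs alone. The sketch below assumes that Variance Diversity is the mean squared Euclidean distance between each probability vector and the average of the set; the exact normalization constant of Equation (4) is an assumption here, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable Softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def variance_diversity(prob_vectors):
    """Mean squared Euclidean distance of each probability vector to the mean.

    `prob_vectors` holds one probability vector per augmented copy of the
    same original input. No label information is used.
    """
    n = len(prob_vectors)
    k = len(prob_vectors[0])
    mean = [sum(p[c] for p in prob_vectors) / n for c in range(k)]
    return sum(
        sum((p[c] - mean[c]) ** 2 for c in range(k)) for p in prob_vectors
    ) / n
```

A set of augmented copies on which the model predicts identically has zero diversity under this definition, while copies that push the prediction towards different classes yield a larger value.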
3.3.2 Design of DivAug
According to the definition of Variance Diversity and Equation (3), an increase in Variance Diversity directly strengthens the regularization effect of augmented data. Based on this insight, our DivAug framework generates a set of diversely augmented data and minimizes the loss over them. Specifically, DivAug consists of two stages: the expanding stage and the selection stage. At the expanding stage, for each original input, we first randomly generate a set of sub-policies and apply them to obtain the corresponding set of augmented data. The second stage is the selection stage, where we sub-sample a subset of the augmented data. Then we feed the selected augmented data to the model. Our DivAug framework is illustrated in Figure 1. Formally, with the notations introduced in Section 3.2 and Section 3.3.1, we minimize the training loss over the selected subset, where, following Equation (6), the subset is chosen so that its corresponding probability vectors have maximum variance.
Unfortunately, solving Equation (6) exactly poses a significant computational hurdle. Instead of computing the optimal solution, we efficiently sample the subset with the k-means++ seeding algorithm [kppseeding], which was originally designed to generate a good initialization for k-means clustering. k-means++ seeding selects centroids by iteratively sampling points in proportion to their squared distances from the closest centroid that has already been chosen. Here, we define the distance between a pair of probability vectors as their Euclidean distance. Therefore, k-means++ samples a subset of augmented data whose probability vectors are far apart from each other, which in practice leads to a large Variance Diversity. The k-means++ seeding algorithm is shown in Algorithm 1 in Appendix A. The full DivAug procedure is given as a separate algorithm; we remark that each operation in it is randomly generated. DivAug has two hyperparameters: the number of augmented images generated per input image, and the number of selected augmented images per input image used for training. These two hyperparameters do not need to be tuned on proxy tasks and can be chosen according to the available computational resources. Similar to RandAugment, DivAug is a sampling-based method that does not require a separate search process. Note that no label information is involved in the DivAug algorithm, which means DivAug is suitable for both semi-supervised learning and supervised learning.
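The selection stage can be sketched in a few lines. This is a minimal illustration of k-means++ seeding applied to probability vectors, not the authors' code: the function name `kmeanspp_select` is hypothetical, the first center is drawn uniformly at random as in the standard algorithm (an assumption about how DivAug initializes it), and the caller is assumed to request no more items than there are candidates.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeanspp_select(prob_vectors, num_select, rng):
    """Return indices of `num_select` vectors chosen by k-means++ seeding."""
    n = len(prob_vectors)
    chosen = [rng.randrange(n)]  # first center: uniform at random
    # Squared distance from every candidate to its closest chosen center.
    d2 = [squared_distance(p, prob_vectors[chosen[0]]) for p in prob_vectors]
    while len(chosen) < num_select:
        if sum(d2) == 0.0:
            # Every candidate coincides with a center: fall back to uniform.
            nxt = rng.choice([i for i in range(n) if i not in chosen])
        else:
            # Sample in proportion to squared distance to the nearest center.
            nxt = rng.choices(range(n), weights=d2, k=1)[0]
        chosen.append(nxt)
        d2 = [
            min(d, squared_distance(p, prob_vectors[nxt]))
            for d, p in zip(d2, prob_vectors)
        ]
    return chosen
```

In a training loop, `prob_vectors` would come from a forward pass of the current model over the augmented candidates of one image, and only the selected candidates would be used for the gradient step.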
Our experiments aim to answer the following research questions:
RQ1. What is the effect of Variance Diversity on model generalization?
RQ2. How effective is the proposed DivAug compared with other automated data augmentation methods under the supervised settings?
RQ3. How well does DivAug improve the performance of semi-supervised learning algorithms?
4.1 Experimental Settings
Below, we first introduce the datasets and their default augmentation methods. Then, we introduce the hyperparameter settings of DivAug (the two hyperparameters of the DivAug algorithm) and the baseline methods for comparison.
, excluding the ImageNet experiment. For ImageNet, we setand due to limited resources. For the semi-supervised learning experiment, we set and . We did not tune these two hyperparameters, and we choose them mainly according to the available GPU memory.
The methods for comparison are as follows. We compare DivAug with AutoAugment (AA) [autoaug], Fast AutoAugment (Fast AA) [fastautoaug], Population Based Augmentation (PBA) [pba], RandAugment (RA) [randaug], and Adversarial AutoAugment (Adv. AA) [advaa]. For each image, the augmentation policy proposed by each method and the default augmentation are applied in sequence.
4.2 Correlation Between Variance Diversity and Generalization
To answer RQ1, we calculate the Variance Diversity of augmented data generated by AA, Fast AA, RA, the default augmentation introduced in Section 4.1, and DivAug. (We do not include Adv. AA because the official code has not been released. For PBA, the official code is based on Ray and is hard to migrate to our codebase for a fair comparison.) Then, we report the test accuracy of models trained on augmented data generated by the different methods.
Because Variance Diversity is an unsupervised, model-specific measure, for a fair comparison, we first train a Wide-ResNet-40-2 model on CIFAR-10 without applying any data augmentation methods. Then we use it as the model in Equation (4) to evaluate all the different automated data augmentation methods. To verify the correlation between generalization and Variance Diversity, we calculate the Variance Diversity of augmented data as follows: for each image in the training set, an automated augmentation method is used to randomly generate four augmented images. Then we calculate the Variance Diversity of these four images according to Equation (4). We report the Variance Diversity averaged over the entire training set in Figure 4.
Figure 4 demonstrates that the performance gain and Variance Diversity are positively correlated (the detailed test accuracy is shown in the first row of Table 2). As shown in the figure, all automated data augmentation methods improve the Variance Diversity of augmented data over the default augmentation. Specifically, AA and Fast AA have small Variance Diversity. This makes sense because both of them try to minimize the distribution shift of the augmented data from the original distribution. For example, Fast AA treats the augmented data as missing points in the training set. As a result, for CIFAR-10, none of the reported sub-policies proposed by AA and Fast AA contains the counter-intuitive operation Sample Pairing [autoaug, fastautoaug], which limits the Variance Diversity of the augmented data generated by them. In contrast, DivAug has the largest Variance Diversity because it explicitly tries to maximize the Variance Diversity of the augmented data. Notice that RA has larger Variance Diversity than AA and Fast AA. This might be a result of RA sampling operations randomly: RA samples more distinct sub-policies than AA and Fast AA do, which leads to larger diversity. Here we remark that although RA has larger Variance Diversity than AA and Fast AA, the model’s relative gain in accuracy is smaller than those of AA and Fast AA. We provide a detailed analysis in Appendix C.
4.3 The Effectiveness of DivAug Under the Supervised Settings
|Dataset||Model||Baseline||AA||Fast AA||PBA||RA||Adv. AA||DivAug|
|Shake-Shake (26 2x96d)||97.1||98.0||98.0||98.0||98.0||98.1||98.1 ± 0.1|
|Shake-Shake (26 2x96d)||82.9||85.7||85.1||84.7||-||85.9||85.3 ± 0.2|
The main purpose of automated data augmentation is to further improve the generalization of models over traditional data augmentation techniques. To answer RQ2, we compare our proposed method with several baselines under the supervised learning settings.
4.3.1 Experiment on CIFAR-10 and CIFAR-100
Following [autoaug, fastautoaug, randaug], we evaluate our proposed method with the following models: Wide-ResNet-28-10, Wide-ResNet-40-2 [wideresnet], Shake-Shake (26 2x96d) [shakeshake], and PyramidNet+ShakeDrop [shakedrop, pyramidnet]. The details of hyperparameters are shown in Appendix Table 5.
CIFAR-10 Results: In Table 2, we report the test accuracy of these models. For all of these models, our proposed method achieves better performance than previous methods. On Wide-ResNet-28-10, DivAug improves over AA, Fast AA, PBA, and RA (see Table 2). Overall, DivAug significantly improves performance over the baselines while achieving performance comparable to that of Adv. AA.
The effect of k-means++: As discussed in Section 3.1, RA basically samples sub-policies uniformly in our search space. In contrast, DivAug samples sub-policies using the k-means++ seeding algorithm, which pushes the augmented data farther away from each other in the decision space of a given model. Thus, RA can be viewed as the random version of DivAug: the sub-policies picked by RA have an identical percentage of different operations throughout the training process. To understand the effect of k-means++ and how DivAug improves the test accuracy over RA, we further visualize the distribution of sub-policies selected by DivAug with Wide-ResNet-40-2 on CIFAR-10 over the training process. We found that the percentages of some operations picked from the sampled sub-policies, such as TranslateY, ShearY, Posterize, and Sample Pairing, gradually increase over the training process. In contrast, the percentages of some color-based operations, such as Invert, Brightness, AutoContrast, and Color, gradually decrease over the training process. In Figure 5, we plot the statistics of the two most contrasting operations that exhibit this phenomenon, namely, Posterize and Invert. This behavior is consistent with the finding that no operation beats all other operations throughout the training process [pba, advaa]. Also, the average probability of applying operations in the selected sub-policies slowly increases during training. That means DivAug tends to mildly shift the distribution of augmented images away from the original one over the training process. Together, these observations suggest that the sub-policies selected by DivAug evolve throughout the training process.
Training Efficiency Analysis: DivAug is estimated to be significantly faster than Adv. AA for the following reasons. Following the time cost metric in [wu2020generalization], we estimate that the inference cost (see line 7 of the DivAug algorithm) equals half of the training cost. Under the hyperparameter setting used in our experiments, DivAug generates four times more augmented data for training. In contrast, Adv. AA needs to generate eight times more augmented data to achieve the results reported in Table 2. Moreover, it also needs a separate phase to search for the best policy, although the search time for Adv. AA is not reported in [advaa]. The estimated costs are summarized in Table 3.
|Cost||Baseline||Adv. AA||DivAug|
|Training (×)||1.0||8.0 + Search Cost||4.5|
CIFAR-100 Results: As shown in Table 2, DivAug generally achieves non-trivial performance gain over all other methods excluding Adv. AA. However, we note that DivAug does not require label information or a separate search process. Also, DivAug is significantly faster than Adv. AA.
4.3.2 Experiment on ImageNet
Following [autoaug, fastautoaug, randaug], we select ResNet-50 [he2016deep] to evaluate our proposed method. The details of the hyperparameters are shown in Appendix Table 5. As shown in Table 2, DivAug outperforms the other baselines except Adv. AA. We remark that, due to limited resources, the two hyperparameters of DivAug are set to small values for ImageNet. The performance gain from DivAug is expected to improve further with larger values.
4.4 The Effectiveness of DivAug Under the Semi-Supervised Setting
One of the key techniques in semi-supervised learning [chapelle2009semi] (SSL) is consistency regularization, which encourages the model to produce similar probability vectors when the input data is perturbed by noise. It has been proven that the augmented data produced by state-of-the-art automated methods can serve as a superior source of noise under the consistency regularization framework [uda, fixmatch]. Specifically, UDA [uda] utilizes RA as the source of perturbation and achieves non-trivial performance gain. Also, it has been theoretically shown that the success of UDA stems from the diversity of augmented data generated by RA [uda].
However, most automated data augmentation methods require label information to search for the best policy. Thus, this prerequisite limits their application in SSL. In contrast, our proposed method is suitable for SSL because it is unsupervised and tries to explicitly maximize diversity. This leads to the following question: can SSL benefit from our proposed DivAug (RQ3)? To answer this question, following UDA, we change the source of perturbation from RA to DivAug (detailed hyperparameters are shown in the Appendix). Here, we report the averaged results over four trials. As shown in Table 4, DivAug can further boost the performance of UDA under different settings. Moreover, the performance gap grows larger when there is less labeled data available. This might be because, when there is limited labeled data, the regularization effect brought by diversity plays a much bigger role in model performance.
In this work, we propose a new diversity measure called Variance Diversity by investigating the regularization effect of data augmentation. We validate in experiments that the performance gain from automated data augmentation is highly correlated to Variance Diversity. Based on this measure, we derive the DivAug framework to explicitly maximize Variance Diversity during training. We demonstrate our proposed method has the practical utility of achieving better performance without the need to search for top policies in a separate phase. Therefore, DivAug can benefit both the supervised tasks and the semi-supervised tasks.
Appendix A k-means++ Seeding Algorithm
As shown in Algorithm 1, the core idea of the k-means++ seeding algorithm is to sample centers sequentially, where each new center is sampled with probability proportional to the squared distance to its nearest already-chosen center. The set of centers returned by Algorithm 1 is theoretically guaranteed to be far away from each other [kppseeding].
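The sampling rule can also be stated as a formula. As a sketch under assumed notation (the symbols $V$ for the candidate set, $C$ for the set of centers chosen so far, and $d$ for the Euclidean distance are introduced here for illustration), a candidate $v$ is drawn as the next center with probability:

```latex
\Pr[\text{next center} = v]
  = \frac{\min_{c \in C} d(v, c)^2}
         {\sum_{u \in V} \min_{c \in C} d(u, c)^2},
\qquad v \in V .
```

Candidates that coincide with an existing center have zero numerator and are therefore never drawn again, which is what keeps the selected centers spread out.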
|Dataset||Model||Batch Size||LR||WD||Epoch||LR Schedule|
|Shake-Shake (26 2x96d)||128||0.2||1e-4||600||cosine|
|Shake-Shake (26 2x96d)||128||0.1||5e-4||1200||cosine|
Appendix B The details about the benchmark datasets
The detailed statistics and the default data augmentation for the benchmark datasets are listed below.
CIFAR-10 & CIFAR-100 [cifardataset]: The training sets of the two datasets are composed of 50,000 colored images with 10 and 100 classes, respectively. Each image in these two datasets has a size of 32×32. For the CIFAR datasets, the default augmentation crops the padded image at a random location, and then horizontally flips it with a probability of 0.5. Then, it applies Cutout [cutout] to randomly select a patch of the image and sets the pixels within the selected patch to zero.
SVHN [svhn]: This dataset contains color house-number images, with 73,257 core images for training and 26,032 digits for testing. The default augmentation crops the padded image at a random location. Then it applies Cutout to randomly select a patch of the image and sets the pixels within the selected patch to zero.
ImageNet [imagenet]: ImageNet includes colored images of 1,000 classes. The training set has roughly 1.2M images, and the validation set has 50,000 images. The default augmentation randomly crops and resizes images to a fixed size, and then horizontally flips them with a probability of 0.5. Subsequently, it applies ColorJitter and PCA to the flipped image [krizhevsky2012imagenet].
Appendix C Detailed Analysis For The Correlation between Variance Diversity and Generalization
Recently, two measures, Affinity and Diversity, were introduced in [affinity_diversity] for quantifying distribution shift and augmentation diversity, respectively. Across several benchmark datasets and models, it has been observed that the performance gain from data augmentation can be predicted not by either of these alone but by jointly optimizing the two [affinity_diversity]. Specifically, Affinity quantifies how much a sub-policy shifts the training data distribution from the original one. For a set of augmented data, our proposed diversity measure is calculated based on the variance of their probability vectors. In contrast, the diversity measure proposed in [affinity_diversity], which we refer to as Loss Diversity, is defined as the training loss of a given model over the augmented data. Below, we give the formal definitions of Affinity and Loss Diversity:
Definition 1 (Affinity [affinity_diversity]).
Let $D_{train}$ and $D_{val}$ be training and validation datasets drawn i.i.d. from the same clean data distribution, and let $D'_{val}$ be derived from $D_{val}$ by applying a stochastic augmentation strategy, $a$, once to each image in $D_{val}$, $D'_{val} = \{(a(x), y) : \forall (x, y) \in D_{val}\}$. Further let $m$ be a model trained on $D_{train}$ and $\mathcal{A}(m, D)$ denote the model’s accuracy when evaluated on dataset $D$. The affinity is defined as:
Definition 2 (Loss Diversity [affinity_diversity]).
Let be the training set, and be the augmented training set resulting from applying a stochastic augmentation strategy . For a set of augmented data , where is obtained by applying to , stochastically. Further, given a model which is trained on , let be the training loss corresponding to . The Loss Diversity between , , is defined as:
As analyzed above, given a set of augmented data with large Variance Diversity, it is hard for models to give consistent predictions for them, which results in a large training loss. Thus, Loss Diversity and Variance Diversity are highly correlated. The main difference between them is that Variance Diversity is an unsupervised measure, i.e., Variance Diversity does not depend on label information.
We further plot the performance gain from each augmentation method against the Affinity, Loss Diversity, and Variance Diversity of the augmented data generated by it in Figure 6. In the legend, the marker size indicates the test accuracy of a Wide-ResNet-40-2 model trained with the different automated data augmentation methods (the detailed results are shown in the first row of Table 2). Figure 6 demonstrates that Loss Diversity and Variance Diversity are highly correlated, which is consistent with our theoretical analysis. Following [affinity_diversity], we show the Affinity and Variance Diversity of augmented data generated by different methods in Figure 6 (b). There is a clear trend that Loss Diversity and Variance Diversity trade off against Affinity to some extent. We remark that although RA has larger Variance Diversity than AA and Fast AA, the performance gain from RA is smaller. According to the hypothesis in [affinity_diversity], this can be explained by RA having smaller Affinity than AA and Fast AA. In contrast, although DivAug has the largest Variance Diversity, the largest Loss Diversity, and the smallest Affinity, DivAug performs best in terms of test accuracy. We hypothesize that there might exist a sweet spot between Diversity and Affinity, and how to reach this sweet spot is an interesting future direction for automated data augmentation methods.
Appendix D Experiment Details
For the semi-supervised learning experiment in Section 4.4, we follow the settings in [uda], employ Wide-ResNet-28-2 [wideresnet] as the backbone model, and evaluate UDA [uda] with varied supervised data sizes. For the experiments on CIFAR-10 with supervised data sizes of 1000, 2000, and 4000, the hyperparameters are identical and are as follows: we train the backbone model for 200K steps. We use a batch size of 32 for labeled data and a batch size of 448 for unlabeled data. The softmax temperature is set to 0.4. The confidence threshold is set to 0.8. The backbone model is trained by an SGD optimizer with a learning rate of 1e-4, weight decay of 5e-4, and Nesterov momentum with the momentum hyperparameter set to 0.9. We remark that all hyperparameters are identical to those reported in [uda], except for two differences: we train the backbone model for 200K steps instead of 500K, and we do not apply Exponential Moving Average to the parameters of the backbone model.