1 Introduction
Data augmentation has emerged as an effective data preprocessing or transformation step in machine learning pipelines such as deep neural networks: it mitigates overfitting Perez2017TheEO , encourages local smoothness mixup , and improves generalization Balestriero2022DAclass . Notably, effective data augmentation, which incorporates class-related data invariance and enriches the in-class samples, is one of the key contributing factors for representation learning with weak or self-supervision chen2020simple ; FixMatch .

Given a task, we aim to generate “good” augmentations efficiently. As part of the machine learning pipeline, an autonomous, domain-agnostic but task-informed data augmentation mechanism is desirable. However, a number of challenges exist. (1) Existing augmentation operators are usually handcrafted based on domain expert knowledge, which is not always available in some domains Yang2021MedMNIST . For example, widely used augmentations for natural images are not effective on medical images. Moreover, the performance of those machine learning pipelines varies drastically with different choices of data augmentations. (2) The few autonomous augmentation approaches developed recently are neither fully autonomous nor universally applicable across domains. Although a few autonomous data augmentation approaches have been developed in recent years AutoAugment ; RandAugment , they train policies to produce a sequence of predefined augmentation operations and thus are not fully automated and are limited to a few domains. (3) Existing augmentations usually do not fully utilize the task feedback (i.e., they are task-agnostic) and may be suboptimal for the targeted task. A class of automated data augmentation methods trains an extra generative model to generate new augmentations from scratch given a real-world example Antoniou2017DAGAN . However, training a generative model is a nontrivial task in practice that relies on either strong prior knowledge or a substantially increased number of training examples.
In this paper, we first investigate the conditions required to generate domain-agnostic but task-informed data augmentations. Considering a representation learning pipeline, we start from a probabilistic graphical model that describes the relations among the label, the nuisance, the example, its augmentation, and the latent representation. We argue that a minimal sufficient representation for the task preserves the label information but excludes distracting information from the nuisance. We then investigate the conditions for an augmentation that results in learning such preferred representations. These conditions motivate an optimization objective that can be used to produce automated, domain-agnostic but task-informed data augmentations for each example, without relying on predefined augmentation operators or specific domain knowledge. Consequently, our proposed optimization objective addresses all the aforementioned challenges.
For practicality, we further propose a surrogate of the derived objective that can be efficiently computed from the intermediate-layer representations of the model-in-training. The surrogate is built upon data likelihood estimation through the perceptual distance laidlaw2020perceptual defined on the intermediate layers’ representations. Specifically, our proposed surrogate objective maximizes the perceptual distance between an example $x$ and its augmentation $x'$, under a label-preserving constraint on the model prediction of $x'$. This problem can be efficiently solved by optimizing its Lagrangian relaxation. Thereby, given $x$ and its label $y$, the solution to our surrogate objective generates “hard positive examples” for $x$ without losing its label information. Once generated, $x'$ is used to train the model towards producing the minimal sufficient representation for the targeted task. Our proposed method, named Label-Preserving Adversarial Auto-Augment (LP-A3), does not require any extra generative models such as Generative Adversarial Networks, unlike previous automated augmentation methods tian2020makes . We further propose a sharpness-aware criterion that selects only the most informative examples to apply our auto-augmentation to, so it does not incur expensive extra computation.

Our proposed LP-A3 is a general and autonomous data augmentation technique applicable to a variety of machine learning tasks, such as supervised, semi-supervised and noisy-label learning. Moreover, we demonstrate that it can be seamlessly integrated with existing algorithms for these tasks and consistently improve their performance. In experiments on the three learning tasks, we equip existing methods with LP-A3 and obtain significant improvements in both learning efficiency and final performance. The generated augmentations are optimized for the model-in-training in a target-task-aware manner and thus notably accelerate the slow convergence in computationally intensive tasks such as semi-supervised learning. It is worth noting that our augmentation consistently brings improvement to tasks without domain knowledge or strong predefined augmentations, such as medical image classification, on which previous image augmentations lead to performance degradation.
2 Related work
Handcrafted vs. Autonomous Data Augmentations. Most of the existing widely used data augmentations are handcrafted based on domain expert knowledge oord2018representation ; tian2020contrastive ; misra2020self ; chen2020simple ; cubuk2020randaugment . For example, MoCo he2020momentum and InstDis wu2018unsupervised create augmentations by applying a stochastic but predefined data augmentation function to the input. CMC tian2020contrastive splits images across color channels. PIRL misra2020self generates data augmentations through random JigSaw shuffling. CPC oord2018representation renders strong data augmentations by utilizing RandAugment RandAugment , which learns a policy producing a sequence of predefined augmentation operations selected from a pool AutoAugment . AdvAA zhang2019adversarial designs an adversarial objective to learn the augmentation policy. RandAugment ; AutoAugment ; zhang2019adversarial are all based on predefined operations, which are not available in certain domains, and their objectives cannot guarantee that the generated data preserve labels, which may lead to suboptimal performance. The “InfoMin” principle of data augmentation is proposed in tian2020makes to minimize the mutual information between different views. However, their theory depends on access to a minimal sufficient encoder, which may be difficult to obtain. In contrast, we not only consider how to generate optimal views or augmentations, but also consider generating the minimal sufficient representation. The algorithm of tian2020makes deploys a generator to render augmentations (which may be costly to train, especially on non-natural-image domains), while we directly learn the augmentation through gradient descent w.r.t. the input.
Information Theory for Representation Learning. Information theory has been introduced in deep learning to measure the quality of representations tishby2015deep ; achille2018emergence . The key idea is to use information bottleneck methods tishby2000information ; tishby2015deep to encourage the learned representation to be minimal sufficient. Mutual information objectives are commonly used in self-supervised learning. For example, the InfoMax principle linsker1988self used by many works aims to maximize the mutual information between the representation and the input tian2020contrastive ; bachman2019learning ; wu2020importance . But simply maximizing the mutual information does not always lead to a better representation in practice tschannen2019mutual . In contrast, the InfoMin principle tian2020makes minimizes the mutual information between different views. Both the InfoMax and InfoMin principles can be associated with our proposed representation learning criteria in Section 4, as they lead to sufficiency and minimality of the learned representation, respectively.

Augmentation in Self-supervised Contrastive Learning. Self-supervised contrastive representation learning oord2018representation ; hjelm2018learning ; wu2018unsupervised ; tian2020contrastive ; sohn2016improved ; chen2020simple learns representations by optimizing a contrastive loss that pulls similar pairs of examples closer while pushing dissimilar pairs apart. Creating multiple views of each example is crucial for the success of self-supervised contrastive learning. However, most of the data augmentation methods used to generate views, although sophisticated, are handcrafted rather than learning-based. Some use luminance and chrominance decomposition tian2020contrastive , while others use random augmentations from a pool of augmentation operators wu2018unsupervised ; chen2020simple ; bachman2019learning ; he2020momentum ; ye2019unsupervised ; srinivas2020curl ; zhao2021distilling ; zhuang2019local . Recently, adversarial-perturbation-based augmentation has been proposed to generate more challenging positives/negatives for contrastive learning yang2022identity ; ho2020contrastive .
Augmentation in Semi-supervised Learning. Data augmentation plays an important role in semi-supervised learning, e.g., (1) consistency regularization Consistency ; PiModel enforces the model to produce similar outputs for a sample and its augmentations; (2) pseudo-labeling PseudoLabel trains a model using confident predictions produced by itself MeanTeacher on unlabeled data. Data augmentations are critical FixMatch ; ReMixMatch because they determine both the output targets and the input signals: (1) accurate pseudo labels are achieved by averaging the predictions over multiple augmentations; (2) weak augmentations (e.g., flip-and-shift) are important to produce confident pseudo labels, while strong augmentations AutoAugment ; RandAugment are used to train the model and expand the confidence regions (so more confident pseudo labels can be collected later). Data selection FlexMatch ; zhou2020time for high-quality pseudo labels is also critical, and its criterion is estimated on augmentations, e.g., the confidence MixMatch or time-consistency zhou2020time of each sample.
Augmentation in Noisy-label Learning. Two primary challenges in noisy-label learning are clean-label detection Liu2016ClassificationWN ; han2018co ; jiang2018mentornet and noisy-label correction by pseudo labels reed2014training ; arazo2019unsupervised ; Li2020DivideMix . Both depend significantly on the choices of data augmentations: the former usually relies on confidence thresholding, and augmentations help rule out overconfident samples, while the latter relies on the quality of semi-supervised learning. Moreover, as shown in previous works Li2020DivideMix ; zhou2021robust , removing strong augmentations such as RandAugment can considerably degrade noisy-label learning performance.
3 Preliminaries
Basics of Information Theory. Our analyses make frequent use of information-theoretic quantities cover1991information . Given a joint distribution $p(x, y)$ and its marginal distributions $p(x)$ and $p(y)$, we define the entropies $H(\mathbf{X}) = -\sum_{x} p(x) \log p(x)$, $H(\mathbf{Y}) = -\sum_{y} p(y) \log p(y)$, and $H(\mathbf{X}, \mathbf{Y}) = -\sum_{x, y} p(x, y) \log p(x, y)$. Furthermore, we define the conditional entropy of $\mathbf{Y}$ given $\mathbf{X}$ as $H(\mathbf{Y} \mid \mathbf{X}) = H(\mathbf{X}, \mathbf{Y}) - H(\mathbf{X})$. Finally, we define the mutual information between $\mathbf{X}$ and $\mathbf{Y}$ as $I(\mathbf{X}; \mathbf{Y}) = H(\mathbf{Y}) - H(\mathbf{Y} \mid \mathbf{X})$.

Notations and Problem Setup. In this paper, we use bold capital letters (e.g., $\mathbf{X}$) to denote random variables, lowercase letters (e.g., $x$) to denote their realizations, and calligraphic capital letters (e.g., $\mathcal{X}$) to denote the corresponding sample spaces. Since we mainly consider supervised and semi-supervised problems, let $p(x, y)$ be the joint distribution of the data observation $\mathbf{X}$ and label $\mathbf{Y}$, where $\mathbf{X}$ is a random vector taking values in a finite observation space $\mathcal{X}$ (e.g., images) and $\mathbf{Y}$ is a discrete random variable taking values in the label space $\mathcal{Y}$ (e.g., classes). Our goal is to learn a classifier to predict $\mathbf{Y}$ from an observation $x$.

Task-nuisance Decomposition. To advance the analysis, we decouple the randomness in $\mathbf{X}$ into two parts, one pertaining to the label and the other independent of it. Concretely, we define a random variable, the nuisance $\mathbf{N}$, such that 1) the nuisance is independent of the label, i.e., $\mathbf{N} \perp \mathbf{Y}$; and 2) the observation is a deterministic function of the nuisance and the label, i.e., $\mathbf{X} = \varphi(\mathbf{Y}, \mathbf{N})$ for some function $\varphi$. Lemma 3.1 demonstrates that such a random variable always exists.
Lemma 3.1 (Task-nuisance Decomposition achille2018emergence ).
Given a joint distribution $p(x, y)$, where $\mathbf{Y}$ is a discrete random variable, we can always find a random variable $\mathbf{N}$ independent of $\mathbf{Y}$ such that $\mathbf{X} = \varphi(\mathbf{Y}, \mathbf{N})$, for some deterministic function $\varphi$.
Remarks. We can rewrite the conditions of the task-nuisance decomposition in terms of information theory. 1) Since the nuisance $\mathbf{N}$ is independent of the label $\mathbf{Y}$, we have $I(\mathbf{Y}; \mathbf{N}) = 0$; and 2) since the nuisance and the label determine the observation $\mathbf{X}$, we have $H(\mathbf{X} \mid \mathbf{Y}, \mathbf{N}) = 0$.
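The two information-theoretic conditions above can be checked numerically on a toy discrete distribution. Below is a minimal sketch; the helper names and the joint table with $\varphi(y, n) = 2y + n$ are made up purely for illustration:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a probability table; zeros are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mutual_information(p_xy):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a 2-D joint probability table."""
    return entropy(p_xy.sum(axis=1)) + entropy(p_xy.sum(axis=0)) - entropy(p_xy)

# Toy decomposition: Y and N independent and uniform on {0, 1},
# with the observation X = phi(Y, N) = 2*Y + N (deterministic).
p_yn = np.full((2, 2), 0.25)                  # joint table of (Y, N)
assert abs(mutual_information(p_yn)) < 1e-12  # condition 1: I(Y;N) = 0

# Joint table of (X, (Y,N)): rows index x in {0,...,3}, columns index (y,n).
p_x_yn = np.zeros((4, 4))
for y in range(2):
    for n in range(2):
        pair = 2 * y + n              # index of the (y, n) pair
        p_x_yn[2 * y + n, pair] = 0.25

# Condition 2: H(X | Y, N) = H(X, Y, N) - H(Y, N) = 0.
assert abs(entropy(p_x_yn) - entropy(p_yn)) < 1e-12
```

Because $\mathbf{X}$ is a deterministic function of $(\mathbf{Y}, \mathbf{N})$, the joint entropy of $(\mathbf{X}, \mathbf{Y}, \mathbf{N})$ equals that of $(\mathbf{Y}, \mathbf{N})$, which is exactly the second condition.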
4 Principles of Representation Learning: Theoretical Interpretation
4.1 What Is A Good Representation?
In real-world applications, the observation $\mathbf{X}$ is usually complex and lies in a high-dimensional space $\mathcal{X}$, making it hard to directly learn a good classifier for $\mathbf{Y}$. To remedy this curse of dimensionality, it is important to learn a good representation of $\mathbf{X}$, i.e., to learn an encoder $h$ that maps the high-dimensional observation $\mathbf{X}$ into a low-dimensional representation $\mathbf{Z} = h(\mathbf{X})$. We illustrate the process of data generation and representation learning by the probabilistic graphical model shown in Figure 1(a).

An ideal encoder should keep the important information from $\mathbf{X}$ (e.g., label-relevant information) and maximally discard the noise or nuisance in $\mathbf{X}$, such that it is much easier to learn a classifier from $\mathbf{Z}$ than from $\mathbf{X}$. Based on the above intuition, we define an optimal representation of $\mathbf{X}$, which has sufficient information for predicting $\mathbf{Y}$ while retaining little information about the nuisance.
Definition 4.0.1 (Minimal Sufficient Representation (Optimal Representation)).
For a Markov chain $\mathbf{Y} \to \mathbf{X} \to \mathbf{Z}$, we say that a representation $\mathbf{Z}$ of $\mathbf{X}$ is sufficient for $\mathbf{Y}$ if $I(\mathbf{Z}; \mathbf{Y}) = I(\mathbf{X}; \mathbf{Y})$, and $\mathbf{Z}$ is minimal sufficient for $\mathbf{Y}$ if $\mathbf{Z}$ is sufficient and $I(\mathbf{Z}; \mathbf{X}) \le I(\mathbf{Z}'; \mathbf{X})$ for all sufficient $\mathbf{Z}'$.

Remark. By the data processing inequality, we have $I(\mathbf{Z}; \mathbf{X}) \ge I(\mathbf{X}; \mathbf{Y})$ for any sufficient $\mathbf{Z}$. The lower $I(\mathbf{Z}; \mathbf{X})$ is, the more “minimal” the representation is. When $I(\mathbf{Z}; \mathbf{X}) = I(\mathbf{X}; \mathbf{Y})$, the representation is minimal sufficient, which is a desirable property as characterized by many prior works tishby2015deep ; achille2018emergence .
Definition 4.0.1 characterizes how good a sufficient representation is by how much redundant information remains. Recall that $\mathbf{X}$ is a deterministic function of the label $\mathbf{Y}$ and the nuisance $\mathbf{N}$. The redundancy of $\mathbf{Z}$ can thus also be measured by the mutual information $I(\mathbf{Z}; \mathbf{N})$ between $\mathbf{Z}$ and $\mathbf{N}$. Achille et al. achille2018emergence prove that if a representation $\mathbf{Z}$ is sufficient and invariant to the nuisance, i.e., $I(\mathbf{Z}; \mathbf{N}) = 0$, then $\mathbf{Z}$ is also minimal. However, since $\mathbf{N}$ is not observed, it is hard to directly encourage the representation to be invariant to it.

Can we learn a minimal sufficient representation in a principled way? Inspired by the recent success of data augmentation techniques in self-supervised and semi-supervised learning, we find that data augmentation can implicitly encourage the representation to be invariant to the nuisance $\mathbf{N}$. However, most augmentation methods are driven by predefined transformations, which do not necessarily render a minimal sufficient representation. In the next section, we analyze the effects of data augmentation on representation learning in detail.
4.2 Proper Data Augmentation Leads to (Near-)Optimal Representation
In this section, we investigate the role of data augmentation in learning good representations. We first make the following mild assumption on the underlying relationship between $\mathbf{X}$ and $\mathbf{Y}$.
Assumption 4.1.
There exists a deterministic function $c: \mathcal{X} \to \mathcal{Y}$ such that $\mathbf{Y} = c(\mathbf{X})$, i.e., $H(\mathbf{Y} \mid \mathbf{X}) = 0$.
Assumption 4.1 requires that there exist a “perfect classifier” that identifies the label of the observation with no error, which is common in practice. Note that for data with ambiguity, a tie-breaker can be used to map each observation to a unique label. Therefore, Assumption 4.1 is realistic.
Let $a$ be a deterministic augmentation function such that $\mathbf{X}' = a(\mathbf{X}, \mathbf{T})$ is the augmented data, where $\mathbf{T}$ is a random variable denoting the augmentation selection. For example, if $x$ is an image sample and $t$ selects the augmentation “rotate by 90 degrees”, then $x' = a(x, t)$ is the corresponding rotated image. We learn an encoder $h$ that maps the augmented data to a representation $\mathbf{Z} = h(\mathbf{X}')$. With this augmentation process, the graphical model in Figure 1(a) is updated to Figure 1(b).

We show in the theorem below that if the augmentation process preserves the information of $\mathbf{Y}$, then $\mathbf{Z}$ can be sufficient for $\mathbf{Y}$. Furthermore, if the augmented data contain no information about the original nuisance $\mathbf{N}$, then $\mathbf{Z}$ will be invariant to $\mathbf{N}$ and thus become a minimal sufficient representation.
Theorem 4.2.
Consider the label variable $\mathbf{Y}$, observation variable $\mathbf{X}$ and nuisance variable $\mathbf{N}$ satisfying Assumption 4.1. Let $\mathbf{T}$ be the augmentation variable, $\mathbf{X}' = a(\mathbf{X}, \mathbf{T})$ be the augmented data, and $h^{*}$ be the solution to

(1)   $\max_{h} \; I(h(\mathbf{X}'); \mathbf{X}')$
      subject to $\mathbf{Z} = h(\mathbf{X}')$.

Then, $\mathbf{Z} = h^{*}(\mathbf{X}')$ is a minimal sufficient representation of $\mathbf{X}'$ for the label $\mathbf{Y}$ if the following conditions hold:

Condition (a): $I(\mathbf{X}'; \mathbf{Y}) = I(\mathbf{X}; \mathbf{Y})$ ($\mathbf{X}'$ is an in-class augmentation) and

Condition (b): $I(\mathbf{X}'; \mathbf{N}) \le \epsilon$ for a small $\epsilon \ge 0$ ($\mathbf{X}'$ does not retain much information about $\mathbf{N}$).
Remarks.
(1) The objective of learning can be either task-independent (maximizing $I(\mathbf{Z}; \mathbf{X}')$) or task-dependent (maximizing $I(\mathbf{Z}; \mathbf{Y})$). The former matches the “InfoMax” principle commonly used in self-supervised learning works linsker1988self ; hjelm2018learning , while the latter can be achieved by supervised training (e.g., learning a classifier of $\mathbf{Y}$ from $\mathbf{Z}$ with the cross-entropy loss).

(2) When Condition (b) holds with $\epsilon = 0$, the representation $\mathbf{Z}$ is optimal (minimal sufficient).

Theorem 4.2, proved in Section B.1, shows that if we have a good augmentation that maximally perturbs the label-irrelevant information while keeping the label-relevant information, then the representation learned on the augmented data can be minimal sufficient. Theorem 4.2 thus serves as a principle for constructing augmentations. Based on this principle, we propose an auto-augmentation algorithm in Section 5 and verify it on a wide range of tasks in Section 6.
5 Proposed Methods
In this section, we introduce our data augmentation and how to obtain it using the representation-learning network $f_{\theta}$. We then show how to plug our augmentation into the representation-learning procedure of $f_{\theta}$.
5.1 Label-Preserving Adversarial Auto-Augment (LP-A3)
As illustrated in the previous section, an ideal data augmentation $\mathbf{X}'$ for representation learning should contain as little information about the nuisance $\mathbf{N}$ as possible while still keeping all the information about the class $\mathbf{Y}$. Since $\mathbf{N}$ is not observed, we transfer the objective $I(\mathbf{X}'; \mathbf{N})$ into $I(\mathbf{X}'; \mathbf{X})$, since $\mathbf{X}$ is determined by $(\mathbf{Y}, \mathbf{N})$ and $I(\mathbf{X}'; \mathbf{Y})$ is a constant under the constraint $I(\mathbf{X}'; \mathbf{Y}) = I(\mathbf{X}; \mathbf{Y})$. Thus the optimization problem is:

(2)   $\min_{\mathbf{X}'} \; I(\mathbf{X}'; \mathbf{X}) \quad \text{subject to} \quad I(\mathbf{X}'; \mathbf{Y}) = I(\mathbf{X}; \mathbf{Y}).$
Implementation of Mutual Information. To solve Equation 2, we must compute the mutual information terms $I(\mathbf{X}'; \mathbf{X})$, $I(\mathbf{X}'; \mathbf{Y})$ and $I(\mathbf{X}; \mathbf{Y})$. Next, we show how to compute these terms using a neural-net classifier $f_{\theta}$, parameterized by $\theta$, that consists of two components: a representation encoder $h$ and a predictor $g$. Specifically, $f_{\theta} = g \circ h$, where the encoder $h$ maps an input $x$ into a representation $z$, and the predictor $g$ predicts the class of $z$. This is demonstrated in Figure 2.
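To make the decomposition concrete, here is a minimal numpy sketch of the split $f_{\theta} = g \circ h$, together with the entropy of the softmax output that serves as the conditional-entropy estimate; the weight matrices and layer sizes are arbitrary placeholders, not the architecture used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x, w_enc):
    """Encoder: maps input x to a low-dimensional representation z."""
    return np.tanh(x @ w_enc)

def g(z, w_pred):
    """Predictor: maps representation z to softmax class probabilities."""
    logits = z @ w_pred
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def f(x, w_enc, w_pred):
    """Full classifier f_theta = g(h(x))."""
    return g(h(x, w_enc), w_pred)

# Hypothetical sizes: 8-dim inputs, 4-dim representations, 3 classes.
w_enc = rng.normal(size=(8, 4))
w_pred = rng.normal(size=(4, 3))
x = rng.normal(size=(5, 8))
probs = f(x, w_enc, w_pred)

# The softmax output approximates p(y | x); the per-example entropy of
# that distribution approximates H(Y | X = x), as used in the constraint.
cond_ent = -np.sum(probs * np.log(probs + 1e-12), axis=-1)
```

The same computation applied to an augmented batch gives the estimate of $H(\mathbf{Y} \mid \mathbf{X}' = x')$.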
Constraint implementation. Since $I(\mathbf{X}'; \mathbf{Y}) = H(\mathbf{Y}) - H(\mathbf{Y} \mid \mathbf{X}')$ and $I(\mathbf{X}; \mathbf{Y}) = H(\mathbf{Y}) - H(\mathbf{Y} \mid \mathbf{X})$, we can remove the term $H(\mathbf{Y})$ on both sides and turn the constraint into $H(\mathbf{Y} \mid \mathbf{X}') = H(\mathbf{Y} \mid \mathbf{X})$. Thus we only need to compute the conditional entropy of $\mathbf{Y}$ given $\mathbf{X}'$ or $\mathbf{X}$, which can be approximated through the neural-net classifier: $H(\mathbf{Y} \mid \mathbf{X} = x) \approx -\sum_{y} f_{\theta}(x)_{y} \log f_{\theta}(x)_{y}$, where we use the softmax class probability $f_{\theta}(x)_{y}$ to approximate the likelihood $p(y \mid x)$. $H(\mathbf{Y} \mid \mathbf{X}' = x')$ can be computed similarly.

Objective implementation. We next show how to compute the objective $I(\mathbf{X}'; \mathbf{X}) = H(\mathbf{X}') - H(\mathbf{X}' \mid \mathbf{X})$. Since $H(\mathbf{X}')$ is not related to $\mathbf{X}$ and thus can be neglected, we only need to compute $H(\mathbf{X}' \mid \mathbf{X})$, i.e., the data likelihood $p(x' \mid x)$. We use the Learned Perceptual Image Patch Similarity (LPIPS) zhang2018unreasonable between $x$ and $x'$ to estimate this data likelihood, since the LPIPS distance is a widely used metric for data similarity in generative modeling johnson2016perceptual ; zhang2020cross , and many previous works have shown that the LPIPS distance is the best surrogate for human judgments of similarity zhang2018unreasonable ; laidlaw2020perceptual , compared with other distances such as the $\ell_{2}$ and $\ell_{\infty}$ distances. Although such a surrogate may have error, it is worth noting that Theorem 4.2 allows the surrogate to have error. The LPIPS distance is defined by the $\ell_{2}$ distance between stacked feature maps of a neural network; here we use $f_{\theta}$ to compute it. Let $f_{\theta}$ have $L$ layers and $\hat{h}^{l}(x)$ denote the channel-normalized activations at the $l$-th layer of the network. The activations are then normalized again by layer size and flattened into a single vector $\phi(x) \triangleq \big( \hat{h}^{1}(x)/\sqrt{w_{1} h_{1}}, \ldots, \hat{h}^{L}(x)/\sqrt{w_{L} h_{L}} \big)$, where $w_{l}$ and $h_{l}$ are the width and height of the activations in layer $l$, respectively. The LPIPS distance between the input $x$ and the augmentation $x'$ is then defined as:

(3)   $d(x, x') = \left\| \phi(x) - \phi(x') \right\|_{2}.$
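The LPIPS-style distance above can be sketched as follows; the function name, the layer shapes, and the random activations are illustrative stand-ins for the real network features:

```python
import numpy as np

def lpips_like_distance(acts_x, acts_x_aug):
    """LPIPS-style distance from per-layer activations (lists of arrays
    shaped (C, W, H)): channel-normalize each layer, rescale by layer
    size, flatten, then take the L2 norm of the stacked difference."""
    diffs = []
    for a, b in zip(acts_x, acts_x_aug):
        # Unit-normalize along the channel dimension.
        a = a / (np.linalg.norm(a, axis=0, keepdims=True) + 1e-10)
        b = b / (np.linalg.norm(b, axis=0, keepdims=True) + 1e-10)
        _, w, h = a.shape
        diffs.append(((a - b) / np.sqrt(w * h)).ravel())
    return float(np.linalg.norm(np.concatenate(diffs)))

# Random stand-in activations for a two-layer network.
rng = np.random.default_rng(0)
acts = [rng.normal(size=(16, 8, 8)), rng.normal(size=(32, 4, 4))]
assert lpips_like_distance(acts, acts) == 0.0  # identical inputs -> zero
```

In the actual method, the activations would come from the intermediate layers of the model-in-training rather than from random arrays.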
Constraint Relaxation for Efficiency. Now, given an input $x$, its data augmentation $x'$ can be computed by solving the following optimization problem using the neural network $f_{\theta}$:

(4)   $\max_{x'} \; d(x, x') \quad \text{subject to} \quad H(\mathbf{Y} \mid \mathbf{X}' = x') = H(\mathbf{Y} \mid \mathbf{X} = x).$

The equality constraint in Equation 4 is too strict to solve, since it may be inefficient to search for an $x'$ that exactly satisfies it. Thus we relax the constraint with a small margin $\epsilon > 0$ and change it into $\left| H(\mathbf{Y} \mid \mathbf{X}' = x') - H(\mathbf{Y} \mid \mathbf{X} = x) \right| \le \epsilon$. It is worth noting that if $\epsilon$ is sufficiently small, the label is still well preserved. There is a trade-off in the value of $\epsilon$: we search for a sweet spot where the problem is practical to solve while the label remains well preserved.
There are many off-the-shelf methods that solve Equation 4; here we apply the Fast Lagrangian Attack Method laidlaw2020perceptual as a demonstration. We initialize $x'$ by $x$ plus a uniform noise, and we find the optimal $x'$ by optimizing the following Lagrangian relaxation while gradually scheduling the value of the multiplier $\lambda$:

(5)   $\max_{x'} \; d(x, x') - \lambda \max\!\left(0, \; \left| H(\mathbf{Y} \mid \mathbf{X}' = x') - H(\mathbf{Y} \mid \mathbf{X} = x) \right| - \epsilon \right).$

The detailed procedure is given in Algorithm 2 in the Appendix. The algorithm has a similar form to adversarial attacks zhang2020principal ; yang2021class in that both find an optimum by adding perturbations to the original image $x$. However, the difference is that we aim to generate a hard augmentation that preserves the label, while an adversarial attack aims to change the class label.
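As an illustration of this Lagrangian scheme, the following toy sketch runs gradient ascent on distance minus a (squared) hinge penalty with an increasing multiplier. Everything here is a stand-in: `distance_fn` replaces the LPIPS distance, `label_dev_fn` replaces the conditional-entropy deviation, and gradients are finite differences purely for illustration instead of backpropagation through the network:

```python
import numpy as np

def lp_a3_augment(x, distance_fn, label_dev_fn, eps=0.05,
                  steps=50, lr=0.1, lam_max=10.0, seed=0):
    """Toy sketch of the Lagrangian relaxation: gradient ascent on
    distance(x, x') - lam * max(0, label_dev(x') - eps)**2, with the
    multiplier lam increased on a linear schedule."""
    rng = np.random.default_rng(seed)
    x_aug = x + rng.uniform(-0.01, 0.01, size=x.shape)  # x plus uniform noise
    for lam in np.linspace(0.1, lam_max, steps):
        def objective(v):
            return distance_fn(x, v) - lam * max(0.0, label_dev_fn(v) - eps) ** 2
        grad = np.zeros_like(x_aug)
        for i in range(x_aug.size):          # finite-difference gradient
            d = np.zeros_like(x_aug)
            d.flat[i] = 1e-4
            grad.flat[i] = (objective(x_aug + d) - objective(x_aug - d)) / 2e-4
        x_aug = x_aug + lr * grad            # ascent step
    return x_aug

# Hypothetical toy problem: coordinate 0 carries the "label" information,
# while the remaining coordinates are nuisance the augmentation may perturb.
x0 = np.zeros(3)
x_aug = lp_a3_augment(x0,
                      distance_fn=lambda a, b: np.linalg.norm(a - b),
                      label_dev_fn=lambda v: abs(v[0]))
```

On this toy problem the optimizer drives the nuisance coordinates far from the origin (maximizing distance) while the penalty pins coordinate 0 near its original value, mirroring the "hard positive that preserves the label" behavior.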
5.2 Plugging LpA3 into a Representation Learning Task
One primary advantage of LP-A3 is that it only requires a neural network $f_{\theta}$ to produce the augmentation, and $f_{\theta}$ can be the current representation-learning model, so we can plug LP-A3 into any representation-learning procedure with no additional parameters: it is plug-and-play and parameter-free. At each step, we first fix $f_{\theta}$ to generate the augmentation $x'$ by solving Equation 5 using Algorithm 2, and then we train $f_{\theta}$ by running the original representation-learning algorithm on our augmentation $x'$.
Data selection. It is not necessary to find hard positives for every sample. To save computation, we apply a sharpness-aware criterion, i.e., time-consistency (TCS) zhou2020time , to select the most informative data (the data with the lowest TCS in Algorithm 1), which have sharp loss landscapes indicating the existence of nearby hard positives, and we apply LP-A3 only to them. This reduces the computational cost without degrading performance because (1) the improvement brought by augmentations is limited for examples whose loss has already reached a flat minimum, while the model does not generalize well near examples with a sharp loss landscape; and (2) the hard positives for examples with flat loss landscapes are distant from the original ones and might introduce extra bias into training.

LP-A3 is compatible with any representation-learning task minimizing a loss $\mathcal{L}(x, y; \theta)$, which takes in a data batch and a model to output a loss value. Here $y$ denotes the ground-truth label for labeled data and the pseudo label for unlabeled data. The pseudocode for plugging LP-A3 into the representation-learning procedure with TCS-based data selection is provided in Algorithm 1.
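The TCS-based selection can be sketched as follows. This is only an illustrative proxy: `time_consistency` here scores each sample by the negated average KL divergence between its predicted class distributions at consecutive checkpoints, which stands in for (and may differ from) the exact TCS definition of zhou2020time:

```python
import numpy as np

def time_consistency(pred_history):
    """Proxy TCS score per sample: negated mean KL divergence between
    predictions at consecutive steps. pred_history has shape (T, N, C)
    (T checkpoints, N samples, C classes); higher = more consistent."""
    p, q = pred_history[:-1], pred_history[1:]
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)  # (T-1, N)
    return -kl.mean(axis=0)                                            # (N,)

def select_for_augmentation(pred_history, ratio=0.5):
    """Indices of the `ratio` fraction of samples with the LOWEST TCS
    (sharpest loss landscape); only these receive the augmentation."""
    tcs = time_consistency(pred_history)
    k = max(1, int(ratio * tcs.shape[0]))
    return np.argsort(tcs)[:k]

# Toy history: sample 0 flips its prediction across checkpoints (low TCS),
# sample 1 is perfectly stable (high TCS).
hist = np.array([[[0.9, 0.1], [0.9, 0.1]],
                 [[0.1, 0.9], [0.9, 0.1]],
                 [[0.9, 0.1], [0.9, 0.1]]])
selected = select_for_augmentation(hist, ratio=0.5)  # picks sample 0
```

With half the data selected, only the unstable sample would be passed to the augmentation step, matching the intent of applying LP-A3 where the loss landscape is sharp.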
6 Experiments
In this section, we apply LP-A3 as a data augmentation method to several popular methods for three different learning tasks: (1) semi-supervised classification; (2) noisy-label learning; and (3) medical image classification. In all the experiments, LP-A3 (1) consistently improves the convergence and test accuracy of existing methods and (2) autonomously produces augmentations that bring nontrivial improvement even when no domain knowledge is available. A wall-clock time comparison is given in the Appendix, showing that LP-A3 effectively reduces the computational cost. In addition, we conduct a thorough sensitivity study of LP-A3 by varying (1) the label-preserving margin $\epsilon$ and (2) the data selection ratio on the three tasks. More experimental details can be found in the Appendix.
6.1 Medical Image Augmentations Produced by LP-A3 vs. RandAugment
We visualize data augmentations generated by LP-A3 and RandAugment RandAugment on the test set of DermaMNIST medmnistv2 with a ResNet-18 classifier, together with its confidence on the ground-truth class, in Fig. 3. We also use Score-CAM wang2020score as an interpretation method to highlight the area in each image that the classifier relies on to make the prediction. We find that LP-A3 preserves the relevant derma areas highlighted by Score-CAM, and they are consistent with those in the original image. On the contrary, RandAugment changes the color of or occludes those derma areas, resulting in highly different Score-CAM heatmaps and hence wrong predictions (red bounding boxes in Fig. 3). In contrast, LP-A3 preserves the class information and mainly perturbs the class-unrelated areas in the original image.
6.2 Applying LpA3 to Three Different Representation Learning Tasks
Here we apply LP-A3 to three different tasks by plugging it into existing baselines for each task. Fig. 4 shows that LP-A3 greatly speeds up the convergence of each baseline.
Semi-supervised learning. To evaluate how LP-A3 improves learning without sufficient labeled data, we conduct experiments on semi-supervised classification on standard benchmarks including CIFAR krizhevsky2009learning and STL-10 coates2011analysis , where only a very small number of labels are revealed. We apply LP-A3 to FixMatch sohn2020fixmatch and compare it with the original FixMatch and with InfoMin tian2020makes , a learnable augmentation method for semi-supervised learning. The results are reported in Table 1, where LP-A3 consistently improves FixMatch, and the improvement becomes more significant as the labeled data are reduced. It is worth noting that the original FixMatch already employs a carefully designed set of predefined augmentations cubuk2020randaugment tuned to achieve the best performance, indicating that LP-A3 is complementary to existing data augmentations. Moreover, LP-A3 also outperforms InfoMin by a large margin, which indicates that LP-A3 is also superior to existing learnable augmentations.

Dataset  CIFAR-10  CIFAR-100  STL-10
# Label  40  250  4000  400  2500  10000  1000
InfoMin (RGB) tian2020makes  –  –  –  –  –  –  86.0
InfoMin (YDbDr) tian2020makes  –  –  –  –  –  –  87.0
FixMatch sohn2020fixmatch  89.51±3.14  93.81±0.29  94.66±0.13  49.30±2.45  67.21±0.94  74.31±0.35  91.59±0.16
FixMatch sohn2020fixmatch + LP-A3  92.39±1.21  94.03±0.31  95.11±0.17  56.16±1.82  72.23±0.57  77.11±0.16  92.63±0.14
Noisy-label Learning. Data augmentation is critical to noisy-label learning, as it provides different views of the data to prevent neural nets from overfitting to noisy labels. We apply LP-A3 to two state-of-the-art methods, DivideMix Li2020DivideMix and PES bai2021understanding , on CIFAR with different ratios of noisy labels. LP-A3 consistently improves the performance of these two SoTA methods, and the improvement is more significant in more challenging cases with higher noise ratios, e.g., on CIFAR-100 with 90% of the labels noisy, LP-A3 improves PES considerably (Table 2).
Dataset  CIFAR-10  CIFAR-100
Noise Ratio  50%  80%  90%  50%  80%  90%
Mixup zhang2017mixup  87.1  71.6  52.2  57.3  30.8  14.6
P-correction yi2019probabilistic  88.7  76.5  58.2  56.4  20.7  8.8
M-correction arazo2019unsupervised  88.8  76.1  58.3  58.0  40.1  14.3
DivideMix Li2020DivideMix  94.4  92.9  75.4  74.2  59.6  31.0
DivideMix + LP-A3  94.89±0.05  93.70±0.19  79.35±1.33  74.12±0.23  61.00±0.34  32.55±0.25
PES bai2021understanding  94.89±0.12  92.15±0.23  84.98±0.36  74.19±0.23  61.47±0.38  21.15±3.15
PES + LP-A3  95.10±0.14  93.26±0.21  87.71±0.36  74.57±0.25  62.98±0.49  40.61±1.10
Method  PathMNIST  DermaMNIST  TissueMNIST  BloodMNIST
ResNet-18  94.34±0.18  76.14±0.09  68.28±0.17  96.81±0.19
ResNet-18 + RandAugment  93.52±0.09  73.71±0.33  62.03±0.14  95.00±0.21
ResNet-18 + LP-A3  94.42±0.24  76.22±0.27  68.63±0.14  96.97±0.06
ResNet-50  94.47±0.38  75.24±0.27  69.69±0.23  96.91±0.06
ResNet-50 + RandAugment  94.02±0.37  71.65±0.30  65.13±0.33  95.14±0.06
ResNet-50 + LP-A3  94.57±0.07  75.71±0.22  69.89±0.08  97.01±0.32

Method  OctMNIST  OrganAMNIST  OrganCMNIST  OrganSMNIST
ResNet-18  78.67±0.26  94.21±0.09  91.81±0.12  81.57±0.07
ResNet-18 + RandAugment  76.00±0.24  94.18±0.20  91.38±0.14  80.52±0.32
ResNet-18 + LP-A3  80.27±0.54  94.73±0.21  92.41±0.22  82.28±0.38
ResNet-50  78.37±0.52  94.31±0.14  91.80±0.14  81.11±0.21
ResNet-50 + RandAugment  76.63±0.58  94.59±0.17  91.10±0.12  80.47±0.37
ResNet-50 + LP-A3  79.40±0.36  94.95±0.19  92.16±0.23  82.15±0.08

All models are trained for 100 epochs. Error bars (mean and std) are computed over three random trials.
Medical Image Classification. To evaluate performance in specialized areas without domain knowledge, we compare LP-A3 with existing data augmentations on medical image classification tasks from MedMNIST medmnistv2 , which is composed of several sub-datasets with various styles of medical images. We compare LP-A3 with RandAugment cubuk2020randaugment when training ResNet-18 and ResNet-50 he2016deep . We report the results in Table 3, where RandAugment, designed for natural images, fails to improve performance in this scenario. In contrast, LP-A3, which does not rely on any domain knowledge, brings improvement to all the datasets, with the largest gain on OctMNIST. The results indicate that handcrafted strong data augmentations do not generalize to all domains, but LP-A3 can autonomously produce augmentations guided by our representation-learning principle without relying on any domain knowledge.
6.3 Sensitivity Analysis of Hyperparameters
Label-preserving margin $\epsilon$: We evaluate how LP-A3 performs with different label-preserving margins $\epsilon$ on the three tasks. The results are presented in Fig. 5, where a reverse U-shape is observed, and LP-A3 outperforms the baselines for all evaluated values of $\epsilon$, which indicates that LP-A3 is robust to $\epsilon$.

Data selection ratio: We evaluate the performance of LP-A3 with different amounts of data selected on the three tasks. As shown in Fig. 5, selecting all the data does not perform best, since for some examples applying data augmentation is useless. Moreover, selecting only a fraction of the data to apply LP-A3 can outperform all baselines by a large margin, especially on MedMNIST, which verifies the effectiveness of LP-A3 and our data selection method.
7 Conclusion
In this paper, we study how to automatically generate domain-agnostic but task-informed data augmentations. We first investigate the conditions required for augmentations that lead to representations preserving the task (label) information, and then derive an optimization objective for the augmentations. For practicality, we further propose a surrogate of the derived objective that can be efficiently computed from the intermediate-layer representations of the model-in-training. The surrogate is built upon data likelihood estimation through perceptual distance. This leads to LP-A3, a general and autonomous data augmentation technique applicable to a variety of machine learning tasks, such as supervised, semi-supervised and noisy-label learning. In experiments, we demonstrate that LP-A3 consistently brings improvement to SoTA methods for different tasks even without domain knowledge. In future work, we will extend LP-A3 to more learning tasks and further improve its efficiency.
Acknowledgements
This work was supported by the Major Science and Technology Innovation 2030 “New Generation Artificial Intelligence” key project under Grant 2021ZD0111700, NSFC No. 61872329 and No. 62222117, and the Fundamental Research Funds for the Central Universities under contract WK3490000005. Huang, Sun and Su are supported by the NSF-IIS-FAI program, DOD-ONR (Office of Naval Research), and DOD-DARPA (Defense Advanced Research Projects Agency) Guaranteeing AI Robustness against Deception (GARD). Huang is also supported by Adobe, Capital One and JP Morgan faculty fellowships.
References
 [1] Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1):1947–1980, 2018.
 [2] Antreas Antoniou, Amos Storkey, and Harrison Edwards. Data augmentation generative adversarial networks, 2017.
 [3] Eric Arazo, Diego Ortego, Paul Albert, Noel O’Connor, and Kevin McGuinness. Unsupervised label noise modeling and loss correction. In International conference on machine learning, pages 312–321. PMLR, 2019.
 [4] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. Advances in neural information processing systems, 32, 2019.
 [5] Yingbin Bai, Erkun Yang, Bo Han, Yanhua Yang, Jiatong Li, Yinian Mao, Gang Niu, and Tongliang Liu. Understanding and improving early stopping for learning with noisy labels. Advances in Neural Information Processing Systems, 34, 2021.
 [6] Randall Balestriero, Leon Bottou, and Yann LeCun. The effects of regularization and data augmentation are class dependent, 2022.
 [7] David Berthelot, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, and Colin Raffel. Remixmatch: Semisupervised learning with distribution matching and augmentation anchoring. In International Conference on Learning Representations (ICLR), 2020.
 [8] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. Mixmatch: A holistic approach to semisupervised learning. In Advances in Neural Information Processing Systems 32 (NeurIPS), pages 5050–5060. 2019.
 [9] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
 [10] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of singlelayer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223. JMLR Workshop and Conference Proceedings, 2011.
 [11] Thomas M Cover and Joy A Thomas. Information theory and statistics. Elements of information theory, 1(1):279–335, 1991.
 [12] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. Randaugment: Practical automated data augmentation with a reduced search space, 2019.

 [13] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020.
 [14] Ekin Dogus Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. Autoaugment: Learning augmentation policies from data. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
 [15] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Coteaching: Robust training of deep neural networks with extremely noisy labels. In Advances in neural information processing systems, pages 8527–8537, 2018.
 [16] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
 [17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [18] R Devon Hjelm, Alex Fedorov, Samuel LavoieMarchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, 2018.
 [19] ChihHui Ho and Nuno Vasconcelos. Contrastive learning with adversarial examples. Advances in Neural Information Processing Systems, 33:17081–17093, 2020.
 [20] Lu Jiang, Zhengyuan Zhou, Thomas Leung, LiJia Li, and Li FeiFei. Mentornet: Learning datadriven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning, pages 2304–2313, 2018.

 [21] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pages 694–711. Springer, 2016.
 [22] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
 [23] Alex Kurakin, ChunLiang Li, Colin Raffel, David Berthelot, Ekin Dogus Cubuk, Han Zhang, Kihyuk Sohn, Nicholas Carlini, and Zizhao Zhang. Fixmatch: Simplifying semisupervised learning with consistency and confidence. In NeurIPS, 2020.
 [24] Cassidy Laidlaw, Sahil Singla, and Soheil Feizi. Perceptual adversarial robustness: Defense against unseen threat models. arXiv preprint arXiv:2006.12655, 2020.
 [25] DongHyun Lee. Pseudolabel : The simple and efficient semisupervised learning method for deep neural networks. In ICML Workshop on Challenges in Representation Learning, 2013.
 [26] Junnan Li, Richard Socher, and Steven CH Hoi. Dividemix: Learning with noisy labels as semisupervised learning. arXiv preprint arXiv:2002.07394, 2020.
 [27] Ralph Linsker. Selforganization in a perceptual network. Computer, 21(3):105–117, 1988.
 [28] Tongliang Liu and Dacheng Tao. Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38:447–461, 2016.
 [29] Ishan Misra and Laurens van der Maaten. Selfsupervised learning of pretextinvariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6707–6717, 2020.
 [30] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
 [31] Luis Perez and Jason Wang. The effectiveness of data augmentation in image classification using deep learning. ArXiv, abs/1712.04621, 2017.
 [32] Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596, 2014.
 [33] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semisupervised learning. In Advances in Neural Information Processing Systems 29 (NeurIPS), pages 1163–1171. 2016.
 [34] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semisupervised learning. In Advances in Neural Information Processing Systems 29 (NeurIPS), pages 1163–1171. 2016.
 [35] Kihyuk Sohn. Improved deep metric learning with multiclass npair loss objective. Advances in neural information processing systems, 29, 2016.
 [36] Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and ChunLiang Li. Fixmatch: Simplifying semisupervised learning with consistency and confidence. Advances in Neural Information Processing Systems, 33:596–608, 2020.
 [37] Aravind Srinivas, Michael Laskin, and Pieter Abbeel. Curl: Contrastive unsupervised representations for reinforcement learning. arXiv preprint arXiv:2004.04136, 2020.
 [38] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weightaveraged consistency targets improve semisupervised deep learning results. In Advances in Neural Information Processing Systems 30 (NeurIPS), pages 1195–1204. 2017.
 [39] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In European conference on computer vision, pages 776–794. Springer, 2020.
 [40] Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning? Advances in Neural Information Processing Systems, 33:6827–6839, 2020.
 [41] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
 [42] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE, 2015.
 [43] Michael Tschannen, Josip Djolonga, Paul K Rubenstein, Sylvain Gelly, and Mario Lucic. On mutual information maximization for representation learning. In International Conference on Learning Representations, 2019.

 [44] Haofan Wang, Zifan Wang, Mengnan Du, Fan Yang, Zijian Zhang, Sirui Ding, Piotr Mardziel, and Xia Hu. Score-CAM: Score-weighted visual explanations for convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 24–25, 2020.
 [45] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pages 9929–9939. PMLR, 2020.
 [46] Mike Wu, Chengxu Zhuang, Daniel Yamins, and Noah Goodman. On the importance of views in unsupervised representation learning. preprint, 3, 2020.
 [47] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via nonparametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742, 2018.
 [48] Cihang Xie, Mingxing Tan, Boqing Gong, Jiang Wang, Alan L Yuille, and Quoc V Le. Adversarial examples improve image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 819–828, 2020.
 [49] Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, and Bingbing Ni. Medmnist v2: A largescale lightweight benchmark for 2d and 3d biomedical image classification, 2021.
 [50] Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, and Bingbing Ni. Medmnist v2: A largescale lightweight benchmark for 2d and 3d biomedical image classification. arXiv preprint arXiv:2110.14795, 2021.
 [51] Kaiwen Yang, Tianyi Zhou, Xinmei Tian, and Dacheng Tao. Identitydisentangled adversarial augmentation for selfsupervised learning. In International Conference on Machine Learning, pages 25364–25381. PMLR, 2022.
 [52] Kaiwen Yang, Tianyi Zhou, Yonggang Zhang, Xinmei Tian, and Dacheng Tao. Classdisentanglement and applications in adversarial detection and defense. Advances in Neural Information Processing Systems, 34:16051–16063, 2021.
 [53] Mang Ye, Xu Zhang, Pong C Yuen, and ShihFu Chang. Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6210–6219, 2019.
 [54] Kun Yi and Jianxin Wu. Probabilistic endtoend noise correction for learning with noisy labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7017–7025, 2019.
 [55] Bowen Zhang, Yidong Wang, Wenxin Hou, Hao Wu, Jindong Wang, Manabu Okumura, and Takahiro Shinozaki. Flexmatch: Boosting semisupervised learning with curriculum pseudo labeling. In NeurIPS, 2021.
 [56] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David LopezPaz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
 [57] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David LopezPaz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.
 [58] Pan Zhang, Bo Zhang, Dong Chen, Lu Yuan, and Fang Wen. Crossdomain correspondence learning for exemplarbased image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5143–5153, 2020.

 [59] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
 [60] Xinyu Zhang, Qiang Wang, Jian Zhang, and Zhao Zhong. Adversarial autoaugment. arXiv preprint arXiv:1912.11188, 2019.
 [61] Yonggang Zhang, Xinmei Tian, Ya Li, Xinchao Wang, and Dacheng Tao. Principal component adversarial example. IEEE Transactions on Image Processing, 29:4804–4815, 2020.
 [62] Nanxuan Zhao, Zhirong Wu, Rynson WH Lau, and Stephen Lin. Distilling localization for selfsupervised representation learning. In 35th AAAI Conference on Artificial Intelligence (AAAI21), pages 10990–10998. AAAI Press, 2021.
 [63] Tianyi Zhou, Shengjie Wang, and Jeff Bilmes. Timeconsistent selfsupervision for semisupervised learning. In International Conference on Machine Learning, pages 11523–11533. PMLR, 2020.
 [64] Tianyi Zhou, Shengjie Wang, and Jeff Bilmes. Robust curriculum learning: from clean label detection to noisy label selfcorrection. In International Conference on Learning Representations, 2021.

 [65] Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6002–6012, 2019.
Checklist

For all authors…

Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

Did you describe the limitations of your work?

Did you discuss any potential negative societal impacts of your work?

Have you read the ethics review guidelines and ensured that your paper conforms to them?


If you ran experiments…

Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? The code will be open-sourced later.

Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? See Appendix C.
Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)?

Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)?


If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

If your work uses existing assets, did you cite the creators?

Did you mention the license of the assets?

Did you include any new assets either in the supplemental material or as a URL?

Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?


If you used crowdsourcing or conducted research with human subjects…

Did you include the full text of instructions given to participants and screenshots, if applicable?

Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Supplementary Material
Appendix A Algorithmic Details
a.1 Data Selection via Time-Consistency
We use time-consistency (TCS) [63] to select informative samples to apply our augmentation to; TCS computes the consistency of the output distribution for each sample along the training procedure. Specifically, the TCS metric for an individual sample $x$ is a negative exponential moving average of the prediction inconsistency $c_t(x)$ over the training history before step $t$:

(6) $\mathrm{TCS}_t(x) = \gamma\,\mathrm{TCS}_{t-1}(x) - (1-\gamma)\,c_t(x)$

(7) $c_t(x) = \mathrm{KL}\big(P(\hat{y} \mid x; \theta_{t-1}) \,\|\, P(\hat{y} \mid x; \theta_t)\big)$

where $\mathrm{KL}(\cdot\|\cdot)$ is the Kullback–Leibler divergence, $\hat{y}$ is the pseudo label (for unlabeled data) or the real label (for labeled data) of $x$ at step $t$, and $\gamma$ is a discount factor. Intuitively, the KL divergence between output distributions measures how consistent the output is between two consecutive steps, and a moving average of it naturally quantifies the inconsistency of $x$ over time; a larger $\mathrm{TCS}_t(x)$ means better time-consistency. We select the top samples with the lowest TCS to apply our data augmentation to, because samples with small TCS tend to have sharp loss landscapes. These samples provide more informative gradients than others, and applying our model-adaptive data augmentations brings more improvement to their representation invariance and loss smoothness. In Fig. 5, we conduct a thorough sensitivity analysis over the three tasks and find that sample selection with TCS effectively improves performance. Moreover, in this way, we do not need to apply our augmentation to every training sample, which saves training cost.
a.2 Fast Lagrangian Attack Method
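To make the selection procedure of Section A.1 concrete, here is a minimal sketch of the TCS bookkeeping; the class name TimeConsistency, the buffer layout, and the default gamma=0.9 are our own illustrative choices under the EMA-of-KL description above, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def kl_consistency(p_curr, p_prev, eps=1e-8):
    """KL(p_prev || p_curr) between output distributions at consecutive steps."""
    return (p_prev * ((p_prev + eps) / (p_curr + eps)).log()).sum(dim=-1)

class TimeConsistency:
    """Running (discounted) per-sample time-consistency score.

    Larger scores mean a sample's prediction has been more stable during
    training; the lowest-scoring samples are the ones selected for augmentation.
    """

    def __init__(self, num_samples, num_classes, gamma=0.9):
        self.gamma = gamma
        # start from a uniform previous distribution and zero scores
        self.prev_probs = torch.full((num_samples, num_classes), 1.0 / num_classes)
        self.tcs = torch.zeros(num_samples)

    def update(self, indices, logits):
        """Update TCS for the mini-batch samples at `indices` from their logits."""
        probs = F.softmax(logits.detach(), dim=-1)
        incons = kl_consistency(probs, self.prev_probs[indices])
        # negative exponential moving average of the inconsistency
        self.tcs[indices] = self.gamma * self.tcs[indices] - (1 - self.gamma) * incons
        self.prev_probs[indices] = probs

    def select(self, ratio):
        """Indices of the `ratio` fraction of samples with the lowest TCS."""
        k = max(1, int(ratio * self.tcs.numel()))
        return torch.topk(self.tcs, k, largest=False).indices
```

In a training loop, `update` would be called once per mini-batch and `select` once per epoch to pick the samples to augment.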
We use the fast Lagrangian perceptual attack method (Algorithm 3 in [24]) to solve the Lagrangian multiplier function in Equation (5), which finds the optimal augmentation through gradient descent, starting at the input with a small amount of noise added. During the gradient descent steps, the Lagrangian coefficient is increased exponentially from 1 to 10 and the step size is decreased. The number of steps is set to 5 for all the experiments.
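A simplified sketch of such a Lagrangian attack is shown below; it is not Algorithm 3 of [24] verbatim: the sign-gradient update, the hypothetical loss_fn and percep_dist callables, and the default step counts are our own simplifications of "maximize the task loss subject to a perceptual-distance bound":

```python
import torch

def lagrangian_perceptual_attack(x, loss_fn, percep_dist, bound, steps=5,
                                 lam_start=1.0, lam_end=10.0, step_size=0.01):
    """Maximize loss_fn(x + delta) - lam * max(0, percep_dist(x, x + delta) - bound)
    by gradient ascent over delta, increasing lam exponentially from lam_start
    to lam_end over the steps while decaying the step size."""
    delta = 0.01 * torch.randn_like(x)  # start near x with a small amount of noise
    for t in range(steps):
        # Lagrangian coefficient grows exponentially; step size shrinks linearly
        lam = lam_start * (lam_end / lam_start) ** (t / max(1, steps - 1))
        lr = step_size * (1 - t / steps)
        delta.requires_grad_(True)
        adv = x + delta
        # penalize only violations of the perceptual bound
        obj = loss_fn(adv) - lam * torch.clamp(percep_dist(x, adv) - bound, min=0)
        grad, = torch.autograd.grad(obj, delta)
        delta = (delta + lr * grad.sign()).detach()
    return (x + delta).detach()
```

In the paper's setting, percep_dist would be the intermediate-layer perceptual distance and loss_fn the task loss of the model-in-training.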
Appendix B Additional Theoretical Results and Proofs
b.1 Proof of Theorem 4.2
Proof of Theorem 4.2.
Problem (1) contains two versions of objectives for :
(8) 
or
(9) 
Both Problem (8) and Problem (9) lead to the minimal sufficient representation . We first prove the claim for the more challenging objective, Problem (8).
I. For Problem (8): .
We first prove the sufficiency of , then prove the minimality of .
1) Proof of sufficiency
Since , and does not depend on , we have that the solution to Problem (8), , minimizes under constraint .
Then, we show that also minimizes .
We know that because of the Markovian property. Since , we have
(10) 
Then we can derive
(11)  
(12)  
(13)  
(14) 
Equality (13) holds because comes from a deterministic function of and . Since does not depend on , minimizes as it minimizes .
Also, we know that , so we can further obtain
(15)  
(16)  
(17)  
(18)  
(19) 
where Equation (19) holds because .
Therefore, minimizes .
Following the similar procedure as above (Equation (10) to Equation (19)), we are able to show that
(20) 
So also minimizes , which can be further decomposed into . Next we show by contradiction that equals and thus .
Define . Assume that the optimizer minimizes , but does not satisfy sufficiency, i.e., . We will then show that one can construct another representation such that , contradicting the assumption that minimizes . The construction of works as follows. Since the augmented data satisfies (Condition (a) of Theorem 4.2), we have . Hence, there exists a function such that . Define ; then we have
(21)  
(22)  
(23)  
(24)  
(25)  
(26) 
Therefore, the constructed conflicts with the assumption. We can conclude that any optimizer has to satisfy , which is equivalent to . The sufficiency of is thus proven.
As a result, the maximizer to Problem (8), , satisfies .
2) Proof of Minimality
Since is a deterministic function of and , we have
(27)  
(28)  
(29) 
where the equality in Equation (27) holds because and both hold.
And we can derive
(30)  
Moreover, we know that
(31) 
And we have , so
(32)  
Combining (30) and (32), we have
(33) 
Note that (33) holds for all sufficient statistics of w.r.t. .
Then we first show, by contradiction, that is a minimal representation of w.r.t. .
Assume there exists a random variable satisfying , such that .
Then we have
Hence, we get , which is impossible. So there does not exist such an , and is a minimal representation of w.r.t. .
Then, since we have (thanks to Data Processing Inequality), is also a minimal sufficient statistic of w.r.t. .
II. For Problem (9): .
Since the objective is to maximize , we only need to show that achieves the maximum mutual information with . According to the above proof for Problem (8), we know that there exist such that and . Hence, the optimizer to Problem (9) must satisfy sufficiency.
The proof of minimality is identical to the one under Problem (8).
∎
b.2 Additional Theoretical Results on Augmentation Properties
The two conditions in Theorem 4.2, Condition (a) and Condition (b), require that the augmentation is (a) sufficient and (b) minimal. These two conditions are closely related to augmentation rationales in prior papers. For example, Wang et al. [45] propose a symmetric augmentation, which results in Condition (a), as formalized in Lemma B.1 below. Furthermore, Tian et al. [40] propose an “InfoMin” principle of data augmentation, which minimizes the mutual information between different views. We show in Lemma B.2 that this InfoMin principle leads to the above Condition (b). In contrast, our Theorem 4.2 characterizes two key conditions of augmentation and directly relates them to the optimality of the learned representation.
Lemma B.1 (Sufficiency of Augmentation).
Suppose the original and augmented observations and satisfy the following properties:
(40a)  
(40b) 
Then the augmented observation is sufficient for the label , i.e., .
Proof of Lemma B.1.
(41)  
(42)  
(43)  
(44)  
(45) 
where the third equation utilizes the property of symmetric augmentation. ∎
Lemma B.2 (Maximal Insensitivity to Nuisance).
If Assumption 4.1 holds, i.e., $I(\hat{X}; Y \mid X) = 0$, the mutual information $I(X; \hat{X})$ can be decomposed as
(46) $I(X; \hat{X}) = I(\hat{X}; Y) + I(X; \hat{X} \mid Y)$
Since $\hat{X}$ is sufficient, i.e., $I(\hat{X}; Y)$ is a constant, minimizing $I(X; \hat{X})$ is equivalent to minimizing $I(X; \hat{X} \mid Y)$.
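One standard way to obtain such a decomposition, assuming Assumption 4.1 amounts to the augmentation carrying no label information beyond the input (i.e., $I(\hat{X}; Y \mid X) = 0$), is the chain rule of mutual information:

```latex
% Expand I(\hat{X}; X, Y) with the chain rule in two different orders:
I(\hat{X}; X, Y) = I(\hat{X}; X) + I(\hat{X}; Y \mid X)
                 = I(\hat{X}; Y) + I(\hat{X}; X \mid Y).
% If I(\hat{X}; Y \mid X) = 0, equating the two expansions gives
I(\hat{X}; X) = I(\hat{X}; Y) + I(\hat{X}; X \mid Y),
% so with I(\hat{X}; Y) fixed by sufficiency, minimizing I(\hat{X}; X)
% is the same as minimizing the label-irrelevant term I(\hat{X}; X \mid Y).
```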
Appendix C Experiments
c.1 Implementation Details
All code is implemented in PyTorch (https://pytorch.org/). To train the neural network with LPA3 augmentation, we apply separate batch normalization (BN) layers, i.e., augmented data and normal data use different BNs, which is a common strategy in previous adversarial augmentation methods [19, 48, 51]. The only hyperparameters of LPA3 are the label-preserving margin and the data selection ratio, which are tuned for each task according to the results in Sec. 6.3.

Semi-supervised learning. We reproduce FixMatch [36] based on public code (https://github.com/kekmodel/FixMatch-pytorch) and apply LPA3 to it. Following [36], we use a WideResNet-28-2 with 1.5M parameters for CIFAR-10, a WRN-28-8 for CIFAR-100, and a WRN-37-2 for STL-10. All the models are trained for iterations. The label-preserving margin is set to 0.002 for CIFAR-10 and STL-10, and 0.02 for CIFAR-100; the data selection ratio is set to 90. Since FixMatch only applies data augmentation to unlabeled data, LPA3 is likewise applied only to the unlabeled data. For unlabeled data, the label used in LPA3 is the pseudo label generated by the FixMatch algorithm.
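The separate-BN trick mentioned above can be sketched as follows; DualBatchNorm2d and its augmented flag are our own illustrative names, not the authors' code:

```python
import torch
import torch.nn as nn

class DualBatchNorm2d(nn.Module):
    """BatchNorm layer keeping separate statistics for clean vs. augmented data.

    A sketch of the common dual-BN strategy for adversarial augmentation:
    inputs are routed to one of two BN layers based on a flag, so the
    running statistics of clean and augmented batches never mix.
    """

    def __init__(self, num_features):
        super().__init__()
        self.bn_clean = nn.BatchNorm2d(num_features)
        self.bn_aug = nn.BatchNorm2d(num_features)

    def forward(self, x, augmented=False):
        return self.bn_aug(x) if augmented else self.bn_clean(x)
```

In practice, each BN layer of the backbone would be replaced by such a module, with the flag set per mini-batch.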
Noisy-label learning. We reproduce DivideMix [26] and PES [5] based on their official code (https://github.com/LiJunnan1992/DivideMix, https://github.com/tmllab/PES) and apply LPA3 to them as data augmentation. Following [26, 5], we use a ResNet-18 for CIFAR-10 and CIFAR-100. All the models are trained for 300 epochs. The label-preserving margin is set to 0.002 for CIFAR-10 and 0.02 for CIFAR-100, and the data selection ratio is set to 90. All the noise is symmetric noise. For noisy labeled data, the label used in LPA3 is the pseudo label generated by DivideMix or PES, respectively.
Medical image classification. We follow the original training and evaluation protocol of MedMNIST (https://github.com/MedMNIST/experiments) and apply LPA3 to the training procedure as data augmentation. ResNet-18 and ResNet-50 are trained for 100 epochs with the cross-entropy loss on all the multi-class classification subsets of MedMNIST. The label-preserving margin is set to 0.02, and the data selection ratio is tuned for each dataset. The hyperparameters of RandAugment [13] follow their original paper.
Sensitivity analysis of hyperparameters. In Fig. 5, the experiments for semi-supervised learning are conducted on CIFAR-100 with 2500 labeled data, the experiments for noisy-label learning are conducted on CIFAR-100 with 80% noisy labels, and the experiments for medical image classification are conducted on DermaMNIST with ResNet-50.