Long-Tailed Learning. Existing methods for long-tailed learning are generally divided into three categories: data re-sampling chawla2002smote; buda2018systematic, adjusting the classification boundary tang2020long, and re-weighting shu2019meta; lin2017focal; huang2016learning. The idea of re-sampling is to rebalance the class distribution by over-sampling the tail classes, which is effective but prone to overfitting on the tail classes. Methods of the second category enlarge the classification boundary of the tail classes while narrowing that of the head classes, by modifying the classification threshold menon2020long or by adjusting the weights of the output layer through normalization kang2019decoupling. The re-weighting strategy aims to assign larger loss weights to the tail classes. Conventional approaches of this category huang2016learning; huang2019deep
directly impose weights on each training sample, which is sensitive to outliers and causes unstable training ren2020balanced. Some recent works tan2020equalization achieve re-weighting by modifying the predicted scores in the Softmax function, which yields more stable training and promising performance. This work inherits the merit of the re-weighting strategy by adapting it to more challenging long-tailed scenarios with label noise.
Learning under Label Noise. Methods for learning under label noise can be categorized into two main types: sample re-weighting and relabeling. The re-weighting strategy treats samples with larger loss values as noise and reduces their influence by assigning them lower weights kumar2010self; huang2019o2u. MentorNet jiang2018mentornet learns data-driven curriculums for deep CNNs trained on corrupted labels. Meta-Weight-Net shu2019meta learns an explicit weighting function directly from a small clean data set. The relabeling strategy leverages noisy samples by refining their labels. Bootstrapping reed2014training
integrates assigned labels and the model's predictions by interpolation. Some works divide clean and noisy samples based on priors learned from a manually generated noisy set chen2021sample, and then take advantage of the noisy samples berthelot2019mixmatch.
Long-tailed Learning under Label Noise. Recently, some works have emerged to cope with long-tailed learning under label noise. HAR cao2020heteroskedastic regularizes different regions of the input space differently through a data-dependent regularization technique. CurveNet jiang2021delving learns to assign proper weights to different samples according to each sample's loss curve. ROLT wei2021robust combines DivideMix and LDAM to correct noisy labels and improve tail-class performance. Different from these methods, this work makes the first attempt to correct noisy labels and adjust per-class classification margins simultaneously, in a learnable manner that adapts to the training data.
This section describes the details of our dynamic loss and its optimization through meta-learning.
Given a noisy and imbalanced training set with image and its assigned one-hot class label , our goal is to learn a classifier with learnable parameters that maps an input image to class confidence scores. Despite the label noise and class imbalance in the training set, the classifier is required to accurately recognize all classes, so a balanced and clean test set is employed for evaluation.
The parameters are optimized by minimizing the classification loss on the train set:
is the cross-entropy loss. Since the training set contains both label noise and class imbalance, optimizing with such a naive cross-entropy loss suffers from two major drawbacks: i) the assigned labels of noisy samples do not match their ground-truths, which results in high loss values and forces the model to memorize noisy labels; ii) tail classes have much lower occurrence probabilities but share the same classification margins as head classes, and are thus prone to poor generalization.
To tackle the above two problems, we present a novel dynamic loss that simultaneously corrects noisy labels and adjusts the classification margin for different classes in an adaptive and learnable manner:
where and denote the reassigned label and the additive classification margin for , respectively.
Concretely, as illustrated in Figure 2, the dynamic loss is equipped with a learnable label corrector parameterized by and a margin generator parameterized by , which are respectively in charge of correcting per-sample labels and adjusting per-class classification margins. We continuously optimize them jointly with the classifier through meta-learning. Next, we describe the two components and their optimization in detail.
The label corrector identifies noisy samples and corrects their wrongly assigned labels in a class-wise manner. For identifying noisy samples, it divides all samples into groups by class, sorts the samples in each group by loss value, evenly partitions the sorted samples of each group into bins, and employs a lightweight class-wise meta net to learn whether each bin is dominated by noisy or clean samples. Consequently, the loss bin index for sample can serve as a reliable indicator for identifying label noise.
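The per-class loss binning can be sketched as follows; the function name and this NumPy-based implementation are illustrative, not the paper's actual code:

```python
import numpy as np

def loss_bin_indices(losses, labels, num_bins):
    """Per-class loss binning: sort each class's samples by loss value and
    assign each sample the index of its (roughly equal-sized) loss bin."""
    losses = np.asarray(losses, dtype=float)
    labels = np.asarray(labels)
    bins = np.zeros(len(losses), dtype=int)
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]        # samples of class c
        order = idx[np.argsort(losses[idx])]  # low loss -> high loss
        # evenly split the sorted samples into num_bins bins
        for b, chunk in enumerate(np.array_split(order, num_bins)):
            bins[chunk] = b
    return bins

# Toy example: within one class, low-loss samples land in bin 0,
# high-loss samples in bin 1.
print(loss_bin_indices([0.1, 5.0, 0.2, 6.0], [0, 0, 0, 0], num_bins=2))  # [0 1 0 1]
```

Samples in high-index bins are the candidates for label correction, since each class's largest losses concentrate there.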
For label correction, as long as the classifier has not severely over-fitted on the biased data, it learns mainly from the dominant clean samples and transfers the learned knowledge to noisy ones. Hence, the classifier's predictions on noisy samples are close to their ground-truths and can be used to correct the wrongly assigned labels.
Based on the above, our label corrector reassigns a ground-truth label for sample as a weighted sum of its assigned label and the classifier's prediction, based on the loss bin index :
where is a class-dependent weighting function that maps the bin index to a label weight. When sample is noisy and thus has a high loss value, its bin index is large and the computed label is close to the classifier's prediction; hence the label corrector replaces its wrongly assigned label with the classifier's prediction, and vice versa.
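The reassignment rule above can be sketched as a convex combination; here the weight w is assumed to be the class-wise meta net's output for the sample's loss bin (close to 1 for clean, low-loss bins so the assigned label is kept; close to 0 for noisy, high-loss bins so the prediction takes over), and all names are illustrative:

```python
import numpy as np

def correct_label(assigned_onehot, prediction, w):
    """Reassign a sample's label as a convex combination of its assigned
    one-hot label and the classifier's predicted distribution. w is the
    weight put on the assigned label: w ~ 1 for clean (low-loss) bins,
    w ~ 0 for noisy (high-loss) bins."""
    y = np.asarray(assigned_onehot, dtype=float)
    p = np.asarray(prediction, dtype=float)
    return w * y + (1.0 - w) * p

# A likely-noisy sample (high-loss bin, small w) is relabeled toward the
# classifier's confident prediction of class 2.
print(correct_label([1, 0, 0], [0.05, 0.05, 0.9], w=0.1))
```

With w = 1 the assigned label is returned unchanged, matching the behavior described for clean samples.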
For the design of the margin generator, we begin by revisiting the Label-Distribution-Aware Margin loss (LDAM) from the perspective of the generalization error bound. Due to fewer training samples, tail classes have larger generalization error bounds than head classes. Since the generalization error bound usually correlates negatively with the magnitude of the classification margin, increasing the classification margins of the tail classes will reduce their generalization error bounds.
In light of the above, Balanced Meta-Softmax, an unbiased extension of the standard Softmax, adjusts the classification margin for class based on its sample number , and adds the additive margin to the confidence score predicted by the classifier:
However, for long-tailed data with noisy labels, can no longer reflect the real sample number of class due to the existence of label noise. In addition, manually pre-defining the margin based solely on the sample number largely ignores the distinct classification difficulties of different classes.
We hence present a learnable margin generator, implemented as a two-layer MLP, to dynamically adjust the margin for each class by optimizing a learnable margin vector from an initial all-ones vector during classifier training:
By integrating the margin vector into the standard Softmax loss, we have:
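A minimal sketch of a margin-adjusted Softmax cross-entropy, shown with Balanced-Softmax-style margins log n_c for concreteness (in the paper's formulation the learned per-class margins would replace them); all names are illustrative:

```python
import numpy as np

def margin_softmax_loss(logits, label, margins):
    """Cross-entropy over margin-adjusted logits: class c's score gets an
    additive margin margins[c] before the Softmax. With margins[c] = log n_c
    this reduces to Balanced Softmax; a learnable margin vector could be
    plugged in instead."""
    z = np.asarray(logits, dtype=float) + np.asarray(margins, dtype=float)
    z = z - z.max()  # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

# With equal raw logits, boosting head-class scores by log n_c raises the
# loss of the tail class (label=2), pushing its decision boundary outward.
counts = np.array([1000.0, 100.0, 10.0])
print(margin_softmax_loss([2.0, 2.0, 2.0], label=2, margins=np.log(counts)))
```

Setting all margins to zero recovers the standard Softmax cross-entropy, which is the degenerate case the dynamic loss moves away from.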
Since the classification margin is in our formulation, the learned margin for class should be positively correlated with its sample number.
Hence the margin generator is capable of adjusting per-class margins automatically by adapting to the true class distribution underlying the long-tailed noisy data and the classification difficulty of each class in a learnable manner, with no manual interventions nor prior information required.
Hierarchical Sampling Strategy
We integrate the label corrector and the margin generator into a unified meta net , the key component of our dynamic loss. We apply meta-learning to optimize and guide the learning of the classifier to well adapt to balanced and clean test data.
Performing meta-learning requires building a meta set comprised of a small amount of balanced and clean data . Intuitively, samples with lower classification loss computed by tend to have correctly assigned labels. Hence we can build the meta set simply by selecting samples with the lowest loss values from each class in . However, since easier samples usually have lower loss values throughout training, such a sampling strategy tends to select the same easy samples at each epoch, making the model prone to overfitting to easy samples.
Hereby, we design a hierarchical sampling strategy to build through a two-step process: i) construct a primary set by randomly sampling samples from each class in ; ii) select low-loss samples from each class in the primary set to form the final meta set. The remaining samples in make up the counterpart set , as illustrated in Fig. 2.
The benefits of introducing the additional primary set are twofold: i) the samples in the primary set are randomly sampled at each epoch, which guarantees the resulting meta set to be distinct across different epochs; ii) the primary set has fewer samples than , which enables a larger probability of selecting hard yet clean samples near the decision boundary into the meta set. As a result, our hierarchical sampling strategy ensures both the dynamism and diversity of the meta set, preventing the model from overfitting to biased data.
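The two-step construction above can be sketched as below; the function and argument names (n_primary, n_meta) are hypothetical:

```python
import numpy as np

def hierarchical_sample(losses, labels, n_primary, n_meta, rng):
    """Two-step meta-set construction: (i) randomly draw n_primary samples
    per class to form the primary set, then (ii) keep the n_meta lowest-loss
    samples of each class from it. Returns the meta-set sample indices."""
    losses = np.asarray(losses, dtype=float)
    labels = np.asarray(labels)
    meta = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        primary = rng.choice(idx, size=min(n_primary, len(idx)), replace=False)
        meta.extend(primary[np.argsort(losses[primary])][:n_meta].tolist())
    return np.array(meta)

rng = np.random.default_rng(0)
losses = rng.random(100)
labels = np.repeat(np.arange(5), 20)  # 5 classes, 20 samples each
meta = hierarchical_sample(losses, labels, n_primary=10, n_meta=2, rng=rng)
print(len(meta))  # 2 samples per class -> 10
```

Because the primary set is redrawn every epoch, the low-loss selection operates on a different candidate pool each time, which is what keeps the meta set from collapsing onto the same easy samples.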
and respectively denote the parameters of the meta net and the classifier at iteration . We randomly sample two mini-batches and from and , respectively, and update and alternately as follows.
Update . The meta net is trained to guide the learning of the classifier on by correcting per-sample labels and adjusting per-class margins, such that can well adapt to the balanced and clean . We first virtually update on the dynamic loss with :
where is the learning rate for the classifier and is the batch size. Then we update by minimizing the loss of the virtually updated classifier on :
where is the learning rate for the meta net.
Update . We update by minimizing the loss of on with corrected labels and adjusted margins by the updated :
The above two steps are repeated over iterations so that and are optimized alternately until convergence.
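To make the bilevel updates concrete, the following toy sketch runs the same alternating scheme on 1-D quadratics; everything here (the losses, the names, and the finite-difference approximation of the meta gradient in place of backpropagating through the virtual SGD step) is illustrative rather than the paper's implementation:

```python
import numpy as np

# theta plays the classifier's parameters, phi the meta net's; the
# quadratic losses below are purely illustrative.
def train_loss_grad(theta, phi):   # d/dtheta of L_train(theta; phi) = (theta - phi)^2
    return 2.0 * (theta - phi)

def meta_loss(theta):              # L_meta(theta) = (theta - 1)^2 on the clean meta set
    return (theta - 1.0) ** 2

lr, lr_meta, eps = 0.1, 0.5, 1e-5
theta, phi = 0.0, 0.0
for _ in range(200):
    # 1) update phi by differentiating L_meta of the virtually updated theta
    def meta_objective(p):
        return meta_loss(theta - lr * train_loss_grad(theta, p))
    phi -= lr_meta * (meta_objective(phi + eps) - meta_objective(phi - eps)) / (2 * eps)
    # 2) update theta on the training loss shaped by the new phi
    theta -= lr * train_loss_grad(theta, phi)

print(round(theta, 3), round(phi, 3))  # both approach the meta optimum 1.0
```

Even though theta only ever descends the training loss, the meta parameter phi reshapes that loss so the classifier ends up at the meta-set optimum, which is exactly the mechanism the two alternating steps implement.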
Table 1 (excerpt): Dynamic Loss (Ours) — 79.73 | 80.55 | 48.56 | 48.98 | 80.12 | 93.64 | 74.76 | 93.08
We test our method on both synthetic and real-world long-tailed noisy data, and then present ablations to validate our design choices. More quantitative and qualitative results and training details are provided in supplementary materials.
Methods in Comparison. We compare with three types of methods that are respectively designed to address: i) both label noise and class imbalance, such as HAR, ROLT, FaMUS xu2021faster and CurveNet; ii) label noise, such as Co-teaching han2018co, SELFIE song2019selfie, DivideMix, ELR+ liu2020early, NCT chen2022compressing, PLC prog_noise_iclr2021, GJS englesson2021generalized, and CMW-Net shu2022cmw; and iii) class imbalance, such as Focal Loss, CB-Focal cui2019class, LDAM and Balanced Softmax.
Experiments on Long-Tailed Noisy Data
Results on CIFAR-N-LT. We evaluate our method on CIFAR-N-LT including CIFAR-10 and CIFAR-100 with simulated label noise and class imbalance. We first simulate the long-tailed dataset by following the exponential profile cao2019learning, with imbalance ratio . The long-tailed imbalance follows an exponential decay in the sample number across different classes. We then inject label noise to the long-tailed dataset to form the training set. In particular, the label of each sample is independently changed to class with probability , where is the total number of training samples and denotes the frequency of class . Following ROLT, we consider the imbalance ratio of and the noise rate of . We adopt the ResNet-32 he2016deep as the classifier.
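A sketch of this benchmark construction, under the stated exponential profile and frequency-aware label flipping (the exact exponent convention and helper names are assumptions for illustration):

```python
import numpy as np

def make_longtailed_noisy(labels, imbalance_ratio, noise_rate, rng):
    """Sketch of the CIFAR-N-LT construction (conventions are assumptions):
    (i) keep an exponentially decaying number of samples per class with the
    given imbalance ratio; (ii) with probability noise_rate, resample each
    kept label from the class-frequency distribution, so a class is drawn
    with probability proportional to its frequency."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    C = len(classes)
    n_max = np.bincount(labels).max()
    keep = []
    for i, c in enumerate(classes):
        n_i = int(n_max * imbalance_ratio ** (-i / (C - 1)))  # exponential profile
        idx = np.where(labels == c)[0]
        keep.extend(rng.choice(idx, size=min(n_i, len(idx)), replace=False).tolist())
    keep = np.array(keep)
    noisy = labels[keep].copy()
    freqs = np.bincount(noisy, minlength=C) / len(noisy)
    for j in range(len(noisy)):
        if rng.random() < noise_rate:          # flip this label
            noisy[j] = rng.choice(C, p=freqs)  # frequency-aware target class
    return keep, noisy

rng = np.random.default_rng(0)
labels_full = np.repeat(np.arange(10), 500)
keep, noisy_labels = make_longtailed_noisy(labels_full, imbalance_ratio=10,
                                           noise_rate=0.4, rng=rng)
print(np.bincount(labels_full[keep]))  # head class ~500 samples, tail ~50
```

Because flip targets follow the (long-tailed) class frequencies, noisy labels disproportionately land on head classes, which is what makes the observed per-class counts unreliable estimates of the true class distribution.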
Table 1 reports the average accuracy on CIFAR-10 and CIFAR-100 under various imbalance ratios and noise rates. Our method retains high accuracy across a wide range of bias levels, while previous methods degrade rapidly. In particular, our dynamic loss improves the average last accuracy by and over HAR on CIFAR-10-N-LT and CIFAR-100-N-LT, respectively. Moreover, it significantly outperforms the baseline model that simply fuses the strategies of DivideMix and Balanced-Softmax.
It is worth mentioning that previous methods rely on hyper-parameters carefully tuned according to the unobservable noise rate li2020dividemix, or perform two-stage training to obtain prior information about the class distribution cao2020heteroskedastic. In comparison, our dynamic loss uses a fixed set of hyper-parameters and requires only one round of end-to-end training, without any manual intervention, in all experiments.
Results on WebVision. We also evaluate our dynamic loss on WebVision, a real-world long-tailed noisy dataset. We adopt Inception-ResNet V2 szegedy2017inception as the classifier, following prior work li2020dividemix. Table 1 shows the results on the WebVision and ImageNet deng2009imagenet validation sets. Our method significantly outperforms other competing methods, even though most of them are equipped with additional model co-training and ensembling strategies. Notably, compared to HAR and ROLT+, which are dedicated to long-tailed noisy data, our method boosts the accuracy by at least on WebVision, clearly demonstrating its superiority.
Experiments on Noisy Data
Results on CIFAR-N. We test the performance of our dynamic loss on purely noisy data, namely CIFAR-10-N and CIFAR-100-N with symmetric noise rates and asymmetric noise rates . We adopt PreAct ResNet-18 (PARes18) he2016identity as the classifier, following DivideMix. Table 2 shows that our method achieves the best average accuracy compared with previous methods specially designed for learning on noisy data. In contrast to DivideMix, which requires manually tuning hyper-parameters for different noise types and rates, our dynamic loss adapts to various noisy scenarios in a fully self-adaptive manner without manual intervention.
Results on Animal-10N. We also test on the real-world noisy dataset Animal-10N. For a fair comparison with prior work song2019selfie, we adopt a randomly initialized VGG19-BN simonyan2014very as the classifier. Table 2 shows our dynamic loss achieves state-of-the-art performance among all priors, which clearly proves its superiority in dealing with real-world noisy data.
Experiments on Long-Tailed Data
Results on CIFAR-LT. We also test the performance of our dynamic loss on purely long-tailed data, using clean CIFAR datasets with varying imbalance ratios . Table 3 shows that our method achieves the best performance compared with previous methods specially designed for learning on long-tailed data. In particular, compared with LDAM, which adjusts classification margins based solely on the sample number, our method significantly boosts the accuracy by 4.03% and 5.75% on CIFAR-10-LT and CIFAR-100-LT, respectively. This evidences that our dynamic loss is capable of perceiving the classification difficulty of different classes and adjusting their classification margins adaptively.
Results on ImageNet-LT. Table 3 reports the results on ImageNet-LT, a large-scale long-tailed dataset with imbalance ratio . Our dynamic loss achieves the best performance, i.e., in accuracy, demonstrating its strong generalization ability.
Behavior of label corrector. We analyze the behavior of the label corrector on balanced CIFAR-10-N with asymmetric noise, which is designed to mimic the structure of real-world label noise by assigning distinct noise rates to different classes. Figure 3 depicts the weight learned by the label corrector and the percentage of noisy labels for increasing loss bin indices on each class. For the classes that contain noisy labels, clean samples mainly appear in top-ranked (low-loss) bins while noisy samples appear in bottom-ranked (high-loss) bins. This validates our motivation that the loss bin index can serve as a reliable input indicator for the label corrector to distinguish noisy from clean samples. Accordingly, the generated weight remains at and suddenly drops to at around bin , showing that the label corrector retains the assigned label for clean samples and turns to the predicted label, which is more likely to be the ground-truth, for noisy samples. For the classes without noisy labels, the weight remains at . Consequently, our label corrector always outputs correct labels for both noisy and clean samples across different classes.
Behavior of margin generator. We analyze the behavior of the margin generator on clean CIFAR-10-LT with imbalance factor . We visualize its generated margins for different classes in the left subfigure of Figure 4. Generally, as the class index increases, the sample number decreases, and the learned margin also decreases as expected. This suggests that the margin generator automatically figures out the sample numbers of different classes and adjusts their margins accordingly. Interestingly, we see irregularly larger margins on classes and . This can be explained by the right subfigure, in which we visualize the feature distribution of the meta set using t-SNE van2008visualizing. The features of these two classes correspond to the two rightmost clusters, indicating that they are easier to distinguish from the other classes. This evidences that the margin generator takes into account not only the sample number but also the classification difficulty of each class to generate comprehensively adaptive margins during classifier training.
Behavior of hierarchical sampling. We investigate how our hierarchical sampling strategy improves meta set construction. Figure 5 illustrates the feature distributions of the meta sets built by our hierarchical sampling (left) and by naive sampling (right). Compared with naive sampling, the features of samples selected by hierarchical sampling are more spread out within the separate clusters. This evidences that randomly building a primary set prior to low-loss selection yields more diverse meta data covering both easy and hard samples, and thus alleviates biased learning on easy samples.
Effect of label corrector. We build a model variant without the label corrector to evaluate its effectiveness. Table 4 shows that its average accuracy decreases by compared with the complete dynamic loss, which well evidences the effectiveness of the label corrector.
Effect of margin generator. As shown in Table 4, to verify the effectiveness of the margin generator, we first employ only the label corrector, which achieves an average last accuracy of 71.08%. Thereafter, we introduce Balanced-Softmax to deal with class imbalance, which lifts the accuracy only slightly, by 1.56%. Finally, equipping our margin generator boosts the accuracy to 79.77%. These results evidence that a dynamic margin is necessary for long-tailed noisy data.
Effect of hierarchical sampling. As shown in Table 4, replacing hierarchical sampling with naive random sampling results in an average last-accuracy drop of up to , which indicates that the meta set constructed by hierarchical sampling has a distribution more similar to the test set.
Effect of class-specific label corrector. To validate the class-specific design of our label corrector, we build a class-agnostic variant and evaluate it on CIFAR-10-N with 40% asymmetric noise, which exhibits different noise rates across classes. Our class-specific label corrector significantly outperforms its class-agnostic counterpart, by a large margin of 4.0% in accuracy (94.51% vs. 90.56%), which clearly validates the class-specific design.
Effect of meta net architecture. To validate the architecture design of the meta net, we simplify the label corrector and the margin generator to a -length and a -length learnable vector, respectively. Table 4 shows that such modifications lead to a noticeable performance drop (0.27%). The explanation, supported by our experimental observations, is that the MLPs learn proper label weights and per-class margins quickly, whereas the learnable vectors suffer from slow convergence.
Test on different classifiers. To validate the generality of our method, we further evaluate it using PARes18. We set the imbalance ratios and noise rates to and , respectively, and present the mean accuracy in Table 5. Our method also outperforms HAR on both CIFAR-10-N-LT and CIFAR-100-N-LT, evidencing that our dynamic loss is applicable to various classifiers.
This work presents a new dynamic loss for robust learning from long-tailed data with noisy labels. The dynamic loss comprises a learnable label corrector and a margin generator, which jointly correct noisy labels and adjust classification margins to guide the learning of a classifier. The meta net and the classifier are co-optimized through meta-learning, aided by a new hierarchical sampling strategy that provides unbiased yet diverse meta data. Extensive evaluations on both synthetic and real-world data show that our dynamic loss is effective and highly adaptive and robust to various types of data biases.
Appendix A Algorithm Pseudocode
Algorithm 1 depicts the detailed learning process. We begin with a warm-up stage that pre-trains on the entire training set to acquire preliminary classification capability. We then enter a robust learning stage that optimizes and in an alternating manner. Concretely, at the beginning of each epoch, we first construct a small, balanced and almost clean meta set from by selecting samples with low classification loss computed by the latest . The remaining samples of form a large counterpart set . Details of the optimization are presented in the manuscript. Source code is available at https://anonymous.4open.science/r/dynamic_loss-7BED (since hyperlinks are disabled, please copy and paste the URL; characters may be corrupted when copied directly from the PDF, so please verify it).
Appendix B Additional Visualizations
CIFAR-N. Figure 6 depicts the accuracy of corrected labels, computed as the proportion of samples with ground-truth labels after label correction. The label accuracy gradually increases as training proceeds and the classifier becomes more trustworthy, eventually reaching over on CIFAR-10-N with noise rates 0.2 and 0.4. Notably, the accuracy is also greatly improved, by , under a noise rate of . The high accuracy of corrected labels validates our design choices from two perspectives: i) it supports our assumption that the classifier mainly fits the dominant clean samples and can transfer the learned knowledge to noisy samples to predict their ground-truth labels; ii) the label corrector can accurately recognize noisy samples and correct their labels with the predicted ones.
shows the learnable weight over training epochs. At the beginning, the label corrector tends to use the given labels to train the classifier; it gradually shifts to trusting the predicted labels for samples in high-ranked loss bins once the classifier has been trained for about 45 epochs. Moreover, the label corrector accurately estimates the noise rate to be about 35% (under a nominal noise rate of 40%, 35% of samples are actually noisy). Together with Figure 6, this plot shows that the epoch at which the label corrector starts trusting the classifier is delayed as the noise rate increases: the label corrector effectively requires the classifier to be trained for more epochs to produce reliable predicted labels when the training set is noisier. This evidences that the label corrector dynamically relabels noisy samples according to the status of the classifier and the training set.
CIFAR-LT. Figure 8 visualizes the margins generated by the margin generator under different imbalance ratios. One can see that despite varying imbalance ratios, the generated margin consistently decays as the sample size grows (corresponding to increasing class index). Moreover, the variation of learned margins across classes tends to enlarge as the imbalance ratio increases. Both quantitative (in the manuscript) and qualitative analyses evidence that the margin generator well respects and adapts to various class distributions by learning to assign proper margins automatically.
Figure 9 shows how the classification margins and the feature distributions of the meta set vary over training epochs. The margin generator continuously adjusts the classification margins according to the feature distributions of the meta set during training. Taking class as an example, the margin generator tends to give it a small classification margin at epoch 50, since it is hard to recognize. However, it becomes easy to recognize at epoch 250, and the margin generator then assigns it a larger classification margin. This evidences that the margin generator can dynamically adjust classification margins according to classification difficulty.
WebVision. Considering the large number of categories in WebVision, we select 10 categories at intervals of 5 to visualize their learned label weights. As shown in Figure 10, the learned label weights vary across classes, suggesting that different classes have different noise rates, which is consistent with real-world datasets. Moreover, the margins generated for different classes accord with the complex variation of sample sizes, as shown in Figure 11. This demonstrates that our method adapts well to complicated real-world biased data.
Animal-10N. We also analyze the behavior of the label corrector on the real-world noisy dataset Animal-10N, as shown in Figure 12. Most of the learned label weights remain at and drop to at around bin (red dotted line), suggesting that the noise rate estimated by the label corrector is about , which is consistent with the well-recognized noise rate estimate for Animal-10N song2019selfie.
We also analyze the behavior of the margin generator on the long-tailed dataset ImageNet-LT, as shown in Figure 13. Considering the large number of categories in ImageNet-LT, we select 333 categories at intervals of 3 to visualize their learned margins. The learned margins vary consistently with the sample numbers of the classes, suggesting that the margin generator can generate proper margins for different classes.
Appendix C Test on Unbiased Data
Methods specially designed to cope with data bias may sometimes cause performance degradation on unbiased data. We therefore evaluate different robust learning methods on unbiased data. Table 6 shows that some priors like DivideMix and HAR suffer a noticeable accuracy drop, while our method still performs well on both CIFAR-10 and CIFAR-100. This shows that our dynamic loss can adaptively set up a proper learning objective depending on the condition of the training data.
Appendix D Training Time Analysis
Since training time is one of the main concerns with meta-learning, we also evaluate the total training time of our method, following DivideMix. Thanks to the techniques for accelerating meta-learning from FaMUS xu2021faster and CurveNet jiang2021delving, we train a model in about 7.2h on an NVIDIA GTX 1080 Ti, only slightly slower than DivideMix on an NVIDIA V100 GPU (5.2h). This evidences the efficiency of our method.
Appendix E Additional Dataset Details
CIFAR. Both CIFAR-10 and CIFAR-100 consist of 60,000 RGB images (50,000 for training and 10,000 for testing), equally distributed over 10 and 100 categories, respectively.
CIFAR-N. CIFAR-N is a simulated noisy dataset based on CIFAR. Commonly simulated label noise types include symmetric and asymmetric noise. Symmetric noise is generated by randomly replacing labels with any of the possible labels according to a fixed probability (the noise rate). Asymmetric noise is manually designed to mimic real-world label noise, where labels are only swapped with those of similar classes, such as deer ↔ horse and dog ↔ cat.
CIFAR-LT. CIFAR-LT is a simulated long-tailed dataset based on CIFAR, which reduces the number of training samples per class according to an exponential function n_i = n_max × μ^i, where i, n_i and n_max denote the class index, the number of samples of the i-th class and the maximum number of samples over all classes, respectively.
WebVision. WebVision li2017webvision is a large-scale real-world dataset with both label noise and class imbalance. It contains 2.4 million images, of which about are mislabeled chen2021two. Following MentorNet jiang2018mentornet, we create the mini-WebVision dataset by selecting the images of the top classes, with an observed imbalance ratio of about . We test the model on the validation sets of WebVision and ImageNet deng2009imagenet.
Animal-10N. The ANIMAL-10N dataset contains 5 pairs of confusing animals with a total of 55,000 images (50,000 for training and 5,000 for testing), equally distributed over all categories. The images are crawled from the web using the given labels as search keywords, which inevitably introduces substantial noise; the noise rate is estimated to be around 8%.
ImageNet-LT. ImageNet-LT contains 115.8K images distributed over 1,000 categories according to a Pareto distribution, where the number of images per class ranges from 5 to 1,280.
Appendix F Additional Training Details
Table 7 (excerpt): Learning scheduler — Cosine Annealing
All training details are presented in Table 7. For the meta net, we adopt the same settings in all experiments for generality: the Adam optimizer kingma2015adam with a fixed learning rate of 3e-3 and zero weight decay is employed.
CIFAR. For CIFAR, we follow Balanced-Softmax and train all classifiers for 300 epochs with batch size 512, using the same SGD optimizer with a momentum of 0.9 and a weight decay of 5e-4. The learning rate is initialized to and controlled by a cosine annealing scheduler loshchilov2016sgdr. Following Balanced-Softmax, RandomCrop, RandomFlip and AutoAugment are adopted for data augmentation. For a fair comparison, we apply the same data augmentation as in AutoAugment cubuk2019autoaugment to all compared methods except CurveNet jiang2021delving, due to incompatibility.
WebVision. We train the Inception-ResNet V2 for epochs using the SGD optimizer with momentum and weight decay 1e-4. The number of warm-up epochs is and the batch size is 64. The learning rate is initialized to and controlled by a cosine annealing scheduler.
Animal-10N. We train the VGG19-BN for epochs using the SGD optimizer with momentum and weight decay 5e-4. The number of warm-up epochs is and the batch size is 128. The learning rate is initialized to and controlled by a cosine annealing scheduler.
ImageNet-LT. Following previous works liu2019large, we train the ResNet-10 for epochs using the SGD optimizer with momentum and weight decay 1e-4. The number of warm-up epochs is and the batch size is 128. The learning rate is initialized to and controlled by a cosine annealing scheduler.
Appendix G Detailed Results
Here we provide detailed accuracy for all settings. Tables 8 and 9 correspond to the CIFAR-10-N-LT and CIFAR-100-N-LT results in Table 1 of the manuscript. As shown there, the performance of our last model is generally very close to that of our best model despite varying bias settings. In contrast, the last models of DivideMix and Balanced-Softmax degrade significantly compared with their corresponding best models, especially on CIFAR-10 with severe imbalance and noise (e.g., and ). This indicates that our method is much more resistant to overfitting on biased data than other methods.