A recently proposed technique called self-adaptive training augments modern neural networks by allowing them to adjust training labels on the fly, to avoid overfitting to samples that may be mislabeled or otherwise non-representative. By combining the self-adaptive objective with mixup, we further improve the accuracy of self-adaptive models for image recognition; the resulting classifier obtains state-of-the-art accuracies on datasets corrupted with label noise. Robustness to label noise implies a lower generalization gap; thus, our approach also leads to improved generalizability. We find evidence that the Rademacher complexity of these algorithms is low, suggesting a new path towards provable generalization for this type of deep learning model. Last, we highlight a novel connection between difficulties accounting for rare classes and robustness under noise, as rare classes are in a sense indistinguishable from label noise. Our code can be found at https://github.com/Tuxianeer/generalizationconfusion.
Most modern machine learning algorithms are trained to maximize performance on meticulously cleaned datasets; relatively less attention is given to robustness to noise. Yet many applications have noisy data, and even curated datasets like ImageNet have possible errors in the training set russakovsky2015imagenet.
Recently, Huang et al. huang2020self introduced a framework called self-adaptive training that achieves unprecedented results on noisy training data. Self-adaptive training augments an existing neural network by combining two procedures: it adjusts training labels that the model concludes are likely to be inaccurate, and it gives lower weight to training examples for which it is most uncertain. Intuitively, self-adaptive models are in some sense able to recognize when they are “confused” about examples in the training set. They fix the labels of confusing examples when they are convinced that the confusion is likely to be the result of incorrect labels; meanwhile, they downweight the examples that do not seem to match any of the classes well.
We develop some theory for self-adaptive training and find that the reweighting process should perform best when the weights accurately reflect in-distribution probabilities, which occurs precisely when the underlying model is well-calibrated. This leads us to augment the approach of huang2020self with a calibration process called mixup, which feeds models examples from a derived continuous distribution.
Our self-adaptive mixup framework combines the label noise robustness of self-adaptive training with the smoothing nature of mixup. In experiments on CIFAR10 and CIFAR100, self-adaptive mixup outperforms the previous noisy-label state-of-the-art under almost all levels of noise.
Additionally, we find that self-adaptive mixup generalizes especially well, with generalization gaps an order of magnitude below those of standard neural networks. These findings are consistent with the theoretical result from shalev2014understanding that high accuracy under label noise implies a lower generalization gap. And we might intuitively expect that self-adaptive methods should generalize well, as they in some sense focus on the most “representative” data points (those about which they have high confidence early on) and place less weight on outliers that other models overfit.
Building on the empirical generalization performance of our self-adaptive framework, we examine several properties of the method that we hope may provide a path to a formal proof of generalization. We show experimentally that, unlike standard neural networks, self-adaptive frameworks do not fit random labels: after label correction, the model’s posterior class-conditional probabilities are all essentially uniform ($1/10$ for each of the 10 CIFAR10 classes), irrespective of the initial labels. This suggests that the experimental Rademacher complexity of self-adaptive training may be low; a corresponding theoretical result, if found, would directly imply a bound on the generalization gap. Additionally, self-adaptive methods do not fully fit the training data, which implies that the generalization gap (between training error and test error) can be closed from both sides. Finally, self-adaptive training does not fit rare classes when the distribution is imbalanced. Although this at first seems like a deficiency, we note that it is a necessary consequence of robustness to label noise, and that it also implies a certain form of robustness to overfitting to outliers, which is key to generalization.
Taken together, our results on generalization point to the possibility that self-adaptive models’ ability to work around confusing examples in fact drives strong generalization performance. Indeed, their low Rademacher complexity arises precisely because they recognize completely noisy data sets as especially confusing and avoid conjuring faulty patterns. The other properties we highlight arise because some examples in the training data are more confusing than others, and self-adaptive models treat those examples as unlikely to be representative of their labeled classes.
Overall, our work highlights the power of the self-adaptive framework and presents direct improvements on the state-of-the-art for classification and generalization under label noise. We also give conceptual and empirical support for the possibility that self-adaptive frameworks may eventually admit a full proof of generalization, a type of argument that has remained elusive for neural networks.
At a high level, there are two prior classes of approaches for learning from noisy labels: label correction attempts to fix the labels directly (tanaka2018joint; thulasidasan2019combating; nguyen2020self), while loss correction attempts to fix the loss function through methods like backward and forward correction (see, e.g., DBLP:conf/cvpr/PatriniRMNQ17). Label correction, however, tends to be slow in practice because it involves multiple rounds of training. And as lukasik2020does recently highlighted, loss correction may have significant calibration errors.
Self-adaptive training huang2020self elegantly combines both approaches, altering the loss function and fixing the labels endogenously during training. The approach we introduce here strengthens the loss function correction of the self-adaptive approach by bringing the weights used in the model closer to the theoretical ideal. (See Tables 2 and 3 for experimental comparisons.)
Showing uniform convergence is a classic approach to proving generalization, but DBLP:conf/nips/NagarajanK19 demonstrated that it cannot be proven for standard neural networks. Nevertheless, DBLP:conf/nips/NagarajanK19 left the door open to the possibility that uniform convergence could be shown for neural networks that are trained by different approaches, such as those described in this paper.
While we are not able to give a full proof of generalization for self-adaptive models in this work, we gesture in that direction by providing both theoretical and empirical evidence that our models generalize in a way that might be provable. In particular, we give evidence that proving generalization of self-adaptive models may be easier than for standard neural networks (see Section 4).
Last, our work is to some degree related to the pathbreaking methods of DBLP:conf/icml/KohL17, who used “influence functions” to estimate the impact of each data point on model predictions. This method reveals the data points with the largest impact on the model, and in the case of noisy labels, DBLP:conf/icml/KohL17 showed that by manually correcting the labels of these most-relevant data points one can substantially improve performance. The self-adaptive approach in some sense employs a similar feedback process, but without the need for a human in the loop.
Incorrect data, in the form of label noise, is a significant problem for machine learning models in practice. Neural networks are robust to small numbers of faulty labels, but drop in performance dramatically with, say, 20% label noise. Self-adaptive training, proposed by huang2020self, is an augmentation that can be performed atop any training process to increase robustness to label noise. After a fixed number of start-up epochs, the model begins adjusting the labels of each example in the training set towards the model’s own predictions. In addition, rather than being given equal weight, each example in the training set is confidence-weighted: that is, each example’s contribution to the loss function is weighted by the maximum predicted class probability, a well-known metric for a model’s confidence about an example’s classification hendryks2016baseline. Intuitively, a self-adaptive model recognizes when it is “confused” by an example and proceeds to downweight that example’s contribution to the loss function, with the eventual goal of correcting the labels that seem most likely to be noisy.
Self-adaptive training modifies regular training in two ways:
Label Correction: During each iteration of training, the model updates its stored labels of the training data (known as “soft labels” and denoted $t_i$ for example $x_i$) to be more aligned with the model’s predictions $p_i$.
Reweighting: The $i$’th sample is weighted by $w_i = \max_j p_{i,j}$, the maximum predicted class probability, to reduce the weight of examples less likely to be in-distribution.
To gradually move the soft labels towards the predictions, the model uses a momentum update: $t_i \leftarrow \alpha t_i + (1 - \alpha) p_i$. Intuitively, when $\alpha = 1$ the soft label is not updated at all, and when $\alpha = 0$ the soft label is changed completely to match the model’s prediction. Thus $\alpha = 1$ corresponds to regular training, and $\alpha = 0$ essentially corresponds to early stopping, because if the labels are immediately set to match the predictions, then the model will cease updating (having already minimized the loss function). Thus, letting $\alpha$ vary between $0$ and $1$ interpolates between early stopping and regular training in some sense, and may allow combining the benefits of both.
Additionally, in each step of the algorithm, the samples’ contributions to the model are reweighted according to $w_i$, corresponding to a measure of the model’s confidence in its ability to classify sample $i$. If one of a sample’s predicted class probabilities is substantially higher than the others (say, $p_{i,j} \gg p_{i,k}$ for all $k \neq j$), then the sample is more likely to be representative of class $j$. Furthermore, especially in the context of noisy labels, a low value of $w_i$ corresponds to the model being “confused” about sample $i$, in the sense that it thinks that sample $i$ could come from many different classes. One source of confusion could be an out-of-distribution sample due to a noisy label, and it makes sense to assign lower weight to such possibly out-of-distribution examples.
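As a minimal sketch of the two modifications described above, the following Python snippet applies the momentum update to one example's soft label and computes its confidence weight. The function names, the name `alpha` for the momentum coefficient, and the three-class numbers are illustrative assumptions; they are not taken from the reference implementation.

```python
def update_soft_labels(soft, pred, alpha):
    """Momentum update for one example: t <- alpha * t + (1 - alpha) * p."""
    return [alpha * t + (1.0 - alpha) * p for t, p in zip(soft, pred)]


def confidence_weight(pred):
    """Weight of one example: its maximum predicted class probability."""
    return max(pred)


# Hypothetical 3-class example: the stored label says class 0,
# but the model's prediction favors class 1.
soft = [1.0, 0.0, 0.0]
pred = [0.2, 0.7, 0.1]

for _ in range(3):
    soft = update_soft_labels(soft, pred, alpha=0.9)

print(soft)                     # mass drifts from class 0 toward class 1
print(confidence_weight(pred))  # 0.7
```

With `alpha=1.0` the soft label never moves (regular training), and with `alpha=0.0` it jumps straight to the prediction, matching the two extremes discussed in the text.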
Note that the two features of self-adaptive training are highly complementary: unlike approaches that simply discard examples with low maximum predicted class probability, self-adaptive training can eventually bring the weight for those examples back up, allowing the model to make full use of those samples if and when it becomes confident that it classifies them correctly. (This may happen after those examples themselves have been relabeled, or when enough other examples have been relabeled.) For completeness, we reproduce the self-adaptive training algorithm of huang2020self as Algorithm 1 in Appendix A.
In self-adaptive training, the label correction step allows the model to avoid discarding samples it is confident are wrongly labeled (or worse, overfitting to them); instead, the model adjusts the labels of those samples and learns from the corrected data. The reweighting component makes use of the fact that the model believes some samples are more likely than others to be in-distribution. Note that while self-adaptive training must determine which labels are incorrect, it takes advantage of the fact that the model’s performance on individual training examples actually directly implies a set of in-distribution probabilities.
A main observation of huang2020self is that in scenarios with high amounts of label noise, modern neural networks will initially learn, or “magnify,” the simple patterns in the data, achieving higher accuracy on the test set than is even present in the (noisy) training set. However, in later epochs the models train to essentially 100% accuracy on the training set, thereby overfitting to the noise (see Figure 1, which we reproduce from huang2020self). Self-adaptive training mitigates this problem by correcting that noise when earlier epochs are convinced the labels are wrong.
However, the preceding explanation cannot be the entire story, as huang2020self also found that self-adaptive training improves test accuracy on ImageNet with no label noise. While it is possible that the base ImageNet dataset itself may have some latent label noise in the form of incorrectly classified images, we propose a more nuanced theory that expands on the idea of magnification mentioned above.
Last year, li2019gradient showed that early stopping is provably robust to label noise; more broadly, it is believed that the early epochs of training produce simpler models that are more easily representable by a neural network. It is natural to think that the data points that feed into those early networks are somehow more easily interpreted than others. In that case, by early-epoch label correction, self-adaptive training has the ability to downweight confusing and difficult-to-learn examples; and by doing so, it may have greater ability to extract and learn the “main ideas” from a dataset.
Despite all the advantages just described, it seems that the self-adaptive training framework still leaves some room for improvement. In an ablation study, huang2020self attributed almost all of the gains from self-adaptive training to the label correction component, while reweighting provided only marginal further benefits. Prima facie, this is surprising because reweighting can mask non-representative examples, which is especially relevant in the presence of noise.
We now provide a novel mathematical derivation which suggests that the choice of weights used in the prior self-adaptive training setup might be preventing reweighting from achieving its full potential. Formally, the goal of the model is to minimize the true loss
$$\mathbb{E}_{(x,y) \sim P}\left[L(f(x), y)\right].$$
Now, suppose that there are many samples $(x_i, y_i)$, where some come from a distribution $P$ and some come from a distribution $Q$. Let the datapoint $(x_i, y_i)$ be drawn from the distribution $P$ with probability $\pi_i$ and $Q$ with probability $1 - \pi_i$.
For many kinds of noise (including uniform label noise), $\mathbb{E}_{(x,y) \sim Q}[L(f(x), y)]$ is constant for all $f$. Since to compare models it suffices to compute loss up to translation, we assume that $\mathbb{E}_{(x,y) \sim Q}[L(f(x), y)]$ is always $0$.
For any weights $w_i$ ($w_i \geq 0$), we can write an unbiased Monte Carlo estimator of the true loss as (modulo some algebra)
$$\frac{\sum_i w_i L(f(x_i), y_i)}{\sum_i w_i \pi_i}. \qquad (1)$$
We seek to choose the $w_i$ to minimize the variance of the estimator (1). The scaling of the $w_i$ is irrelevant, so treat $\sum_i w_i^2$ as constant. Under the assumption that the variance of $L(f(x_i), y_i)$ is $\sigma^2$ for all $i$, and since the pairs $(x_i, y_i)$ are independent, the variance of the numerator of (1) is just $\sigma^2 \sum_i w_i^2$, a constant. Thus, minimal variance is achieved in (1) when the denominator is maximized. By the Cauchy-Schwarz inequality, for fixed $\sum_i w_i^2$, the denominator $\sum_i w_i \pi_i$ is maximized with $w_i \propto \pi_i$.
The preceding derivation shows that in theory the optimal weights are proportional to the in-distribution probabilities of each point. The formulation of self-adaptive training by huang2020self uses as weights the maximum predicted class probability. hendryks2016baseline suggests that the magnitudes of these weights generally increase with in-distribution probabilities, but that the relationship is not linear. As a consequence, we might be able to further improve the self-adaptive model by tuning the weights to be more proportional to the in-distribution probabilities; we do this in the next section.
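The conclusion of this derivation can be checked numerically. The sketch below assumes, as in the argument above, that every per-sample loss has the same variance and that samples are independent, so the variance of the weighted estimator is the common variance times the sum of squared weights, divided by the squared sum of weight times in-distribution probability. The function name and the five probabilities are hypothetical and chosen only for illustration.

```python
def estimator_variance(weights, probs, sigma2=1.0):
    """Variance of the weighted estimator under the stated assumptions:
    sigma^2 * sum(w_i^2) / (sum(w_i * prob_i))^2."""
    numerator = sigma2 * sum(w * w for w in weights)
    denominator = sum(w * p for w, p in zip(weights, probs)) ** 2
    return numerator / denominator


# Hypothetical in-distribution probabilities for five samples.
probs = [0.9, 0.8, 0.5, 0.2, 0.1]

candidates = {
    "proportional to prob": probs,
    "uniform": [1.0] * len(probs),
    "proportional to prob^2": [p * p for p in probs],
}

for name, weights in candidates.items():
    print(name, estimator_variance(weights, probs))
```

Among these three candidates, the weights proportional to the in-distribution probabilities give the smallest variance, and rescaling all weights by a constant leaves the variance unchanged, as the Cauchy-Schwarz argument predicts.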
To get closer to optimal weights in self-adaptive training, we draw upon a class of methods that have been developed to better quantify predictive uncertainty. Specifically, a model is said to be calibrated if, when it suggests a sample has probability $p$ of being of class $c$, the correct posterior probability of class $c$ is indeed $p$: exactly the property we show in Subsection 2.2 to be necessary for optimal reweighting in self-adaptive training.
As noted by guo2017calibration, modern neural networks tend to be poorly calibrated out-of-the-box. And, recalling our analysis from the previous section, if the model underlying our self-adaptive training process is poorly calibrated, then the weights it produces will not match the theoretical ideal. We thus pair self-adaptive training with a calibration method called mixup DBLP:conf/iclr/ZhangCDL18. As we show in experiments, using mixup to improve the weights of self-adaptive training yields substantial improvements.
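One common way to quantify how far a model is from this calibration property is the expected calibration error, which bins predictions by confidence and compares average confidence to accuracy within each bin. The sketch below is an illustration only: the text does not say which calibration metric was used, and the function name, bin count, and toy inputs are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average over confidence bins of |mean confidence - accuracy|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Calibrated toy model: 75% confident and right 3 times out of 4.
print(expected_calibration_error([0.75] * 4, [True, True, True, False]))
# Overconfident toy model: 95% confident but right only half the time.
print(expected_calibration_error([0.95] * 4, [True, False, True, False]))
```

A well-calibrated model scores near zero, while an overconfident one scores roughly the gap between its stated confidence and its actual accuracy.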
Mixup is a data augmentation procedure proposed by DBLP:conf/iclr/ZhangCDL18 that is known to improve calibration (DBLP:conf/nips/ThulasidasanCBB19) when applied to standard models. Under mixup, instead of training on the given dataset $D$, the model trains on a “mixed up” dataset, draws from which correspond to linear combinations of pairs of datapoints in $D$.
Formally, given a dataset $D = \{(x_i, y_i)\}$ of images and labels and a smoothing parameter $\alpha > 0$, mixup considers convex combinations
$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j,$$
where $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$. (Here $\alpha$ denotes mixup’s smoothing parameter, not the momentum coefficient of self-adaptive training.) Loss is normally computed via the standard cross-entropy loss on $\tilde{x}$ and $\tilde{y}$.
In the paper introducing mixup (DBLP:conf/iclr/ZhangCDL18), the authors interpreted the approach as encouraging the model to linearly interpolate between training examples, as opposed to training on only extreme “all or nothing” examples. In the context of calibration, this alleviates model overconfidence by teaching the model to output non-binary predictions.
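The mixing step can be sketched in a few lines of standard-library Python. Inputs are represented here as flat lists of floats and labels as one-hot lists; the function name `mixup_pair` and the toy inputs are assumptions made for illustration, not part of any published implementation.

```python
import random


def mixup_pair(x_i, y_i, x_j, y_j, alpha, rng=random):
    """Return a convex combination of two (input, label) pairs.

    The mixing ratio lam is drawn from Beta(alpha, alpha).
    """
    lam = rng.betavariate(alpha, alpha)
    x_mixed = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mixed = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mixed, y_mixed, lam


rng = random.Random(0)  # seeded only so the example is repeatable
x, y, lam = mixup_pair([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0],
                       alpha=1.0, rng=rng)
print(lam, x, y)  # the mixed label is [lam, 1 - lam], no longer one-hot
```

Because the mixed label is a genuine probability vector rather than a one-hot vector, the model is trained toward non-binary outputs, which is the mechanism the text credits for improved calibration.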
While mixup applies out-of-the-box to a standard training process, the two main aspects of self-adaptive training (reweighting and label correction) both potentially conflict with mixup. We thus develop the self-adaptive mixup algorithm, which integrates mixup into the self-adaptive paradigm. Self-adaptive mixup takes in two parameters: the smoothing parameter $\alpha$ used by standard mixup, and a “cutoff” parameter $s$. As in self-adaptive training, the algorithm maintains a set of soft labels for the original training examples. The two components of self-adaptive training are then adjusted as follows:
Label Correction: Self-adaptive training involves updating soft labels of training examples toward model predictions, but mixed-up models do not train on examples in the original training set directly. Thus, we need a rule that determines when to update the soft labels; intuitively, we choose to update only when training on a mixed-up example that is sufficiently similar to an original example. Specifically, the algorithm only updates soft labels when training on mixed-up examples that are at least a proportion $s$ of a single original example: i.e., for a mixed-up example $\lambda x_i + (1 - \lambda) x_j$ we update the soft label of $x_i$ iff $\lambda \geq s$, and update the soft label of $x_j$ iff $1 - \lambda \geq s$.
Reweighting: The reweighting component of self-adaptive training weights the contribution of each example to the loss by the maximum class-conditional probability predicted by the model. Analogously, in self-adaptive mixup, we weight each mixed-up example’s contribution to the loss by the maximum class-conditional probability the model assigns to that mixed-up example.
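The two adjusted components can be sketched for a single mixed-up example as follows. Several details here are assumptions rather than statements of the authors' implementation: that the training target is the same convex combination applied to the two soft labels, that the soft labels are corrected toward the prediction on the mixed-up example, and the names `cutoff` and `momentum`.

```python
def self_adaptive_mixup_step(soft_i, soft_j, pred_mixed, lam, cutoff, momentum):
    """Bookkeeping for one mixed-up example built from originals i and j.

    Returns (target, weight, new_soft_i, new_soft_j).
    """
    # Assumed target: mix the current soft labels with the same ratio as the inputs.
    target = [lam * a + (1.0 - lam) * b for a, b in zip(soft_i, soft_j)]
    # Reweighting: maximum class probability predicted for the mixed-up example.
    weight = max(pred_mixed)

    def corrected(soft):
        # Momentum update toward the prediction on the mixed-up example.
        return [momentum * t + (1.0 - momentum) * p
                for t, p in zip(soft, pred_mixed)]

    # Label correction only for an original that makes up at least `cutoff`
    # of the mixed-up example.
    new_soft_i = corrected(soft_i) if lam >= cutoff else list(soft_i)
    new_soft_j = corrected(soft_j) if (1.0 - lam) >= cutoff else list(soft_j)
    return target, weight, new_soft_i, new_soft_j


target, weight, t_i, t_j = self_adaptive_mixup_step(
    soft_i=[1.0, 0.0], soft_j=[0.0, 1.0], pred_mixed=[0.6, 0.4],
    lam=0.9, cutoff=0.8, momentum=0.9)
print(target, weight, t_i, t_j)  # only the soft label of example i changes
```

With a mixing ratio near one half, neither original dominates the mixed-up example, so neither soft label is touched in that step.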
We present the results of experiments testing self-adaptive mixup. We performed experiments on the CIFAR10 and CIFAR100 datasets with uniform label noise rates of $0.2$, $0.4$, $0.6$, and $0.8$. All experiments were conducted with a Wide ResNet 34x10 zagoruyko2016wide, following huang2020self. Each individual run took approximately 4 hours on one TPU v2.
The results are displayed in the bottom two rows of Table 1. Again following huang2020self, we include for comparison a number of past results of models trained for label noise, including self-adaptive training and (standard) mixup. In self-adaptive mixup, we fix the cutoff $s$ and use two values of $\alpha$. For these hyperparameters, $s$ was chosen intuitively as a low but reasonable value, and the two values of $\alpha$ were suggested by past work on mixup DBLP:conf/iclr/ZhangCDL18. (Due to lack of computing power, we were only able to test self-adaptive mixup with a single value of $s$; we suggest exploring optimizing this hyperparameter as a potential direction for future research.) Finally, each reported self-adaptive mixup accuracy is the median result across three independent runs with different noisy labels and random seeds.
Table 1: Accuracy (%) on CIFAR10 and CIFAR100 at each label noise rate.
| Method | CIFAR10, 0.2 | CIFAR10, 0.4 | CIFAR10, 0.6 | CIFAR10, 0.8 | CIFAR100, 0.2 | CIFAR100, 0.4 | CIFAR100, 0.6 | CIFAR100, 0.8 |
|---|---|---|---|---|---|---|---|---|
| CE + Early Stopping huang2020self | 85.57 | 81.82 | 76.43 | 60.99 | 63.70 | 48.60 | 37.86 | 17.28 |
| Mixup DBLP:conf/iclr/ZhangCDL18 | 93.58 | 89.46 | 78.32 | 66.32 | 69.31 | 58.12 | 41.10 | 18.77 |
| SCE wang2019symmetric | 90.15 | 86.74 | 80.80 | 46.28 | 71.26 | 66.41 | 57.43 | 26.41 |
| SAT huang2020self | 94.14 | 92.64 | 89.23 | 78.58 | 75.77 | 71.38 | 62.69 | 38.72 |
| SAT + SCE huang2020self | 94.39 | 93.29 | 89.83 | 79.13 | 76.57 | 72.16 | 64.12 | 39.61 |
| Ours (first value of $\alpha$) | 94.83 | 93.72 | 91.21 | 80.25 | 75.21 | 72.45 | 65.12 | 38.96 |
| Ours (second value of $\alpha$) | 95.48 | 94.15 | 89.31 | 74.45 | 78.03 | 72.67 | 62.59 | 32.65 |
As we see in Table 1, self-adaptive mixup improves on the state-of-the-art in all but one combination of dataset and noise rate, and the improvement is often substantial.
We see also that as the noise rate increases, the optimal choice of $\alpha$ decreases. To see why this might be, note that the mixup ratio $\lambda$ is drawn from $\mathrm{Beta}(\alpha, \alpha)$, which approaches $\mathrm{Bernoulli}(1/2)$ as $\alpha \to 0$ and approaches the point mass at $1/2$ as $\alpha \to \infty$. Recall that label correction happens only when the drawn ratio is not between $1 - s$ and $s$. If $f_\alpha$ is the probability density function of $\mathrm{Beta}(\alpha, \alpha)$, labels are updated with probability $1 - \int_{1-s}^{s} f_\alpha(\lambda)\, d\lambda$, which decreases with $\alpha$. Thus, we conjecture that the observed phenomenon occurs because labels are updated via label correction more often when $\alpha$ is low, and frequent label correction is more important when labels are more noisy ex ante.
The dependence of the optimal $\alpha$ on the noise rate is potentially problematic in practice, as the noise rate in real-world datasets is of course not known a priori. However, we believe that noise rates in the wild should be substantially below one half, and thus would recommend in general the value of $\alpha$ that performs best at lower noise rates. (Furthermore, note that the self-adaptive approach actually exposes information about the noise rate of the dataset via the frequency of changed labels, which suggests that an adaptive approach that varies $\alpha$ and/or $s$ during training may be fruitful.)
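The claim that label correction fires less often as the mixup parameter grows can be checked with a small Monte Carlo estimate. In this sketch the cutoff of 0.8 and the three values of the mixup parameter are hypothetical, chosen only for illustration; they are not the values used in the experiments.

```python
import random


def label_update_probability(alpha, cutoff, n_samples=20000, seed=0):
    """Estimate P(lam >= cutoff or lam <= 1 - cutoff) for lam ~ Beta(alpha, alpha)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        lam = rng.betavariate(alpha, alpha)
        if lam >= cutoff or lam <= 1.0 - cutoff:
            hits += 1
    return hits / n_samples


for alpha in (0.2, 1.0, 4.0):  # hypothetical values of the mixup parameter
    print(alpha, label_update_probability(alpha, cutoff=0.8))
```

For a mixup parameter of 1 the ratio is uniform on the unit interval, so the estimate should be close to 0.4 at this cutoff; smaller parameters push mass toward the endpoints and raise the update probability, while larger ones concentrate mass near one half and lower it.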
We further note that the train and test performances of self-adaptive mixup are much closer to each other than those of self-adaptive training. This suggests (as we discuss further in the next section) that self-adaptive mixup may admit guarantees on generalization performance.
In classification, the central metric of interest is the generalization error, which measures how the model performs on the true distribution, as proxied by error on a test set shalev2014understanding. A model’s generalization gap is the absolute difference between its training error and its generalization error. The self-adaptive training models of huang2020self perform especially well on this metric; huang2020self found generalization gaps for their methods to be substantially lower than those of standard neural networks.
Relatedly, shalev2014understanding notes a connection between generalization and stability of models under perturbations to the training set: they consider the effect of replacing a sample from the training set, and note that the difference in the probability of a model classifying the sample correctly between the case in which (1) the model has the sample and the case in which (2) it is replaced is equal to the generalization gap of the model. We consider the setting where data points are not replaced but instead given random labels, which is strictly more difficult. In this setting, better robustness to label noise (such as that of self-adaptive mixup) is strong evidence for low generalization error. For instance, Table 2 suggests that self-adaptive mixup models have even lower generalization gaps than vanilla self-adaptive training.

Table 2: Generalization gaps.
| | Models that fit train 100% | SAT huang2020self | Ours ($\alpha$ = 0.2) |
|---|---|---|---|
| Gen. Gap | >59% | 12% | 6% |

Beyond striking empirical generalization performance, there are conceptual reasons to think that self-adaptive frameworks might generalize especially well. By construction, self-adaptive models are able to recognize when examples in the training set are “confusing” in the sense that those examples may not be representative of their labeled classes. This is useful when individual labels may be incorrect, but even when there is no label noise, this may also improve generalization by focusing the model on the most representative examples in the training set. Additionally, whereas standard neural networks can achieve 100% accuracy on the training set, self-adaptive models do not do so by construction, because they change some of the labels in the training set; this implicitly tightens the potential generalization gap, which may give a better pathway to theoretical proofs of generalization.
In the remainder of this section, we present several types of evidence suggesting that self-adaptive models may in fact have strong (potentially provable) generalization performance.
The Rademacher complexity of a hypothesis class is defined as the expected value of the maximum correlation between a randomly drawn set of labels and any hypothesis in the hypothesis class. It is well-known that a model’s generalization error can be bounded by training error plus twice the model’s Rademacher complexity with high probability (see, e.g., shalev2014understanding).
ZhangBHRV17 showed that neural networks can fit arbitrary labels; this implies that the Rademacher complexity of the hypothesis class of neural networks is $1$. By contrast, self-adaptive models cannot fit arbitrary labels, and thus it is plausible that the true Rademacher complexity of the self-adaptive class is low.

Table 3: Fitting random labels on CIFAR10.
| | Models that fit train 100% | Mixup DBLP:conf/iclr/ZhangCDL18 | SAT huang2020self & Ours ($\alpha$ = 0.2) |
|---|---|---|---|
| Train Acc. | 100.0% | 12.1% | 10.2% |
| Gen. Gap | 90.0% | 2.1% | 0.2% |
Indeed, experimental evidence suggests that the empirical Rademacher complexity of self-adaptive training converges to $0$ as the training set increases in size: we mimic the ZhangBHRV17 experiment by using self-adaptive training to fit random labels, reporting the results in Table 3. We run a Wide ResNet 34x10 boosted with self-adaptive training in the same manner as originally done by huang2020self on the CIFAR10 dataset, except with entirely random labels (i.e. almost exactly 10% of the labels are correct). In each of several runs, the reweighting process causes the soft labels for every example to converge to almost exactly $1/10$ in every class, and moreover the predicted class for all test images is identical, and is that of the most common class in the training data. In other words, every example is seen as having a $1/10$ probability of being in each of the 10 classes. This suggests that self-adaptive training’s Rademacher complexity is quite low and that arguments that attempt to bound the algorithm’s Rademacher complexity might be fruitful; as we have already noted, such a result would lead to a proof of generalization.
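To make the contrast concrete, the sketch below estimates the empirical Rademacher complexity of two toy hypothesis classes on eight points: one that contains every possible labeling (a stand-in for a model that can fit arbitrary labels) and one that contains only the two constant predictors (a stand-in for a model that collapses to a single predicted class). The toy classes, the function name, and the sample sizes are assumptions made for illustration; this is not the experiment reported in Table 3.

```python
import itertools
import random


def empirical_rademacher(hypotheses, n_draws=500, seed=0):
    """Monte Carlo estimate of E_sigma[max_h (1/n) * sum_i sigma_i * h_i].

    Each hypothesis is a tuple of +1/-1 predictions on the n training points.
    """
    rng = random.Random(seed)
    n = len(hypotheses[0])
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(sum(s * h for s, h in zip(sigma, hyp)) / n
                     for hyp in hypotheses)
    return total / n_draws


n = 8
all_labelings = list(itertools.product((-1, 1), repeat=n))  # fits any labels
constant_only = [(1,) * n, (-1,) * n]                        # one class for everything

print(empirical_rademacher(all_labelings))  # 1.0: some hypothesis matches every draw
print(empirical_rademacher(constant_only))  # much smaller, and shrinks as n grows
```

The first class always achieves perfect correlation with random signs, while the constant class can only match the majority sign, which is why a model that cannot fit random labels is a natural candidate for a low-complexity analysis.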
Theorizing about the extent to which standard neural network models generalize is especially challenging: since neural networks can both (1) fit arbitrary labels and (2) generalize empirically on some datasets, any proof of generalization must be data-dependent. This has led to much work proceeding in the direction of showing that specific data manifolds have special properties that support generalization DBLP:conf/iclr/LyuL20; li2019gradient; brutzkus2018sgd.
Furthermore, even state-of-the-art neural networks have substantial generalization error. For example, the best test accuracy to date on CIFAR100 is 93.6%, and the highest accuracy of networks trained only on the CIFAR100 training data is just 89.3% cubuk2019autoaugment, although training accuracy is 100%. (Even models trained using billions of images from Instagram as in mahajan2018exploring do not get close to 100% testing accuracy on ImageNet or CIFAR100.)
If a theorem for neural network generalization were to be found, it would have to take the form
$$|\text{training error} - \text{generalization error}| \leq \epsilon. \qquad (2)$$
But in the case of CIFAR100, any theorem pertaining to a currently existing model would need to have $\epsilon \geq 10.7\%$ in (2); it is not clear what sort of theoretical argument would yield an explicit bound on $\epsilon$ that far from $0$.
By contrast, because self-adaptive models have non-trivial error on the training set, the generalization gap for those models can close from both sides. And indeed, the potential value of $\epsilon$ implied by our empirical results is much smaller than the 10.7% cited above, giving us hope that a formal theorem might be within reach. More broadly, we conjecture that formal results on generalization should be more readily obtainable for classes of models with training accuracy bounded away from 100%.
The algorithm-dependent VC-dimension, as defined explicitly in DBLP:conf/nips/NagarajanK19, is the VC-dimension of the hypotheses that can be obtained by running the algorithm on some set of labels. A bound on the algorithm-dependent VC-dimension would also imply a generalization bound of order $\sqrt{d/n}$ (up to logarithmic factors), where $d$ is the VC-dimension and $n$ is the size of the training set (shalev2014understanding). (Note that such a bound would not contradict DBLP:conf/nips/NagarajanK19, because the training method is not the standard stochastic gradient descent.)
Encouraged by the results of the random labels experiment above, we present an argument that the algorithm-dependent VC-dimension of self-adaptive training is actually extremely limited.
Our key idea is inspired by the observation of DBLP:conf/iclr/SagawaKHL20 that early stopping fails to fit infrequent classes. In DBLP:conf/iclr/SagawaKHL20, it was noted that empirical risk minimization performed relatively poorly in classifying objects in rare classes, only obtaining worst-class test accuracies of 41.1% on CelebA. As a supplement, DBLP:conf/iclr/SagawaKHL20 also tried using early stopping with empirical risk minimization, but surprisingly this caused performance to decline even more, bringing worst-case test accuracy down to 25%.
Our proposed explanation for the surprising phenomenon in DBLP:conf/iclr/SagawaKHL20 is as follows: It is known that early stopping is resistant to label noise li2019gradient. The bound proven in li2019gradient is adversarial, in that it shows that early stopping resists any proportion of label noise, even if the noise is chosen adversarially. However, note that one form of adversarial label noise is the introduction of a rare class. Intuitively, if a neural network were only given examples corresponding to one class, say, “airplanes,” it would learn to classify everything as an airplane. If the network were given exactly one example corresponding to a “bird” (alongside all the examples labeled as airplanes), ideally the neural network would learn to classify objects similar to the “bird” example as birds. However, noise stability requires that the neural network should not learn to classify similar objects as birds, because an ex ante indistinguishable interpretation is that “all objects are airplanes” and there was one noisy label. Thus, the objectives of noise resilience and rare-class recognition are conflicting.
As a result, rare-class recognition would seem to be particularly difficult for self-adaptive training. Self-adaptive training repeatedly reinforces model predictions by moving the labels closer and closer to the predictions. If by the time the model begins the label correction phase, most of the predictions for examples in rare classes are still incorrect and fail to match the training labels, those training labels will be shifted towards the dominant class in a “tyranny of the majority” effect: almost all of the training labels will become those of the dominant classes.
In Figure 2, we show an experiment that runs self-adaptive training on datasets with imbalanced classes. Specifically, the top graphs show the training curves of vanilla cross-entropy training, and the bottom graphs show the training curves of self-adaptive training. For various ratios $r$, the models are trained on a dataset consisting only of airplanes and automobiles (classes 0 and 1) from CIFAR10, where the ratio of airplanes to automobiles is $r$ and there are 5000 airplanes in the training set (the maximum possible). We then report the worst-class accuracy (i.e. accuracy on test set automobiles), as the best-class accuracy is essentially 100% in all cases. All experiments are done with a Wide ResNet 34x10, following huang2020self in all unrelated parameters.
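One way such an imbalanced two-class subset could be assembled from a list of integer labels is sketched below. The function name, the seeded subsampling, and the toy label list are assumptions made for illustration; the text does not describe how the subsets were actually drawn.

```python
import random


def imbalanced_indices(labels, majority, minority, ratio, seed=0):
    """Keep every `majority` example and about len(majority) / ratio `minority` ones."""
    rng = random.Random(seed)
    major = [i for i, y in enumerate(labels) if y == majority]
    minor = [i for i, y in enumerate(labels) if y == minority]
    n_minor = max(1, round(len(major) / ratio))
    return major + rng.sample(minor, min(n_minor, len(minor)))


# Toy label list standing in for a real dataset: 100 of class 0, 100 of class 1,
# and 50 of an unrelated class 2 that should be dropped entirely.
labels = [0] * 100 + [1] * 100 + [2] * 50
subset = imbalanced_indices(labels, majority=0, minority=1, ratio=10)
print(sum(1 for i in subset if labels[i] == 0),
      sum(1 for i in subset if labels[i] == 1))
```

Keeping the full majority class and subsampling only the minority class mirrors the description above, where the number of airplanes is held at its maximum and the number of automobiles varies with the ratio.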
Notice that while self-adaptive training frequently outperforms vanilla training in the presence of balanced classes, these experiments show that it is inferior in the regime of imbalanced classes. In particular, the standard cross-entropy loss gives perfect accuracies for the majority class and decent accuracies for the minority class, while self-adaptive training performs much worse with respect to the minority class. The two variants perform similarly until epoch 60 (when label correction begins), after which the self-adaptive training accuracies begin to decrease (drastically for some ratios $r$) as the model hypothesizes the minority class examples are mislabeled. This is the essence of the tradeoff: cross-entropy training learns the rare class to decent accuracy, while self-adaptive training compromises worst-class accuracy for the sake of robustness to noise.
We have augmented self-adaptive training with mixup, improving calibration and thus increasing the benefits of the model’s reweighting step along lines suggested by theory. The resulting models obtain state-of-the-art accuracies for image classification under label noise. We also noticed strong generalization performance, and provided several threads of reasoning that could lead to formal generalization results for self-adaptive frameworks.
Under conventional wisdom, training a classifier to 100% accuracy on a training set should improve test performance, which suggests that as far as training goes, longer is better. However, models trained for less time tend to exhibit superior generalization. Self-adaptive models get the best of both worlds; they have better performance than standard training while also taking advantage of shorter training’s magnification effects. The key idea, we have argued, is that self-adaptive models can recognize when they are confused and adjust their training progression accordingly. A consequence is that self-adaptive models are likely to be useful beyond settings with label noise, and we expect them to be powerful whenever some examples of classes are more representative than others, which is more or less the generic case.
We appreciate the helpful comments of Demi Guo, Daniel Kane, Hikari Sorensen, and members of the Lab for Economic Design (especially Jiafeng Chen, Duncan Rheingans-Yoo, Suproteem Sarkar, Tynan Seltzer, and Alex Wei). Kominers gratefully acknowledges the support of National Science Foundation grant SES-1459912, as well as the Ng Fund and the Mathematics in Economics Research Fund of the Harvard Center of Mathematical Sciences and Applications (CMSA). Part of this work was inspired by conversations with Nikhil Naik and Bradley Stadie at the 2016 CMSA Conference on Big Data, which was sponsored by the Alfred P. Sloan Foundation.
The algorithm form of self-adaptive training is reproduced below. In particular, label correction appears on lines 6 and 10, and reweighting appears on lines 8 and 10. In the algorithm, the $t_i$ represent “soft labels” on the examples in the training set, which start out as the (possibly noisy) “one-hot” labels. The model trains regularly until epoch $E_s$, when the model begins updating the soft labels based on current predictions.
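For readers who want something executable, the following is a toy sketch of the training loop on a linear softmax classifier with full-batch gradient descent, written with the standard library only. It is not the algorithm listing of huang2020self, and its line numbers do not correspond to the ones cited above. The model, the learning rate, the choice to treat the weights as constants when taking gradients, and the synthetic data are all assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def self_adaptive_training(xs, noisy_labels, n_classes, epochs, start_epoch,
                           momentum=0.9, lr=0.5):
    """Toy self-adaptive training; returns (model weights, final soft labels)."""
    dim = len(xs[0])
    weights = [[0.0] * dim for _ in range(n_classes)]
    # Soft labels start out as the (possibly noisy) one-hot labels.
    soft = [[1.0 if c == y else 0.0 for c in range(n_classes)]
            for y in noisy_labels]

    for epoch in range(epochs):
        preds = [softmax([sum(w * v for w, v in zip(row, x)) for row in weights])
                 for x in xs]
        if epoch >= start_epoch:
            # Label correction: move each soft label toward the prediction.
            soft = [[momentum * t + (1.0 - momentum) * p for t, p in zip(ts, ps)]
                    for ts, ps in zip(soft, preds)]
            # Reweighting: maximum predicted class probability per example.
            conf = [max(ps) for ps in preds]
        else:
            conf = [1.0] * len(xs)  # regular training during the start-up epochs
        norm = sum(conf)
        # Gradient of the weighted cross-entropy w.r.t. the logits is
        # conf_i * (p_i - t_i), treating conf_i as a constant.
        for c in range(n_classes):
            for d in range(dim):
                grad = sum(w * (ps[c] - ts[c]) * x[d]
                           for w, ps, ts, x in zip(conf, preds, soft, xs)) / norm
                weights[c][d] -= lr * grad
    return weights, soft


# Synthetic two-class data (one feature plus a bias term) with 20% flipped labels.
rng = random.Random(0)
clean = [i % 2 for i in range(40)]
xs = [[rng.gauss(-1.0 if y == 0 else 1.0, 0.5), 1.0] for y in clean]
noisy = [1 - y if rng.random() < 0.2 else y for y in clean]

_, soft = self_adaptive_training(xs, noisy, n_classes=2, epochs=30, start_epoch=5)
relabeled = sum(1 for s, y in zip(soft, noisy) if s.index(max(s)) != y)
print("soft labels whose top class differs from the given label:", relabeled)
```

Setting `start_epoch` equal to `epochs` recovers plain training, since no soft label is ever updated and every example keeps weight one.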