Meta-Learning based Noise-Tolerant Training
Despite the success of deep neural networks (DNNs) in image classification tasks, human-level performance relies on massive training data with high-quality manual annotations, which are expensive and time-consuming to collect. There exist many inexpensive data sources on the web, but they tend to contain inaccurate labels. Training on noisy labeled datasets causes performance degradation because DNNs can easily overfit to the label noise. To overcome this problem, we propose a noise-tolerant training algorithm, where a meta-learning update is performed prior to the conventional gradient update. The proposed meta-learning method simulates actual training by generating synthetic noisy labels, and trains the model such that after one gradient update using each set of synthetic noisy labels, the model does not overfit to the specific noise. We conduct extensive experiments on the noisy CIFAR-10 dataset and the Clothing1M dataset. The results demonstrate the advantageous performance of the proposed method compared to several state-of-the-art baselines.
One of the key reasons why deep neural networks (DNNs) have been so successful in image classification is the collection of massive labeled datasets such as ImageNet and COCO. However, it is time-consuming and expensive to collect such high-quality manual annotations. A single image often requires agreement from multiple annotators to reduce label error. On the other hand, there exist less expensive ways to collect labeled data, such as using search engines or social media websites, or reducing the number of annotators per image. However, those low-cost approaches introduce low-quality annotations with label noise. Many studies have shown that label noise can significantly affect the accuracy of the learned classifiers [2, 22, 31]. In this work, we address the following problem: how to effectively train on noisy labeled datasets?
Existing methods often rely on human supervision, for example to estimate label confusion [30, 16]. However, those methods scale poorly to large datasets. On the other hand, methods without human supervision (e.g. label correction [18, 23] and noise correction layers [22, 5]) are scalable but less effective and more heuristic. In this work we propose a meta-learning based noise-tolerant (MLNT) training algorithm to learn from noisy labeled data without human supervision or access to any clean labels. Rather than designing a specific model, we propose a model-agnostic training algorithm, which is applicable to any model that is trained with a gradient-based learning rule.
The prominent issue in training DNNs on noisy labeled data is that DNNs often overfit to the noise, which leads to performance degradation. Our method addresses this issue by optimizing for model parameters that are less prone to overfitting and more robust against label noise. Specifically, for each mini-batch, we propose a meta-objective to train the model, such that after the model goes through a conventional gradient update, it does not overfit to the label noise. The proposed meta-objective encourages the model to produce consistent predictions after it is trained on a variety of synthetic noisy labels. The key idea of our method is that a noise-tolerant model should be able to consistently learn the underlying knowledge from data despite different label noise. The main contributions of this work are as follows.
We propose a noise-tolerant training algorithm, where a meta-objective is optimized before conventional training. Our method can in principle be applied to any model trained with a gradient-based rule. We aim to optimize for a model that does not overfit to a wide spectrum of artificially generated label noise.
We formulate our meta-objective as follows: train the model such that after it learns from various synthetic noisy labels using a gradient update, the updated models give consistent predictions with a teacher model. We adapt a self-ensembling method to construct the teacher model, which gives more reliable predictions unaffected by the synthetic noise.
We perform experiments on two datasets with synthetic and real-world label noise, and demonstrate the advantageous performance of the proposed method in image classification tasks compared to state-of-the-art methods. In addition, we conduct an extensive ablation study to examine different components of the proposed method.
Learning with label noise. A number of approaches have been proposed to train DNNs with noisy labeled data. One line of approaches formulates explicit or implicit noise models to characterize the distribution of noisy and true labels, using neural networks [22, 28, 16, 5, 8, 11], directed graphical models [13], or conditional random fields. The noise models are then used to infer the true labels or assign smaller weights to noisy samples. However, these methods often require a small set of data with clean labels to be available, or use expensive estimation methods. They also rely on specific assumptions about the noise model, which may limit their effectiveness with complicated label noise. Another line of approaches uses correction methods to reduce the influence of noisy labels. The Bootstrap method introduces a consistency objective that effectively re-labels the data during training. Tanaka et al. propose to jointly optimize network parameters and data labels. An iterative training method has been proposed to identify and downweight noisy samples. A few other methods have also been proposed that use noise-tolerant loss functions to achieve robust learning under label noise [27, 4, 3].
Meta-Learning. Recently, meta-learning methods for DNNs have seen a resurgence in popularity. Meta-learning generally seeks to perform the learning at a level higher than where conventional learning occurs, e.g. learning the update rule of a learner, or finding weight initializations that can be easily fine-tuned or transferred. Our approach is most related to MAML, which aims to train model parameters that can learn well based on a few examples and a few gradient descent steps. Both MAML and our method are model-agnostic and perform training by doing gradient updates on simulated meta-tasks. However, our objective and algorithm are different from those of MAML. MAML addresses few-shot transfer to new tasks, whereas we aim to learn a noise-tolerant model. Moreover, MAML trains using a classification loss on a meta-test set, whereas we use a consistency loss with a teacher model.
Several recent methods based on self-ensembling have improved the state-of-the-art results for semi-supervised learning [20, 10, 24], where labeled samples are scarce and unlabeled samples are abundant. These methods apply a consistency loss to the unlabeled samples, which regularizes a neural network to make consistent predictions for the same samples under different data augmentation, dropout and noise conditions. We focus in particular on the self-ensembling approach proposed by Tarvainen & Valpola, as it forms one of the bases of our approach. Their approach uses two networks: a student network and a teacher network, where the weights of the teacher are the exponential moving average of those of the student. They enforce the student network to make consistent predictions with the teacher network. In our method, we use the teacher network in meta-test to train the student network such that it is more tolerant to label noise.
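We consider a classification problem with a training set $\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\}$, where $x_i$ denotes the $i$-th sample and $y_i \in \{0, 1\}^c$ is a one-hot vector representing the corresponding noisy label over $c$ classes. Let $f(x_i, \theta)$ denote the discriminative function of a neural network parameterized by $\theta$, which maps an input to an output of the $c$-class softmax layer. The conventional objective for supervised classification is to minimize an empirical risk, such as the cross entropy loss:
$$\mathcal{L}_c(X, Y, \theta) = -\frac{1}{n} \sum_{i=1}^{n} y_i \cdot \log\big(f(x_i, \theta)\big), \tag{1}$$
where $X$ and $Y$ denote the samples and their labels, and $\cdot$ denotes the dot product.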
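To make the cross entropy objective concrete, the following minimal sketch evaluates it for a batch of softmax outputs and one-hot labels. The function name `cross_entropy`, the small constant `eps`, and the toy numbers are illustrative assumptions rather than part of the original method.

```python
import math

def cross_entropy(probs, onehots, eps=1e-12):
    """Batch mean of -y_i . log f(x_i, theta).

    probs:   softmax outputs, one list of c probabilities per sample
    onehots: one-hot label vectors with the same shape as probs
    eps:     guards against log(0); an illustrative choice
    """
    total = 0.0
    for p, y in zip(probs, onehots):
        # Dot product between the one-hot label and the log-probabilities.
        total -= sum(y_j * math.log(p_j + eps) for p_j, y_j in zip(p, y))
    return total / len(probs)

# Two samples, three classes. The second label disagrees with the
# prediction, as a noisy label would, and dominates the loss.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [[1, 0, 0], [0, 0, 1]]
loss = cross_entropy(probs, labels)
```

A sample whose label contradicts a confident prediction contributes a large term, which is why a model that minimizes this loss on noisy labels is pushed to fit the noise.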
However, since the labels $y_i$ contain noise, the neural network can overfit and perform poorly on the test set. We propose a meta-objective that encourages the network to learn noise-tolerant parameters. The details are delineated next.
Our method can learn the parameters of a DNN model in such a way as to “prepare” the model for label noise. The intuition behind our method is that when training with a gradient-based rule, some network parameters are more tolerant to label noise than others. How can we encourage the emergence of such noise-tolerant parameters? We achieve this by introducing a meta-learning update before the conventional update for each mini-batch. The meta-learning update simulates the process of training with label noise and makes the network less prone to over-fitting. Specifically, for each mini-batch of training data, we generate a variety of synthetic noisy labels on the same images. With each set of synthetic noisy labels, we update the network parameters using one gradient update, and enforce the updated network to give consistent predictions with a teacher model unaffected by the synthetic noise. As shown in Figure 1, the meta-learning update optimizes the model so that it can learn better with conventional gradient update on the original mini-batch. In effect, we aim to find model parameters that are less sensitive to label noise and can consistently learn the underlying knowledge from data despite label noise. The proposed meta-learning update consists of two procedures: meta-train and meta-test.
Meta-Train. Formally, at each training step, we consider a mini-batch $(X, Y)$ of data sampled from the training set, where $X = \{x_1, \dots, x_k\}$ are $k$ samples, and $Y = \{y_1, \dots, y_k\}$ are the corresponding noisy labels. We want to generate multiple mini-batches of noisy labels $\{\hat{Y}_1, \dots, \hat{Y}_M\}$ with a label noise distribution similar to that of $Y$. We describe the procedure for generating one set of noisy labels $\hat{Y}_m$. First, we randomly select $\rho$ samples out of the mini-batch of $k$ samples. For each selected sample $x_i$, we rank its neighbors within the mini-batch. Then we randomly select a neighbor $x_j$ from its top 10 nearest neighbors (10 is experimentally determined), and use the neighbor’s label $y_j$ to replace the label for $x_i$, i.e. $\hat{y}_i^m = y_j$. Because we transfer labels among neighbors, the synthetic noisy labels are from a similar distribution as the original noisy labels. We repeat the above procedure $M$ times to generate $M$ mini-batches of synthetic noisy labels. Note that we compute nearest neighbors based on the Euclidean distance between feature representations (pre-softmax layer activations) generated by a DNN pre-trained on the entire noisy training set $\mathcal{D}$.
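The label-transfer procedure can be sketched as follows. This is a minimal illustration that assumes the pre-computed feature vectors are plain Python lists and that labels are class indices; the function name `synthetic_noisy_labels` and its parameters are hypothetical.

```python
import random

def synthetic_noisy_labels(features, labels, rho, top_k=10, rng=random):
    """Return a copy of `labels` in which `rho` randomly chosen samples take
    the label of one of their `top_k` nearest neighbours in the mini-batch."""
    n = len(labels)
    new_labels = list(labels)
    for i in rng.sample(range(n), rho):
        # Rank the other samples by squared Euclidean distance to sample i.
        ranked = sorted(
            (sum((a - b) ** 2 for a, b in zip(features[i], features[j])), j)
            for j in range(n) if j != i
        )
        neighbours = [j for _, j in ranked[:top_k]]
        # Transfer the (original) label of a randomly picked neighbour.
        new_labels[i] = labels[rng.choice(neighbours)]
    return new_labels

# Toy mini-batch: 16 samples with 4-dimensional features and 3 classes.
rng = random.Random(0)
features = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(16)]
labels = [rng.randrange(3) for _ in range(16)]
# Repeat the procedure to obtain several synthetic label sets.
synthetic = [synthetic_noisy_labels(features, labels, rho=4, rng=rng)
             for _ in range(10)]
```

Because replacement labels always come from the mini-batch itself, each synthetic label set differs from the original in at most `rho` positions and never introduces a class that is absent from the batch.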
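Let $\theta$ denote the current model’s parameters. For each synthetic mini-batch $(X, \hat{Y}_m)$, we update $\theta$ to $\theta'_m$ using one gradient descent step on the mini-batch:
$$\theta'_m = \theta - \alpha \nabla_\theta \mathcal{L}_c(X, \hat{Y}_m, \theta),$$
where $\mathcal{L}_c(X, \hat{Y}_m, \theta)$ is the cross entropy loss described in equation 1, and $\alpha$ is the step size.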
Meta-Test. Our meta-objective is to train $\theta$ such that each set of updated parameters $\theta'_m$ does not overfit to the specific noisy labels $\hat{Y}_m$. We achieve this by enforcing each updated model to give consistent predictions with a teacher model. We consider the model parameterized by $\theta$ as the student model, and construct the teacher model parameterized by $\tilde{\theta}$ following the self-ensembling method of Tarvainen & Valpola. The teacher model usually gives better predictions than the student model. Its parameters are computed as the exponential moving average (EMA) of the student’s parameters. Specifically, at each training step, we update $\tilde{\theta}$ with:
$$\tilde{\theta} = \gamma \tilde{\theta} + (1 - \gamma)\theta,$$
where $\gamma$ is a smoothing coefficient hyper-parameter.
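The EMA update of the teacher can be illustrated with a few lines of code. The sketch below treats parameters as a flat list of floats, which is a simplifying assumption; the function name `ema_update` is hypothetical.

```python
def ema_update(teacher, student, gamma):
    """One EMA step: teacher <- gamma * teacher + (1 - gamma) * student."""
    return [gamma * t + (1.0 - gamma) * s for t, s in zip(teacher, student)]

teacher = [1.0, 0.0]
student = [0.0, 1.0]

# A large smoothing coefficient keeps the teacher close to its past value.
slow = ema_update(teacher, student, gamma=0.9)

# A smaller coefficient gives the teacher a shorter memory.
fast = ema_update(teacher, student, gamma=0.5)
```

With a larger smoothing coefficient the teacher moves only slightly toward the student at each step, so its predictions change smoothly over training.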
Since the teacher model is unaffected by the synthetic label noise, we enforce a consistency loss that encourages each updated model (with parameters $\theta'_m$) to give consistent predictions with the teacher model on the same input $X$. We define the consistency loss $\mathcal{J}(\theta'_m)$ as the Kullback-Leibler (KL) divergence between the softmax predictions from the updated model, $f(X, \theta'_m)$, and the softmax predictions from the teacher model, $f(X, \tilde{\theta})$. We find that the KL divergence produces better results than the mean squared error used in the original self-ensembling method.
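A minimal sketch of such a consistency loss is given below. The text does not fix the argument order of the KL divergence here, so the sketch assumes the teacher prediction is the reference distribution; the function names and the constant `eps` are also illustrative assumptions.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(p_j * math.log((p_j + eps) / (q_j + eps))
               for p_j, q_j in zip(p, q))

def consistency_loss(updated_preds, teacher_preds):
    """Batch mean of KL(teacher || updated).

    The argument order is an assumption made for this sketch.
    """
    pairs = list(zip(teacher_preds, updated_preds))
    return sum(kl_divergence(t, u) for t, u in pairs) / len(pairs)

teacher = [[0.9, 0.05, 0.05]]
agree = [[0.9, 0.05, 0.05]]      # updated model still matches the teacher
overfit = [[0.05, 0.9, 0.05]]    # updated model followed a wrong label

loss_agree = consistency_loss(agree, teacher)
loss_overfit = consistency_loss(overfit, teacher)
```

The loss is zero when the updated model reproduces the teacher and grows when a single gradient step on synthetic labels has pulled its prediction toward a different class.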
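We want to minimize the consistency loss for all of the $M$ updated models with parameters $\{\theta'_1, \dots, \theta'_M\}$. Therefore, the meta loss is defined as the average of all consistency losses:
$$\mathcal{L}_{meta}(\theta) = \frac{1}{M} \sum_{m=1}^{M} \mathcal{J}(\theta'_m) = \frac{1}{M} \sum_{m=1}^{M} \mathcal{J}\big(\theta - \alpha \nabla_\theta \mathcal{L}_c(X, \hat{Y}_m, \theta)\big). \tag{7}$$
Although $\mathcal{L}_{meta}(\theta)$ is computed using the updated model parameters $\theta'_m$, the optimization is performed over the student model parameters $\theta$. We perform stochastic gradient descent (SGD) to minimize the meta loss. The model parameters $\theta$ are updated as follows:
$$\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}_{meta}(\theta),$$
where $\eta$ is the meta-learning rate.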
After the meta-learning update, we perform SGD to optimize the classification loss on the original mini-batch $(X, Y)$:
$$\theta \leftarrow \theta - \beta \nabla_\theta \mathcal{L}_c(X, Y, \theta),$$
where $\beta$ is the learning rate. The full algorithm is outlined in Algorithm 1.
Note that the meta-gradient involves a gradient through a gradient, which requires calculating the second-order derivatives with respect to $\theta$. In our experiments we use a first-order approximation by omitting the second-order derivatives, which can significantly increase the computation speed. The comparison in Section IV-D shows that this approximation performs almost as well as using second-order derivatives. This provides another intuition to explain our method: the first-order approximation considers the term $\alpha \nabla_\theta \mathcal{L}_c(X, \hat{Y}_m, \theta)$ in equation 7 as a constant. Therefore, we can consider the update as injecting data-dependent noise into the parameters, and adding noise to the network during training has been shown by many studies to have a regularization effect [21, 10].
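One training step with the first-order approximation can be sketched on a toy linear softmax classifier, where all gradients have a closed form. Everything in this sketch is an illustrative assumption: the linear model, the helper names, the KL direction (teacher as reference), and the choice to update the teacher at the end of the step.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(W, X):
    """Softmax outputs of a linear model; W holds one weight row per class."""
    return [softmax([sum(w * v for w, v in zip(row, xi)) for row in W])
            for xi in X]

def grad_wrt_W(W, X, targets):
    """Gradient of the batch-mean loss -sum_j t_j log p_j with respect to W.

    With one-hot targets this is the cross-entropy gradient. With teacher
    probabilities as targets it equals the gradient of KL(teacher || model),
    since the entropy of the targets does not depend on W.
    """
    n = len(X)
    G = [[0.0] * len(W[0]) for _ in W]
    for p, t, xi in zip(predict(W, X), targets, X):
        for k in range(len(W)):
            for d in range(len(xi)):
                G[k][d] += (p[k] - t[k]) * xi[d] / n
    return G

def sgd(W, G, lr):
    return [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(W, G)]

def mlnt_step(W, W_teacher, X, Y, synthetic_Ys, alpha, eta, beta, gamma):
    teacher_preds = predict(W_teacher, X)
    M = len(synthetic_Ys)
    meta_grad = [[0.0] * len(W[0]) for _ in W]
    for Y_hat in synthetic_Ys:
        # Meta-train: one gradient step on the synthetic noisy labels.
        W_m = sgd(W, grad_wrt_W(W, X, Y_hat), alpha)
        # Meta-test, first-order: consistency gradient evaluated at W_m and
        # applied to W as if the inner step were a constant offset.
        G_m = grad_wrt_W(W_m, X, teacher_preds)
        meta_grad = [[a + b / M for a, b in zip(ar, br)]
                     for ar, br in zip(meta_grad, G_m)]
    W = sgd(W, meta_grad, eta)                   # meta-learning update
    W = sgd(W, grad_wrt_W(W, X, Y), beta)        # conventional update
    W_teacher = [[gamma * t + (1.0 - gamma) * s for t, s in zip(tr, sr)]
                 for tr, sr in zip(W_teacher, W)]  # EMA teacher (assumed order)
    return W, W_teacher

W0 = [[0.0, 0.0], [0.0, 0.0]]
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [[1, 0], [0, 1]]
Y_hats = [[[0, 1], [0, 1]], [[1, 0], [1, 0]]]    # two synthetic label sets
W1, T1 = mlnt_step(W0, W0, X, Y, Y_hats,
                   alpha=0.1, eta=0.4, beta=0.2, gamma=0.9)
```

The full second-order variant would additionally differentiate through the inner step that produces `W_m`; the sketch skips that term, which is what makes each synthetic mini-batch cost only two gradient evaluations.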
We propose an iterative training scheme for two purposes: (1) Remove samples with potentially wrong class labels from the classification loss $\mathcal{L}_c$. (2) Improve the predictions from the teacher model so that the consistency loss is more effective.
First, we perform an initial training iteration following the method described in Algorithm 1, and acquire the model with the best validation accuracy (usually the teacher). We call that model the mentor and use $\theta^*$ to denote its parameters. In the second training iteration, we repeat the steps in Algorithm 1 with two changes described as follows.
First, if the classification loss $\mathcal{L}_c$ is applied to the entire training set $\mathcal{D}$, samples with wrong labels can corrupt training. Therefore, we remove a sample from the classification loss if the mentor model assigns a low probability to its given class. In effect, the classification loss now samples batches from a filtered training set $\mathcal{D}'$ which contains fewer corrupted samples:
$$\mathcal{D}' = \{(x, y) \mid y \cdot h(x, \theta^*) > \tau\},$$
where $h(x, \theta^*)$ is the softmax prediction of the mentor model, and $\tau$ is a threshold to control the balance between the quality and quantity of $\mathcal{D}'$.
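The filtering rule can be sketched as follows, assuming one-hot labels and pre-computed mentor softmax outputs. The function name `filter_training_set` and the use of a strict inequality are assumptions made for illustration.

```python
def filter_training_set(samples, labels, mentor_probs, tau):
    """Keep (x, y) only if the mentor's probability for class y exceeds tau."""
    kept = []
    for x, y, h in zip(samples, labels, mentor_probs):
        # Dot product of the one-hot label with the mentor's softmax output.
        prob_of_given_class = sum(y_j * h_j for y_j, h_j in zip(y, h))
        if prob_of_given_class > tau:
            kept.append((x, y))
    return kept

samples = ["img_a", "img_b", "img_c"]
labels = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
mentor_probs = [
    [0.80, 0.15, 0.05],   # mentor agrees with the label
    [0.70, 0.10, 0.20],   # mentor assigns 0.10 to the given class
    [0.30, 0.30, 0.40],   # mentor is unsure but leans toward the label
]
filtered = filter_training_set(samples, labels, mentor_probs, tau=0.3)
```

Raising the threshold removes more samples, trading the quantity of remaining training data for a lower expected fraction of wrong labels.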
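Second, we improve the effectiveness of the consistency loss by merging the predictions from the mentor model and the teacher model to produce more reliable predictions. The new consistency loss $\mathcal{J}'(\theta'_m)$ is the KL divergence between the predictions of the updated model, $f(X, \theta'_m)$, and the merged prediction
$$s(X) = \lambda f(X, \tilde{\theta}) + (1 - \lambda) h(X, \theta^*),$$
where $\lambda$ is a weight to control the importance of the teacher model and the mentor model. It is ramped up from 0 to 0.5 as training proceeds.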
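The merged prediction and its ramp-up can be sketched as below. The sketch assumes that the weight multiplies the teacher prediction (so that training starts by relying on the mentor alone) and that the ramp is linear; both choices, and the function names, are assumptions rather than details stated in the text.

```python
def ramp_weight(step, ramp_steps, final=0.5):
    """Weight that grows linearly from 0 to `final` over `ramp_steps` steps."""
    return final * min(1.0, step / ramp_steps)

def merged_prediction(teacher_probs, mentor_probs, lam):
    """Convex combination of teacher and mentor softmax outputs."""
    return [lam * t + (1.0 - lam) * h
            for t, h in zip(teacher_probs, mentor_probs)]

teacher_probs = [0.6, 0.3, 0.1]
mentor_probs = [0.2, 0.7, 0.1]

start = merged_prediction(teacher_probs, mentor_probs, ramp_weight(0, 1000))
end = merged_prediction(teacher_probs, mentor_probs, ramp_weight(1000, 1000))
```

Since both inputs are probability vectors and the weights sum to one, the merged prediction is again a valid probability vector and can be used directly as the target of the consistency loss.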
We train for three iterations in our experiments. The mentor model is the best model from the previous iteration. We observe that further training iterations beyond three do not give noticeable performance improvement.
| Cross Entropy | 93.5 | 91.0 | 88.4 | 85.0 | 78.4 | 41.1 |
| Cross Entropy (reproduced) | 91.84±0.05 | 90.33±0.06 | 87.85±0.08 | 84.62±0.08 | 78.06±0.16 | 45.85±0.91 |
| Joint Optimization | 93.4 | 92.7 | 91.4 | 89.6 | 85.9 | 58.0 |
| MLNT-student (1st iter.) | 93.18±0.07 | 92.16±0.05 | 90.57±0.08 | 87.68±0.06 | 81.96±0.19 | 55.45±1.11 |
| MLNT-teacher (1st iter.) | 93.21±0.07 | 92.43±0.05 | 91.06±0.07 | 88.43±0.05 | 83.27±0.22 | 57.39±1.13 |
| MLNT-student (2nd iter.) | 93.24±0.09 | 92.63±0.07 | 91.99±0.13 | 89.71±0.07 | 86.28±0.19 | 58.21±1.09 |
| MLNT-teacher (2nd iter.) | 93.35±0.07 | 92.91±0.09 | 91.89±0.06 | 90.03±0.08 | 86.24±0.18 | 58.33±1.10 |
| MLNT-student (3rd iter.) | 93.29±0.08 | 92.91±0.10 | 92.02±0.09 | 90.27±0.10 | 86.95±0.17 | 58.57±1.12 |
| MLNT-teacher (3rd iter.) | 93.52±0.08 | 93.24±0.08 | 92.50±0.07 | 90.65±0.09 | 87.11±0.19 | 59.09±1.12 |
Test accuracy (%) on CIFAR-10 under label noise. We report the mean and standard error across 5 runs.
For CIFAR-10, we split 10% of the training data for validation, and artificially corrupt the rest of the training data with two types of label noise: symmetric and asymmetric. The symmetric label noise is injected by using a random one-hot vector to replace the ground-truth label of a sample with a probability of $r$. The asymmetric label noise is designed to mimic some of the structure of real mistakes for similar classes: TRUCK → AUTOMOBILE, BIRD → AIRPLANE, DEER → HORSE, CAT ↔ DOG. Label transitions are parameterized by $r$ such that the true class and the wrong class have probabilities of $1 - r$ and $r$, respectively.
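The two corruption schemes can be sketched as follows, using class names instead of one-hot vectors for readability. The function names and the use of Python's `random` module are illustrative assumptions; the class list is the standard CIFAR-10 label set.

```python
import random

CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]

# Asymmetric transitions between visually similar classes.
ASYMMETRIC_MAP = {"truck": "automobile", "bird": "airplane",
                  "deer": "horse", "cat": "dog", "dog": "cat"}

def symmetric_noise(labels, r, rng):
    """With probability r, replace a label by a uniformly random class
    (which may coincide with the original class)."""
    return [rng.choice(CLASSES) if rng.random() < r else y for y in labels]

def asymmetric_noise(labels, r, rng):
    """With probability r, flip a label to its designated similar class."""
    return [ASYMMETRIC_MAP[y] if y in ASYMMETRIC_MAP and rng.random() < r else y
            for y in labels]

rng = random.Random(0)
clean = ["truck", "frog", "cat", "dog", "ship", "deer"]
noisy_sym = symmetric_noise(clean, r=0.5, rng=rng)
noisy_asym = asymmetric_noise(clean, r=0.4, rng=rng)
```

Under the asymmetric scheme, classes without a designated transition (such as frog or ship) are never corrupted, so the noise is concentrated on a few confusable class pairs.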
Clothing1M consists of 1M images collected from online shopping websites, which are classified into 14 classes, e.g. t-shirt, sweater, jacket. The labels are generated using surrounding texts provided by sellers, which contain real-world errors. We use the clean validation and test data for validation and testing, respectively, but we do not use the clean training data.
| Cross Entropy | 91.8 | 90.8 | 90.0 | 87.1 | 77.3 |
| Cross Entropy (reproduced) | 91.04±0.07 | 90.19±0.09 | 88.88±0.06 | 86.34±0.22 | 77.48±0.79 |
| Joint Optimization | 93.2 | 92.7 | 92.4 | 91.5 | 84.6 |
| MLNT-student (1st iter.) | 92.89±0.11 | 91.84±0.10 | 90.55±0.09 | 88.70±0.13 | 79.95±0.71 |
| MLNT-teacher (1st iter.) | 93.05±0.10 | 92.19±0.09 | 91.47±0.04 | 88.69±0.08 | 78.44±0.45 |
| MLNT-student (2nd iter.) | 93.01±0.12 | 92.65±0.11 | 91.87±0.12 | 90.60±0.12 | 81.53±0.66 |
| MLNT-teacher (2nd iter.) | 93.33±0.13 | 92.97±0.11 | 92.43±0.19 | 90.93±0.15 | 81.47±0.54 |
| MLNT-student (3rd iter.) | 93.36±0.14 | 92.98±0.13 | 92.59±0.10 | 91.87±0.12 | 82.25±0.68 |
| MLNT-teacher (3rd iter.) | 93.61±0.10 | 93.25±0.12 | 92.82±0.18 | 92.30±0.10 | 82.09±0.47 |
For experiments on CIFAR-10, we follow the same experimental setting as  and use the network based on PreAct ResNet-32 . By common practice , we normalize the images, and perform data augmentation by random horizontal flip and
random cropping after padding 4 pixels per side. We use a batch size of 128, a step size $\alpha$, a learning rate $\beta$, and update $\theta$ using SGD with a momentum of 0.9 and weight decay. For each training iteration, we divide the learning rate by 10 after 80 epochs, and train until 120 epochs. For the initial iteration, we ramp up $\eta$ (the meta-learning rate) from 0 to 0.4 during the first 20 epochs, and keep $\eta = 0.4$ for the rest of the training. For the EMA decay $\gamma$, we use a smaller value for the first 20 epochs and a larger value later on, because the student improves quickly early in training, and thus the teacher should have a shorter memory. In the ablation study (Section IV-D), we show the effect of three important hyper-parameters, namely $M$, the number of synthetic mini-batches, $\rho$, the number of samples with label replacement, and $\tau$, the threshold for data filtering. The values of all hyper-parameters are determined via validation.
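The two schedules described for CIFAR-10 can be written as small functions of the epoch index. The sketch assumes zero-based epochs and a linear ramp, neither of which is stated in the text; `base_lr` is left as a parameter and the function names are hypothetical.

```python
def learning_rate(epoch, base_lr):
    """Step schedule: divide the learning rate by 10 after 80 epochs."""
    return base_lr / 10 if epoch >= 80 else base_lr

def meta_learning_rate(epoch, final=0.4, ramp_epochs=20):
    """Ramp the meta-learning rate from 0 to `final` over the first
    `ramp_epochs` epochs (linear shape assumed), then hold it constant."""
    return final * min(1.0, epoch / ramp_epochs)

TOTAL_EPOCHS = 120
base_lr = 0.1  # placeholder value for illustration only
schedule = [(e, learning_rate(e, base_lr), meta_learning_rate(e))
            for e in range(TOTAL_EPOCHS)]
```

Starting the meta-learning rate at zero means the first updates are close to conventional training, and the meta-objective gains influence only gradually.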
For experiments on Clothing1M, we follow previous works [23, 16] and use the ResNet-50  pre-trained on ImageNet. For preprocessing, we resize the image to , crop the middle as input, and perform normalization. We use a batch size , a step size , a learning rate , and update using SGD with a momentum of 0.9 and a weight decay of . We train for 3 epochs for each iteration. During the first 2000 mini-batches in the initial iteration, we ramp up from 0 to , and set . For the rest of the training, we use and . Other hyper-parameters are set as , , and .
We compare the proposed MLNT with multiple baseline methods on the CIFAR-10 dataset with symmetric and asymmetric label noise at different noise ratios $r$. The baselines include:
(1) Cross Entropy: conventional training without the meta-learning update. We report both the results from  and from our own implementation.
(2) Forward : forward loss correction based on the noise transition matrix.
(3) CNN-CRF : a CRF model is proposed to represent the relationship between noisy and clean labels. It requires a small set of clean labels during training.
(4) Joint Optimization: alternately updates network parameters and corrects labels during training.
Both Forward  and CNN-CRF  require the ground-truth noise transition matrix, and Joint Optimization  requires the ground-truth class distribution among training data. Our method does not require any prior knowledge on the data, thus is more general. Note that all baselines use the same network architecture as our method. We report the numbers published in .
Our reproduced Cross Entropy results differ from the published numbers. The reason could be the different programming frameworks used (we use PyTorch, whereas the published results were obtained with Chainer). For both types of noise, the proposed MLNT method with one training iteration significantly improves accuracy compared to Cross Entropy (reproduced), and achieves comparable performance to state-of-the-art methods. Iterative training further improves the performance. MLNT-teacher after three iterations significantly outperforms previous methods. An exception where MLNT does not outperform the baselines is 50% asymmetric label noise. This is because the asymmetric label noise is generated by exchanging the CAT and DOG classes, and it is theoretically impossible to distinguish them without prior knowledge when the noise ratio is 50%.
In Table I, we also show the results with clean training data ($r = 0$). The proposed MLNT achieves an improvement in accuracy compared to Cross Entropy (reproduced), which shows the regularization effect of the proposed meta-objective.
Progressive Comparison. Figure 3 plots the model’s accuracy on noisy training data and its test accuracy on clean test data as training proceeds. We show a representative training process using asymmetric label noise with . Accuracy is calculated every epoch, and training accuracy is computed across all mini-batches within the epoch. Comparing the proposed MLNT methods (1st iter.) with Cross Entropy, MLNT learns more quickly during the beginning of training, as shown by the higher test accuracy. Cross Entropy has the most unstable training process, as shown by its fluctuating test accuracy curve. MLNT-student is more stable because of the regularization from the meta-objective, whereas MLNT-teacher is extremely stable because its parameters change smoothly during training. At the 80th epoch, the learning rate is divided by 10, which causes a drastic increase in both training and test accuracy for MLNT-student and Cross Entropy. After the 80th epoch, the model begins to overfit because of the small learning rate. However, the proposed MLNT-student suffers less overfitting compared to Cross Entropy, as shown by its lower training accuracy and higher test accuracy.
Hyper-parameters. We conduct an ablation study to examine the effect of three hyper-parameters: $M$, $\rho$, and $\tau$. $M$ is the number of mini-batches with synthetic noisy labels that we generate for each mini-batch from the original training data. Intuitively, with larger $M$, the model is exposed to a wider variety of label noise, and thus can learn to be more noise-tolerant. In Figure 4 we show the test accuracy on CIFAR-10 for MLNT-student (1st iter.) with $M \in \{0, 5, 10, 15\}$ ($M = 0$ is the same as Cross Entropy) trained using labels with symmetric noise (SN) and asymmetric noise (AN) of different ratios. The result shows that the accuracy indeed increases as $M$ increases. The increase is most significant when $M$ changes from 0 to 5, and is marginal when $M$ changes from 10 to 15. Therefore, the experiments in this paper are conducted using $M = 10$, as a trade-off between the training speed and the model’s performance.
is the number of samples whose labels are changed in each synthetic mini-batch of size . We experiment with , which correspond to samples with a batch size of 128. Figure 5 shows the performance of MLNT-student (1st iter.) using different trained on CIFAR-10 with different ratio of asymmetric label noise. The performance is insensitive to the value of . For different noise ratio, the optimal generally falls into the range of .
$\tau$ is the threshold that determines which samples are filtered out by the mentor model during the 2nd and 3rd training iterations. It controls the balance between the quality and quantity of the data that is used by the classification loss. In Figure 6 we show the performance of MLNT-student (2nd iter.) trained using different values of $\tau$. As the noise ratio increases, the optimal value of $\tau$ also increases to filter out more samples. Figure 7 shows some example images from the Clothing1M dataset that are filtered out and their corresponding probability scores given by the mentor model.
Full optimization. We have been using a first-order approximation to optimize the meta loss $\mathcal{L}_{meta}$ for faster computation speed. Here we conduct experiments using full optimization by including the second-order derivative with respect to $\theta$. Table III shows the comparison on four representative sets of experiments with different label noise. We show the test accuracy (averaged across 5 runs) of MLNT-student (1st iter.) trained with full optimization and with the first-order approximation. The result shows that the performance of the first-order approximation is nearly the same as that obtained with full second derivatives. This suggests that the improvement of MLNT mostly comes from the gradients of the meta loss at the updated parameter values, rather than from the second-order gradients.
| Method | Test accuracy (%) |
| #1 Cross Entropy | 69.15 |
| Cross Entropy (reproduced) | 69.28 |
| #2 Forward | 69.84 |
| #3 Joint Optimization | 72.16 |
| MLNT-student (1st iter.) | 72.34 |
| MLNT-teacher (1st iter.) | 72.08 |
| MLNT-student (2nd iter.) | 73.13 |
| MLNT-teacher (2nd iter.) | 73.10 |
| MLNT-student (3rd iter.) | 73.44 |
| MLNT-teacher (3rd iter.) | 73.47 |
We demonstrate the efficacy of the proposed method on real-world noisy labels using the Clothing1M dataset. The results are shown in Table IV. The accuracies for baselines #1, #2 and #3 are the numbers reported in prior work. The proposed MLNT method with one training iteration achieves better performance compared to state-of-the-art methods. After three training iterations, MLNT-teacher reaches 73.47%, compared with 69.28% for our reproduced Cross Entropy and 72.16% for the best baseline method #3.
In this paper, we propose a meta-learning method to learn from noisy labeled data, where a meta-learning update is performed prior to the conventional gradient update. The proposed meta-objective aims to find noise-tolerant model parameters that are less prone to overfitting. In the meta-train step, we generate multiple mini-batches with synthetic noisy labels, and use them to update the parameters. In the meta-test step, we apply a consistency loss between each updated model and a teacher model, and train the original parameters to minimize the total consistency loss. In addition, we propose an iterative training scheme, where the model from the previous iteration is used to clean data and refine predictions. We evaluate the proposed method on two datasets. The results validate the advantageous performance of our method compared to state-of-the-art methods. Our code will be available at www.url.com. For future work, we plan to apply the proposed model-agnostic training method to other domains with different model architectures, such as learning Recurrent Neural Networks for machine translation with corrupted ground-truth sentences.
Cleannet: Transfer learning for scalable image classifier training with label noise. In CVPR, pages 5447–5456, 2018.
Journal of Machine Learning Research, 15(1):1929–1958, 2014.
Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS, pages 1195–1204, 2017.