Learning to Learn from Noisy Labeled Data

12/13/2018 · by Junnan Li, et al.

Despite the success of deep neural networks (DNNs) in image classification tasks, their human-level performance relies on massive training data with high-quality manual annotations, which are expensive and time-consuming to collect. There exist many inexpensive data sources on the web, but they tend to contain inaccurate labels. Training on noisy labeled datasets causes performance degradation because DNNs can easily overfit to the label noise. To overcome this problem, we propose a noise-tolerant training algorithm, where a meta-learning update is performed prior to the conventional gradient update. The proposed meta-learning method simulates actual training by generating synthetic noisy labels, and trains the model such that after one gradient update using each set of synthetic noisy labels, the model does not overfit to the specific noise. We conduct extensive experiments on the noisy CIFAR-10 dataset and the Clothing1M dataset. The results demonstrate the advantageous performance of the proposed method compared to several state-of-the-art baselines.


Code Repository: MLNT (Meta-Learning based Noise-Tolerant Training)

I Introduction

One of the key reasons why deep neural networks (DNNs) have been so successful in image classification is the availability of massive labeled datasets such as ImageNet [19] and COCO [14]. However, it is time-consuming and expensive to collect such high-quality manual annotations. A single image often requires agreement from multiple annotators to reduce label error. On the other hand, there exist less expensive ways to collect labeled data, such as search engines, social media websites, or reducing the number of annotators per image. However, those low-cost approaches introduce low-quality annotations with label noise. Many studies have shown that label noise can significantly affect the accuracy of the learned classifiers [2, 22, 31]. In this work, we address the following problem: how can we effectively train on noisy labeled datasets?

Some methods learn with label noise by relying on human supervision to verify seed images [28, 11] or estimate label confusion [30, 16]. However, those methods scale poorly to large datasets. On the other hand, methods without human supervision (e.g., label correction [18, 23] and noise correction layers [22, 5]) are scalable but less effective and more heuristic. In this work we propose meta-learning based noise-tolerant (MLNT) training to learn from noisy labeled data without human supervision or access to any clean labels. Rather than designing a specific model, we propose a model-agnostic training algorithm, which is applicable to any model that is trained with a gradient-based learning rule.

The prominent issue in training DNNs on noisy labeled data is that DNNs often overfit to the noise, which leads to performance degradation. Our method addresses this issue by optimizing for model parameters that are less prone to overfitting and more robust against label noise. Specifically, for each mini-batch, we propose a meta-objective to train the model such that after the model goes through a conventional gradient update, it does not overfit to the label noise. The proposed meta-objective encourages the model to produce consistent predictions after it is trained on a variety of synthetic noisy labels. The key idea of our method is that a noise-tolerant model should be able to consistently learn the underlying knowledge from data despite different label noise. The main contributions of this work are as follows.

  • We propose a noise-tolerant training algorithm, where a meta-objective is optimized before conventional training. Our method can in principle be applied to any model trained with a gradient-based rule. We aim to optimize for a model that does not overfit to a wide spectrum of artificially generated label noise.

  • We formulate our meta-objective as follows: train the model such that after it learns from various synthetic noisy labels via gradient updates, the updated models give predictions consistent with a teacher model. We adapt a self-ensembling method to construct the teacher model, which gives more reliable predictions that are unaffected by the synthetic noise.

  • We perform experiments on two datasets with synthetic and real-world label noise, and demonstrate the advantageous performance of the proposed method in image classification tasks compared to state-of-the-art methods. In addition, we conduct an extensive ablation study to examine different components of the proposed method.

II Related Work

Learning with label noise. A number of approaches have been proposed to train DNNs with noisy labeled data. One line of approaches formulates explicit or implicit noise models to characterize the distribution of noisy and true labels, using neural networks [22, 28, 16, 5, 8, 11], directed graphical models [30], knowledge graphs [13], or conditional random fields [26]. The noise models are then used to infer the true labels or assign smaller weights to noisy samples. However, these methods often require a small set of data with clean labels to be available, or use expensive estimation methods. They also rely on specific assumptions about the noise model, which may limit their effectiveness with complicated label noise. Another line of approaches uses correction methods to reduce the influence of noisy labels. The bootstrap method [18] introduces a consistency objective that effectively re-labels the data during training. Tanaka et al. [23] propose to jointly optimize network parameters and data labels. An iterative training method has been proposed to identify and downweight noisy samples [29]. A few other methods use noise-tolerant loss functions to achieve robust learning under label noise [27, 4, 3].

Meta-Learning. Recently, meta-learning methods for DNNs have resurged in popularity. Meta-learning generally seeks to perform learning at a level higher than where conventional learning occurs, e.g., learning the update rule of a learner [17], or finding weight initializations that can be easily fine-tuned [1] or transferred [12]. Our approach is most related to MAML [1], which aims to train model parameters that can learn well from a few examples and a few gradient descent steps. Both MAML and our method are model-agnostic and perform training by doing gradient updates on simulated meta-tasks. However, our objective and algorithm are different from those of MAML. MAML addresses few-shot transfer to new tasks, whereas we aim to learn a noise-tolerant model. Moreover, MAML trains using a classification loss on a meta-test set, whereas we use a consistency loss with a teacher model.

Self-Ensembling. Several recent methods based on self-ensembling have improved the state-of-the-art results for semi-supervised learning [20, 10, 24], where labeled samples are scarce and unlabeled samples are abundant. These methods apply a consistency loss to the unlabeled samples, which regularizes a neural network to make consistent predictions for the same samples under different data augmentation [20], or different dropout and noise conditions [10]. We focus in particular on the self-ensembling approach proposed by Tarvainen & Valpola [24], as it forms one of the bases of our approach. Their approach uses two networks: a student network and a teacher network, where the weights of the teacher are the exponential moving average of those of the student. The student network is trained to make consistent predictions with the teacher network. In our method, we use the teacher network in meta-test to train the student network such that it is more tolerant to label noise.

III Method

Fig. 1: Left: a conventional gradient update with cross entropy loss may overfit to label noise. Right: a meta-learning update is performed beforehand using synthetic label noise, which encourages the network parameters to be noise-tolerant and reduces overfitting during the conventional update.
Fig. 2: Illustration of the proposed meta-learning based noise-tolerant (MLNT) training. For each mini-batch of training data, a meta loss is minimized before training on the conventional classification loss. We first generate multiple mini-batches of synthetic noisy labels with random neighbor label transfer (marked by the orange arrow). Random neighbor label transfer can preserve the underlying noise transitions (e.g., DEER → HORSE, CAT ↔ DOG), therefore generating synthetic label noise with a distribution similar to that of the original data. For each synthetic mini-batch, we update the parameters with gradient descent, and enforce the updated model to give consistent predictions with a teacher model. The meta-objective is to minimize the consistency loss across all updated models w.r.t. $\theta$.

III-A Problem Statement

We consider a classification problem with a training set $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i$ denotes the $i$-th sample and $y_i \in \{0,1\}^c$ is a one-hot vector representing the corresponding noisy label over $c$ classes. Let $f(x_i, \theta)$ denote the discriminative function of a neural network parameterized by $\theta$, which maps an input $x_i$ to an output of the $c$-class softmax layer. The conventional objective for supervised classification is to minimize an empirical risk, such as the cross entropy loss:

$$\mathcal{L}_c(\mathcal{D}, \theta) = -\frac{1}{n}\sum_{i=1}^{n} y_i \cdot \log\bigl(f(x_i, \theta)\bigr), \qquad (1)$$

where $\cdot$ denotes the dot product.
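For concreteness, Equation 1 can be implemented directly as the negative dot product between the one-hot labels and the log-softmax outputs. A minimal PyTorch sketch (the function name is ours, not from the paper):

```python
import torch
import torch.nn.functional as F

def cross_entropy_onehot(logits, onehot_labels):
    """Eq. 1: mean over the batch of -y_i . log f(x_i, theta),
    where f outputs softmax probabilities and y_i is one-hot."""
    log_probs = F.log_softmax(logits, dim=1)            # log of the softmax outputs
    return -(onehot_labels * log_probs).sum(dim=1).mean()
```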

However, since $y_i$ contains noise, the neural network can overfit and perform poorly on the test set. We propose a meta-objective that encourages the network to learn noise-tolerant parameters. The details are delineated next.

III-B Meta-Learning based Noise-Tolerant Training

Our method can learn the parameters of a DNN model in such a way as to "prepare" the model for label noise. The intuition behind our method is that when training with a gradient-based rule, some network parameters are more tolerant to label noise than others. How can we encourage the emergence of such noise-tolerant parameters? We achieve this by introducing a meta-learning update before the conventional update for each mini-batch. The meta-learning update simulates the process of training with label noise and makes the network less prone to overfitting. Specifically, for each mini-batch of training data, we generate a variety of synthetic noisy labels on the same images. With each set of synthetic noisy labels, we update the network parameters using one gradient update, and enforce the updated network to give consistent predictions with a teacher model unaffected by the synthetic noise. As shown in Figure 1, the meta-learning update optimizes the model so that it learns better from the conventional gradient update on the original mini-batch. In effect, we aim to find model parameters that are less sensitive to label noise and can consistently learn the underlying knowledge from data despite label noise. The proposed meta-learning update consists of two procedures: meta-train and meta-test.

Meta-Train. Formally, at each training step, we consider a mini-batch of data $(X, Y)$ sampled from the training set, where $X = \{x_1, \ldots, x_k\}$ are the samples and $Y = \{y_1, \ldots, y_k\}$ are the corresponding noisy labels. We want to generate multiple mini-batches of noisy labels $\{\hat{Y}^1, \ldots, \hat{Y}^M\}$ with a label noise distribution similar to that of $Y$. We describe the procedure for generating one set of noisy labels $\hat{Y}^m$. First, we randomly select $\rho$ samples out of the mini-batch of $k$ samples. For each selected sample $x_i$, we rank its neighbors within the mini-batch. Then we randomly select a neighbor $x_j$ from its top 10 nearest neighbors (10 is experimentally determined), and use the neighbor's label to replace the label for $x_i$: $\hat{y}_i^m = y_j$. Because we transfer labels among neighbors, the synthetic noisy labels follow a distribution similar to that of the original noisy labels. We repeat the above procedure $M$ times to generate $M$ mini-batches of synthetic noisy labels. Note that we compute nearest neighbors based on the Euclidean distance between feature representations (pre-softmax layer activations) generated by a DNN pre-trained on the entire noisy training set $\mathcal{D}$.
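A sketch of one way to implement the random neighbor label transfer described above; the feature matrix is assumed to come from the pre-trained DNN, and the argument names (rho, num_neighbors) are our own:

```python
import torch

def neighbor_label_transfer(features, labels, rho, num_neighbors=10):
    """Generate one mini-batch of synthetic noisy labels: replace the labels
    of rho randomly chosen samples with the label of a random top-10 nearest
    neighbor (Euclidean distance in feature space). Works for one-hot or
    class-index labels, since whole rows/entries are copied."""
    k = features.size(0)
    synthetic = labels.clone()
    dists = torch.cdist(features, features)        # (k, k) pairwise distances
    dists.fill_diagonal_(float('inf'))             # a sample is not its own neighbor
    nn_idx = dists.topk(num_neighbors, largest=False).indices  # (k, num_neighbors)
    for i in torch.randperm(k)[:rho]:              # the rho samples to corrupt
        j = nn_idx[i, torch.randint(num_neighbors, (1,)).item()]
        synthetic[i] = labels[j]                   # transfer the neighbor's label
    return synthetic
```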

Let $\theta$ denote the current model's parameters. For each synthetic mini-batch $\hat{Y}^m$, we update $\theta$ to $\theta'_m$ using one gradient descent step on the mini-batch:

$$\theta'_m = \theta - \alpha \nabla_\theta \mathcal{L}_c(X, \hat{Y}^m, \theta), \qquad (2)$$

where $\mathcal{L}_c$ is the cross entropy loss described in Equation 1, and $\alpha$ is the step size.
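The inner update of Equation 2 can be written functionally so that the later meta-gradient can flow through it. A hedged PyTorch sketch (our helper, not the authors' code):

```python
import torch

def inner_update(model, loss, alpha, first_order=False):
    """Eq. 2: theta'_m = theta - alpha * grad_theta L_c(X, Y^m, theta).
    With create_graph=True the meta-gradient can later flow through this
    step; first_order=True drops the second-order terms (see Section III-B)."""
    params = list(model.parameters())
    grads = torch.autograd.grad(loss, params, create_graph=not first_order)
    return [p - alpha * g for p, g in zip(params, grads)]
```

Evaluating the network under the returned parameters requires a functional forward pass; recent PyTorch versions provide torch.func.functional_call for this, as used in the combined sketch after Algorithm 1.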

Meta-Test. Our meta-objective is to train $\theta$ such that each set of updated parameters $\theta'_m$ does not overfit to the specific noisy labels $\hat{Y}^m$. We achieve this by enforcing each updated model to give consistent predictions with a teacher model. We consider the model parameterized by $\theta$ as the student model, and construct the teacher model, parameterized by $\tilde{\theta}$, following the self-ensembling method [24]. The teacher model usually gives better predictions than the student model. Its parameters are computed as the exponential moving average (EMA) of the student's parameters. Specifically, at each training step, we update $\tilde{\theta}$ with:

$$\tilde{\theta} \leftarrow \gamma \tilde{\theta} + (1 - \gamma)\theta, \qquad (3)$$

where $\gamma$ is a smoothing coefficient hyper-parameter.
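A minimal sketch of the EMA update in Equation 3 (assuming student and teacher share the same architecture):

```python
import torch

@torch.no_grad()
def update_teacher(student, teacher, gamma):
    """Eq. 3: teacher <- gamma * teacher + (1 - gamma) * student."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(gamma).add_(s_p, alpha=1.0 - gamma)
```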

Since the teacher model is unaffected by the synthetic label noise, we enforce a consistency loss that encourages each updated model (with parameters $\theta'_m$) to give consistent predictions with the teacher model on the same input $x_i$. We define the consistency loss as the Kullback-Leibler (KL) divergence between the softmax predictions $f(x_i, \tilde{\theta})$ of the teacher model and the softmax predictions $f(x_i, \theta'_m)$ of the updated model. We find that the KL-divergence produces better results compared to the mean squared error used in [24].

$$\mathcal{L}_{cons}(\theta'_m, x_i) = D_{KL}\bigl(f(x_i, \tilde{\theta}) \,\|\, f(x_i, \theta'_m)\bigr), \qquad (4)$$

$$\mathcal{L}_{cons}(\theta'_m, X) = \frac{1}{k}\sum_{i=1}^{k} \mathcal{L}_{cons}(\theta'_m, x_i). \qquad (5)$$
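Equations 4-5 map directly onto PyTorch's KL-divergence; a sketch assuming both models output logits rather than probabilities:

```python
import torch.nn.functional as F

def consistency_loss(updated_logits, teacher_logits):
    """Eqs. 4-5: mean KL(teacher || updated student) over the mini-batch.
    F.kl_div expects log-probabilities as input and probabilities as target."""
    log_p_updated = F.log_softmax(updated_logits, dim=1)
    p_teacher = F.softmax(teacher_logits, dim=1).detach()  # teacher is not optimized
    return F.kl_div(log_p_updated, p_teacher, reduction='batchmean')
```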

We want to minimize the consistency loss for all $M$ updated models with parameters $\{\theta'_m\}_{m=1}^{M}$. Therefore, the meta loss is defined as the average of all consistency losses:

$$\mathcal{L}_{meta}(\theta) = \frac{1}{M}\sum_{m=1}^{M} \mathcal{L}_{cons}(\theta'_m, X) \qquad (6)$$

$$= \frac{1}{M}\sum_{m=1}^{M} \mathcal{L}_{cons}\bigl(\theta - \alpha \nabla_\theta \mathcal{L}_c(X, \hat{Y}^m, \theta),\, X\bigr). \qquad (7)$$
1: Randomly initialize $\theta$
2: Initialize teacher model: $\tilde{\theta} \leftarrow \theta$
3: while not done do
4:     Sample a mini-batch $(X, Y)$ of size $k$ from $\mathcal{D}$.
5:     for $m = 1, \ldots, M$ do
6:         Generate synthetic noisy labels $\hat{Y}^m$ by random neighbor label transfer
7:         Compute updated parameters with gradient descent: $\theta'_m = \theta - \alpha \nabla_\theta \mathcal{L}_c(X, \hat{Y}^m, \theta)$
8:         Evaluate consistency loss with teacher: $\mathcal{L}_{cons}(\theta'_m, X)$
9:     end for
10:    Evaluate $\mathcal{L}_{meta}(\theta) = \frac{1}{M}\sum_{m=1}^{M} \mathcal{L}_{cons}(\theta'_m, X)$
11:    Meta-learning update: $\theta \leftarrow \theta - \beta \nabla_\theta \mathcal{L}_{meta}(\theta)$
12:    Evaluate classification loss $\mathcal{L}_c(X, Y, \theta)$
13:    Update $\theta \leftarrow \theta - \lambda \nabla_\theta \mathcal{L}_c(X, Y, \theta)$
14:    Update teacher model: $\tilde{\theta} \leftarrow \gamma \tilde{\theta} + (1 - \gamma)\theta$
15: end while
Algorithm 1 Meta-Learning based Noise-Tolerant Training

Although $\mathcal{L}_{meta}$ is computed using the updated model parameters $\theta'_m$, the optimization is performed over the student model parameters $\theta$. We perform stochastic gradient descent (SGD) to minimize the meta loss. The model parameters $\theta$ are updated as follows:

$$\theta \leftarrow \theta - \beta \nabla_\theta \mathcal{L}_{meta}(\theta), \qquad (8)$$

where $\beta$ is the meta-learning rate.

After the meta-learning update, we perform SGD to optimize the classification loss on the original mini-batch $(X, Y)$:

$$\theta \leftarrow \theta - \lambda \nabla_\theta \mathcal{L}_c(X, Y, \theta), \qquad (9)$$

where $\lambda$ is the learning rate. The full algorithm is outlined in Algorithm 1.

Note that the meta-gradient involves a gradient through a gradient, which requires calculating second-order derivatives with respect to $\theta$. In our experiments we use a first-order approximation that omits the second-order derivatives, which significantly increases the computation speed. The comparison in Section IV-D shows that this approximation performs almost as well as using second-order derivatives. This provides another intuition to explain our method: the first-order approximation treats the term $\nabla_\theta \mathcal{L}_c(X, \hat{Y}^m, \theta)$ in Equation 7 as a constant. Therefore, we can view the update as injecting data-dependent noise into the parameters, and adding noise to the network during training has been shown by many studies to have a regularization effect [21, 10].
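Putting the pieces together, below is a condensed sketch of one MLNT training step under the first-order approximation, reusing the helper functions sketched above. The hyper-parameter values shown are placeholders, and torch.func.functional_call (PyTorch 2.x) is assumed for evaluating the model under the updated parameters:

```python
import torch

def mlnt_step(model, teacher, x, y, features, opt_meta, opt_main,
              M=10, rho=64, alpha=0.2, gamma=0.999):
    """One training step of Algorithm 1 (first-order approximation): the
    inner gradients are not tracked (no create_graph), so the term
    grad L_c(X, Y^m, theta) in Eq. 7 is treated as a constant.
    opt_meta / opt_main are SGD optimizers over model.parameters() with
    learning rates beta and lambda; rho and alpha here are placeholders."""
    with torch.no_grad():
        teacher_logits = teacher(x)                        # unaffected by synthetic noise
    meta_loss = 0.0
    for _ in range(M):
        y_syn = neighbor_label_transfer(features, y, rho)  # meta-train labels
        loss_syn = cross_entropy_onehot(model(x), y_syn)
        grads = torch.autograd.grad(loss_syn, list(model.parameters()))
        updated = {name: p - alpha * g                     # Eq. 2, grads as constants
                   for (name, p), g in zip(model.named_parameters(), grads)}
        logits_upd = torch.func.functional_call(model, updated, (x,))
        meta_loss = meta_loss + consistency_loss(logits_upd, teacher_logits)
    meta_loss = meta_loss / M                              # Eq. 6
    opt_meta.zero_grad(); meta_loss.backward(); opt_meta.step()   # Eq. 8
    loss_c = cross_entropy_onehot(model(x), y)             # Eq. 9 on the original labels
    opt_main.zero_grad(); loss_c.backward(); opt_main.step()
    update_teacher(model, teacher, gamma)                  # Eq. 3
```

Because the inner gradients are detached, the meta-gradient flows only through the subtraction in Eq. 2, which is exactly the first-order approximation described above.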

III-C Iterative Training

We propose an iterative training scheme for two purposes: (1) remove samples with potentially wrong class labels from the classification loss $\mathcal{L}_c$; (2) improve the predictions of the teacher model so that the consistency loss is more effective.

First, we perform an initial training iteration following the method described in Algorithm 1, and acquire the model with the best validation accuracy (usually the teacher). We name that model the mentor and use $\theta^*$ to denote its parameters. In the second training iteration, we repeat the steps in Algorithm 1 with the two changes described as follows.

First, if the classification loss $\mathcal{L}_c$ is applied to the entire training set $\mathcal{D}$, samples with wrong ground-truth labels can corrupt training. Therefore, we remove a sample from the classification loss if the mentor model assigns a low probability to its ground-truth class. In effect, the classification loss now samples mini-batches from a filtered training set $\tilde{\mathcal{D}}$ which contains fewer corrupted samples:

$$\tilde{\mathcal{D}} = \bigl\{(x_i, y_i) \mid y_i \cdot s_i \geq \tau\bigr\}, \qquad (10)$$

where $s_i = f(x_i, \theta^*)$ is the softmax prediction of the mentor model, and $\tau$ is a threshold to control the balance between the quality and quantity of $\tilde{\mathcal{D}}$.
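A sketch of the filtering step in Equation 10, assuming a data loader that also yields dataset indices (our assumption, not specified in the paper):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def filter_dataset(mentor, loader, tau):
    """Eq. 10: keep sample i only if the mentor's softmax probability for
    the (possibly noisy) ground-truth class, y_i . s_i, is at least tau.
    Assumes the loader yields (image, one-hot label, dataset index)."""
    kept = []
    for x, y, idx in loader:
        s = F.softmax(mentor(x), dim=1)          # mentor predictions s_i
        conf = (y * s).sum(dim=1)                # y_i . s_i
        kept.extend(idx[conf >= tau].tolist())
    return kept                                  # indices of the filtered set
```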

Second, we improve the effectiveness of the consistency loss by merging the predictions from the mentor model and the teacher model to produce more reliable predictions. The new consistency loss is:

$$\mathcal{L}_{cons}(\theta'_m, x_i) = D_{KL}\bigl(\bar{s}_i \,\|\, f(x_i, \theta'_m)\bigr), \qquad (11)$$

$$\bar{s}_i = \omega f(x_i, \tilde{\theta}) + (1 - \omega) f(x_i, \theta^*), \qquad (12)$$

where $\omega$ is a weight that controls the relative importance of the teacher model and the mentor model. It is ramped up from 0 to 0.5 as training proceeds.
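The merged target of Equations 11-12 is a small change to the earlier consistency loss; a sketch:

```python
import torch.nn.functional as F

def merged_consistency_loss(updated_logits, teacher_logits, mentor_logits, w):
    """Eqs. 11-12: KL(s_bar || f(x, theta'_m)), where the target s_bar mixes
    teacher and mentor predictions; w is ramped from 0 to 0.5 during training."""
    s_bar = (w * F.softmax(teacher_logits, dim=1)
             + (1.0 - w) * F.softmax(mentor_logits, dim=1)).detach()
    return F.kl_div(F.log_softmax(updated_logits, dim=1), s_bar,
                    reduction='batchmean')
```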

We train for three iterations in our experiments. The mentor model is the best model from the previous iteration. We observe that training iterations beyond the third do not give noticeable performance improvement.

IV Experiments

Method | r = 0 | r = 0.1 | r = 0.3 | r = 0.5 | r = 0.7 | r = 0.9
Cross Entropy [23] | 93.5 | 91.0 | 88.4 | 85.0 | 78.4 | 41.1
Cross Entropy (reproduced) | 91.84±0.05 | 90.33±0.06 | 87.85±0.08 | 84.62±0.08 | 78.06±0.16 | 45.85±0.91
Joint Optimization [23] | 93.4 | 92.7 | 91.4 | 89.6 | 85.9 | 58.0
MLNT-student (1st iter.) | 93.18±0.07 | 92.16±0.05 | 90.57±0.08 | 87.68±0.06 | 81.96±0.19 | 55.45±1.11
MLNT-teacher (1st iter.) | 93.21±0.07 | 92.43±0.05 | 91.06±0.07 | 88.43±0.05 | 83.27±0.22 | 57.39±1.13
MLNT-student (2nd iter.) | 93.24±0.09 | 92.63±0.07 | 91.99±0.13 | 89.71±0.07 | 86.28±0.19 | 58.21±1.09
MLNT-teacher (2nd iter.) | 93.35±0.07 | 92.91±0.09 | 91.89±0.06 | 90.03±0.08 | 86.24±0.18 | 58.33±1.10
MLNT-student (3rd iter.) | 93.29±0.08 | 92.91±0.10 | 92.02±0.09 | 90.27±0.10 | 86.95±0.17 | 58.57±1.12
MLNT-teacher (3rd iter.) | 93.52±0.08 | 93.24±0.08 | 92.50±0.07 | 90.65±0.09 | 87.11±0.19 | 59.09±1.12
TABLE I: Classification accuracy (%) on the CIFAR-10 test set for different methods trained with symmetric label noise (noise ratio $r$). We report the mean and standard error across 5 runs.

IV-A Datasets

We conduct experiments on two datasets, namely CIFAR-10 [9] and Clothing1M [30]. We follow the same experimental setting as previous studies [16, 23, 26] for fair comparison.

For CIFAR-10, we split 10% of the training data for validation, and artificially corrupt the rest of the training data with two types of label noise: symmetric and asymmetric. Symmetric label noise is injected by replacing the ground-truth label of a sample with a random one-hot vector with probability $r$. Asymmetric label noise is designed to mimic some of the structure of real mistakes for similar classes [16]: TRUCK → AUTOMOBILE, BIRD → AIRPLANE, DEER → HORSE, CAT ↔ DOG. Label transitions are parameterized by $r$ such that the true class and the wrong class have probability $1-r$ and $r$, respectively.
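For reference, a sketch of how such noise can be injected, assuming the standard CIFAR-10 class ordering (airplane=0, automobile=1, bird=2, cat=3, deer=4, dog=5, frog=6, horse=7, ship=8, truck=9); function names are ours:

```python
import numpy as np

def inject_symmetric_noise(labels, r, num_classes=10, seed=0):
    """Replace each label with a uniformly random class with probability r."""
    rng = np.random.RandomState(seed)
    noisy = labels.copy()
    flip = rng.rand(len(labels)) < r
    noisy[flip] = rng.randint(num_classes, size=flip.sum())
    return noisy

# truck->automobile, bird->airplane, deer->horse, cat<->dog
ASYM_MAP = {9: 1, 2: 0, 4: 7, 3: 5, 5: 3}

def inject_asymmetric_noise(labels, r, seed=0):
    """With probability r, map a label to its visually similar class."""
    rng = np.random.RandomState(seed)
    noisy = labels.copy()
    for i, y in enumerate(labels):
        if y in ASYM_MAP and rng.rand() < r:
            noisy[i] = ASYM_MAP[y]
    return noisy
```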

Clothing1M [30] consists of 1M images collected from online shopping websites, which are classified into 14 classes, e.g., t-shirt, sweater, jacket. The labels are generated from the surrounding texts provided by sellers and thus contain real-world errors. We use the official clean validation and test sets for validation and test, respectively, but we do not use the clean training data.

Method | r = 0.1 | r = 0.2 | r = 0.3 | r = 0.4 | r = 0.5
Cross Entropy [23] | 91.8 | 90.8 | 90.0 | 87.1 | 77.3
Cross Entropy (reproduced) | 91.04±0.07 | 90.19±0.09 | 88.88±0.06 | 86.34±0.22 | 77.48±0.79
Forward [16] | 92.4 | 91.4 | 91.0 | 90.3 | 83.8
CNN-CRF [26] | 92.0 | 91.5 | 90.7 | 89.5 | 84.0
Joint Optimization [23] | 93.2 | 92.7 | 92.4 | 91.5 | 84.6
MLNT-student (1st iter.) | 92.89±0.11 | 91.84±0.10 | 90.55±0.09 | 88.70±0.13 | 79.95±0.71
MLNT-teacher (1st iter.) | 93.05±0.10 | 92.19±0.09 | 91.47±0.04 | 88.69±0.08 | 78.44±0.45
MLNT-student (2nd iter.) | 93.01±0.12 | 92.65±0.11 | 91.87±0.12 | 90.60±0.12 | 81.53±0.66
MLNT-teacher (2nd iter.) | 93.33±0.13 | 92.97±0.11 | 92.43±0.19 | 90.93±0.15 | 81.47±0.54
MLNT-student (3rd iter.) | 93.36±0.14 | 92.98±0.13 | 92.59±0.10 | 91.87±0.12 | 82.25±0.68
MLNT-teacher (3rd iter.) | 93.61±0.10 | 93.25±0.12 | 92.82±0.18 | 92.30±0.10 | 82.09±0.47
TABLE II: Classification accuracy (%) on the CIFAR-10 test set for different methods trained with asymmetric label noise (noise ratio $r$). We report the mean and standard error across 5 runs.

IV-B Implementation

For experiments on CIFAR-10, we follow the same experimental setting as [23] and use a network based on PreAct ResNet-32 [7]. Following common practice [23], we normalize the images, and perform data augmentation by random horizontal flipping and random cropping after padding 4 pixels per side. We use a batch size of $k = 128$, and update $\theta$ using SGD with a momentum of 0.9 and weight decay; the step size $\alpha$ and learning rate $\lambda$, like all other hyper-parameters, are determined via validation. For each training iteration, we divide the learning rate by 10 after 80 epochs, and train until 120 epochs. For the initial iteration, we ramp up the meta-learning rate $\beta$ from 0 to 0.4 during the first 20 epochs, and keep $\beta = 0.4$ for the rest of the training. In terms of the EMA decay $\gamma$, we use a smaller value for the first 20 epochs and a larger value later on, because the student improves quickly early in the training, and thus the teacher should have a shorter memory [24]. In the ablation study (Section IV-D), we show the effect of three important hyper-parameters: $M$, the number of synthetic mini-batches; $\rho$, the number of samples with label replacement; and $\tau$, the threshold for data filtering.

For experiments on Clothing1M, we follow previous works [23, 16] and use a ResNet-50 [6] pre-trained on ImageNet. For preprocessing, we resize each image, crop the center region as input, and perform normalization. We again update $\theta$ using SGD with a momentum of 0.9 and weight decay, and train for 3 epochs per iteration. During the first 2000 mini-batches of the initial iteration, we ramp up the meta-learning rate $\beta$ from 0 and use a smaller EMA decay $\gamma$; for the rest of the training we keep $\beta$ fixed and use a larger $\gamma$. The remaining hyper-parameters $M$, $\rho$, and $\tau$ are likewise determined via validation.

IV-C Experiments on CIFAR-10

We compare the proposed MLNT with multiple baseline methods on the CIFAR-10 dataset with symmetric label noise (noise ratio $r \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$) and asymmetric label noise (noise ratio $r \in \{0.1, 0.2, 0.3, 0.4, 0.5\}$). The baselines include:

(a) Training accuracy on noisy training data

(b) Test accuracy on clean test data
Fig. 3: Progressive performance comparison of the proposed MLNT and Cross Entropy as training proceeds.
Fig. 4: Performance of MLNT-student (1st iter.) on CIFAR-10 trained with different numbers of synthetic mini-batches $M$.

(1) Cross Entropy: conventional training without the meta-learning update. We report both the results from [23] and from our own implementation.

(2) Forward [16]: forward loss correction based on the noise transition matrix.

(3) CNN-CRF [26]: models the relationship between noisy and clean labels with a conditional random field. It requires a small set of clean labels during training.

(4) Joint Optimization [23]: alternately updates network parameters and corrects labels during training.

Both Forward [16] and CNN-CRF [26] require the ground-truth noise transition matrix, and Joint Optimization [23] requires the ground-truth class distribution among the training data. Our method does not require any prior knowledge about the data and is thus more general. Note that all baselines use the same network architecture as our method. We report the numbers published in [23].

Table I and Table II show the results for symmetric and asymmetric label noise, respectively. Our implementation of Cross Entropy has lower overall accuracy compared to [23]. The reason could be the different programming frameworks used (we use PyTorch [15], whereas [23] used Chainer [25]). For both types of noise, the proposed MLNT method with one training iteration significantly improves accuracy compared to Cross Entropy (reproduced), and achieves comparable performance to state-of-the-art methods. Iterative training further improves the performance. MLNT-teacher after three iterations significantly outperforms previous methods. An exception where MLNT does not outperform the baselines is 50% asymmetric label noise. This is because the asymmetric label noise is generated by exchanging the CAT and DOG classes, and it is theoretically impossible to distinguish them without prior knowledge when the noise ratio is 50%.

In Table I, we also show the results with clean training data ($r = 0$). The proposed MLNT achieves an improvement in accuracy over Cross Entropy (93.52% for MLNT-teacher after three iterations vs. 91.84% for our reproduced Cross Entropy), which shows the regularization effect of the proposed meta-objective.

IV-D Ablation Study

Progressive Comparison. Figure 3 plots the model's accuracy on the noisy training data and its test accuracy on the clean test data as training proceeds. We show a representative training process using asymmetric label noise. Accuracy is calculated every epoch, and training accuracy is computed across all mini-batches within the epoch. Comparing the proposed MLNT methods (1st iter.) with Cross Entropy, MLNT learns more quickly at the beginning of training, as shown by its higher test accuracy. Cross Entropy has the most unstable training process, as shown by its fluctuating test accuracy curve. MLNT-student is more stable because of the regularization from the meta-objective, whereas MLNT-teacher is extremely stable because its parameters change smoothly during training. At the 80th epoch, the learning rate is divided by 10, which causes a drastic increase in both training and test accuracy for MLNT-student and Cross Entropy. After the 80th epoch, the model begins to overfit because of the small learning rate. However, the proposed MLNT-student suffers less overfitting compared to Cross Entropy, as shown by its lower training accuracy and higher test accuracy.

Hyper-parameters. We conduct an ablation study to examine the effect of three hyper-parameters: $M$, $\rho$, and $\tau$. $M$ is the number of mini-batches with synthetic noisy labels that we generate for each mini-batch from the original training data. Intuitively, with larger $M$, the model is exposed to a wider variety of label noise, and thus can learn to be more noise-tolerant. In Figure 4 we show the test accuracy on CIFAR-10 for MLNT-student (1st iter.) with $M \in \{0, 5, 10, 15\}$ ($M = 0$ is the same as Cross Entropy) trained using labels with symmetric noise (SN) and asymmetric noise (AN) of different ratios. The result shows that the accuracy indeed increases as $M$ increases. The increase is most significant when $M$ changes from 0 to 5, and is marginal when $M$ changes from 10 to 15. Therefore, the experiments in this paper are conducted using $M = 10$, as a trade-off between the training speed and the model's performance.

Fig. 5: Performance of MLNT-student (1st iter.) on CIFAR-10 trained with asymmetric label noise using different values of $\rho$.

$\rho$ is the number of samples whose labels are changed in each synthetic mini-batch of size $k$. We experiment with several values of $\rho$ with a batch size of 128. Figure 5 shows the performance of MLNT-student (1st iter.) using different $\rho$, trained on CIFAR-10 with different ratios of asymmetric label noise. The performance is insensitive to the value of $\rho$. For different noise ratios, the optimal $\rho$ generally falls into a similar range.

Fig. 6: Performance of MLNT-student (2nd iter.) on CIFAR-10 trained with asymmetric label noise using different values of $\tau$.
Fig. 7: Example images from Clothing1M that are filtered out by the mentor model. We show the ground-truth label (red) and the label predicted by the mentor (blue) with their corresponding probability scores.

$\tau$ is the threshold that determines which samples are filtered out by the mentor model during the 2nd and 3rd training iterations. It controls the balance between the quality and quantity of the data used by the classification loss. In Figure 6 we show the performance of MLNT-student (2nd iter.) trained using different values of $\tau$. As the noise ratio increases, the optimal value of $\tau$ also increases to filter out more samples. Figure 7 shows some example images from the Clothing1M dataset that are filtered out, together with their corresponding probability scores given by the mentor model.

Full optimization. We have been using a first-order approximation to optimize $\mathcal{L}_{meta}$ for faster computation. Here we conduct experiments using full optimization, including the second-order derivatives with respect to $\theta$. Table III shows the comparison on four representative sets of experiments with different label noise. We show the test accuracy (averaged across 5 runs) of MLNT-student (1st iter.) trained with full optimization and with the first-order approximation. The result shows that the performance of the first-order approximation is nearly the same as that obtained with full second-order derivatives. This suggests that the improvement of MLNT mostly comes from the gradients of the meta loss at the updated parameter values, rather than from the second-order gradients.

Optimization | SN 0.3 | SN 0.7 | AN 0.2 | AN 0.4
First-order approx. | 90.57 | 81.96 | 91.84 | 88.70
Full | 90.74 | 82.05 | 91.89 | 88.91
TABLE III: Test accuracy (%) on CIFAR-10 for MLNT-student (1st iter.) with full optimization of the meta loss and with its first-order approximation.
Method | Accuracy
#1 Cross Entropy [23] | 69.15
Cross Entropy (reproduced) | 69.28
#2 Forward [16] | 69.84
#3 Joint Optimization [23] | 72.16
MLNT-student (1st iter.) | 72.34
MLNT-teacher (1st iter.) | 72.08
MLNT-student (2nd iter.) | 73.13
MLNT-teacher (2nd iter.) | 73.10
MLNT-student (3rd iter.) | 73.44
MLNT-teacher (3rd iter.) | 73.47
TABLE IV: Classification accuracy (%) of different methods on the Clothing1M test set.

IV-E Experiments on Clothing1M

We demonstrate the efficacy of the proposed method on real-world noisy labels using the Clothing1M dataset. The results are shown in Table IV. We show the accuracy for baselines #1 and #3 as reported in [23], and the accuracy for #2 as reported in [16]. The proposed MLNT method with one training iteration achieves better performance compared to state-of-the-art methods. After three training iterations, MLNT achieves a significant improvement in accuracy of 4.19% over Cross Entropy (reproduced), and an improvement of 1.31% over the best baseline method #3.

V Conclusion

In this paper, we propose a meta-learning method to learn from noisy labeled data, where a meta-learning update is performed prior to the conventional gradient update. The proposed meta-objective aims to find noise-tolerant model parameters that are less prone to overfitting. In the meta-train step, we generate multiple mini-batches with synthetic noisy labels, and use them to update the parameters. In the meta-test step, we apply a consistency loss between each updated model and a teacher model, and train the original parameters to minimize the total consistency loss. In addition, we propose an iterative training scheme, where the model from the previous iteration is used to clean data and refine predictions. We evaluate the proposed method on two datasets. The results validate the advantageous performance of our method compared to state-of-the-art methods. Our code will be available at www.url.com. For future work, we plan to explore the application of the proposed model-agnostic training method to other domains with different model architectures, such as learning recurrent neural networks for machine translation with corrupted ground-truth sentences.

References

  • [1] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pages 1126–1135, 2017.
  • [2] B. Frénay and M. Verleysen. Classification in the presence of label noise: A survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845–869, 2014.
  • [3] A. Ghosh, H. Kumar, and P. S. Sastry. Robust loss functions under label noise for deep neural networks. In AAAI, pages 1919–1925, 2017.
  • [4] A. Ghosh, N. Manwani, and P. S. Sastry. Making risk minimization tolerant to label noise. Neurocomputing, 160:93–107, 2015.
  • [5] J. Goldberger and E. Ben-Reuven. Training deep neural-networks using a noise adaptation layer. In ICLR, 2017.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, pages 630–645, 2016.
  • [8] L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, pages 2309–2318, 2018.
  • [9] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto, 2009.
  • [10] S. Laine and T. Aila. Temporal ensembling for semi-supervised learning. In ICLR, 2017.
  • [11] K. Lee, X. He, L. Zhang, and L. Yang. Cleannet: Transfer learning for scalable image classifier training with label noise. In CVPR, pages 5447–5456, 2018.
  • [12] D. Li, Y. Yang, Y. Song, and T. M. Hospedales. Learning to generalize: Meta-learning for domain generalization. In AAAI, 2018.
  • [13] Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, and L. Li. Learning from noisy labels with distillation. In ICCV, pages 1928–1936, 2017.
  • [14] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: common objects in context. In ECCV, pages 740–755, 2014.
  • [15] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS Workshop, 2017.
  • [16] G. Patrini, A. Rozza, A. K. Menon, R. Nock, and L. Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, pages 2233–2241, 2017.
  • [17] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
  • [18] S. E. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596, 2014.
  • [19] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li. Imagenet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
  • [20] M. Sajjadi, M. Javanmardi, and T. Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In NIPS, pages 1163–1171, 2016.
  • [21] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • [22] S. Sukhbaatar, J. Bruna, M. Paluri, L. Bourdev, and R. Fergus. Training convolutional networks with noisy labels. In ICLR Workshop, 2015.
  • [23] D. Tanaka, D. Ikami, T. Yamasaki, and K. Aizawa. Joint optimization framework for learning with noisy labels. In CVPR, pages 5552–5560, 2018.
  • [24] A. Tarvainen and H. Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS, pages 1195–1204, 2017.
  • [25] S. Tokui, K. Oono, S. Hido, and J. Clayton. Chainer: a next-generation open source framework for deep learning. In NIPS Workshop, 2015.
  • [26] A. Vahdat. Toward robustness against label noise in training deep discriminative neural networks. In NIPS, pages 5601–5610, 2017.
  • [27] B. van Rooyen, A. K. Menon, and R. C. Williamson. Learning with symmetric label noise: The importance of being unhinged. In NIPS, pages 10–18, 2015.
  • [28] A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta, and S. J. Belongie. Learning from noisy large-scale datasets with minimal supervision. In CVPR, pages 6575–6583, 2017.
  • [29] Y. Wang, W. Liu, X. Ma, J. Bailey, H. Zha, L. Song, and S. Xia. Iterative learning with open-set noisy labels. In CVPR, pages 8688–8696, 2018.
  • [30] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang. Learning from massive noisy labeled data for image classification. In CVPR, pages 2691–2699, 2015.
  • [31] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.