1 Introduction
Deep neural networks (DNNs) have recently achieved impressive performance on various applications due to their powerful capacity for modeling complex input patterns. Despite their success, deep networks can easily overfit to biased training data containing corrupted labels (Zhang et al., 2017), leading to poor generalization in such cases. This issue has been analyzed theoretically in several recent studies (Neyshabur et al., 2017; Arpit et al., 2017; Kawaguchi et al., 2017; Novak et al., 2018).
In practice, however, corrupted labels are commonly encountered because of the high cost in labor and time of obtaining high-quality data annotations (Sukhbaatar & Fergus, 2015; Azadi et al., 2016; Goldberger & BenReuven, 2017; Li et al., 2017; Vahdat, 2017; Hendrycks et al., 2018; Han et al., 2018; Zhang & Sabuncu, 2018). A typical example is a dataset collected from a crowdsourcing system (Bi et al., 2014) or search engines (Liang et al., 2016; Zhuang et al., 2017), which is likely to contain a large number of noisy labels. Effective learning with such corrupted labels, which can be regarded as biased with respect to the ground-truth ones, is thus an important yet challenging issue in machine learning (Jiang et al., 2018; Ren et al., 2018).
The sample weighting approach is commonly used to address this issue. The main methodology is to design weighting schemes for individual samples based on specific tasks and models. Early attempts include dataset resampling (Chawla et al., 2002) and instance reweighting (Zadrozny, 2004), which impose sample weights using proper prior knowledge and minimize the weighted loss on training samples. Later, multiple reweighting methods were proposed that dynamically update the sample weights during learning to improve on such preset weighting. The main cue comes from the loss values of samples during training. These methods can be divided into two categories. The first emphasizes samples with larger loss values, since they are more likely to be uncertain hard samples located on the classification boundary, which carry more information for distinguishing classes. Typical methods include AdaBoost (Freund & Schapire, 1997; Sun et al., 2007), hard negative mining (Malisiewicz et al., 2011) and focal loss (Lin et al., 2018). The second treats samples with smaller loss values as more important, since they tend to be high-confidence samples with clean labels, and suppresses the effect of samples with extremely large loss values. Typical methods include self-paced learning (Kumar et al., 2010), iterative reweighting (Fernando & Mkchael, 2003; Zhang & Sabuncu, 2018) and multiple variants (Jiang et al., 2014a, b; Wang et al., 2017). Such reweighting strategies make sample weight evaluation more flexible and extend the feasible domains beyond the initial sample weighting schemes. These methods, however, still need certain assumptions for constructing models, and inevitably involve hyperparameters that must be manually preset or tuned by cross-validation. This tends to make them harder to apply and less stable in performance on real problems.
Very recently, the metalearning regime has become a new trend for this issue. The basic idea is first to collect a small amount of unbiased metadata with clean labels to simulate the underlying metaknowledge of the true sample-label distribution, and then to design a metalearner based on this metadata, on top of the learner constructed on the original, much larger, biased training dataset with noisy labels. The sample weights are then determined by the metalearner, which takes the sample loss of the learner as input. In this metalearning paradigm, the tuning of hyperparameters becomes automatic and is embedded into the learning process. However, to guarantee a strong capability for hyperparameter learning, these methods generally need a metalearner with a complex structure. Representative examples include the FWL (Dehghani et al., 2018), learning to teach (Fan et al., 2018; Wu et al., 2018) and MentorNet (Jiang et al., 2018) methods, whose metalearners are designed as a Bayesian function approximator, a deep neural network with an attention mechanism, and a bidirectional LSTM network, respectively. This makes these metalearning algorithms hard for general users to fully reproduce and understand.
To alleviate this issue, this paper proposes a new metalearning method. The main idea is to progressively rectify the sample weights with the metalearner, so as to gradually correct the gradient of the weighted loss computed on the biased training set toward the descent direction of the loss calculated on the unbiased metadata. In summary, the method has the following three characteristics.
Firstly, the metalearner used in our method, called VNet, is an MLP with only one hidden layer, whose form is much simpler than those used in other metalearning methods. All updating equations for the parameters of the metalearner and the classifier are in closed form and explicitly expressed (as illustrated in Fig. 1). The code of the algorithm is thus easy to reproduce.
Secondly, the working mechanism of the proposed metalearning method closely follows the process by which a good teacher educates a student, as described in the quote at the beginning of the paper from the famous journalist and anchor Dan Rather. Specifically, under the guidance of metaknowledge (i.e., unbiased metadata) possessed only by the teacher, the teacher progressively corrects the learning manner of the student (i.e., the gradient of the biased training loss) toward the right direction, oriented to the truth (i.e., the descent direction of the metaloss). In this progressive learning process, the teacher and the student improve together (i.e., the parameters of both the metalearner and the classifier are gradually optimized). We expect such an intuitive explanation to make the idea of our method easy to understand for general readers.
Thirdly, the reason why the proposed algorithm works can be well interpreted. In particular, the closed-form updating equation for the parameter of the metalearner can be explained as increasing the weights of samples that comply with the metadata knowledge, while suppressing the weights of those that violate it. This agrees with common sense on the problem: we should emphasize samples that are unbiased with respect to the underlying ground-truth label distribution, while reducing the influence of highly biased ones.
Experimental results substantiate the robustness of the proposed method to corrupted labels. In particular, we show that the sample weights are improved in a stable manner, which lets our algorithm better retain the historical learning knowledge underlying the sample weights as well as the parameters of the metalearner and the classifier.
2 Related Work
Sample Weighting Methods. The idea of weighting training samples dates back to dataset resampling (Chawla et al., 2002) and instance reweighting (Zadrozny, 2004), which evaluate the sample weights in a preprocessing step by using certain prior knowledge on the task or data. To make the method more flexible and the sample weights fit the data better, more recent research has focused on dynamic sample reweighting regimes that adjust the weights during training. Typical reweighting methods include the boosting algorithm and its many variations, such as the well-known AdaBoost method (Freund & Schapire, 1997), which dynamically emphasizes relatively harder examples and imposes larger weights on those with larger loss values. Hard example mining (Malisiewicz et al., 2011) and focal loss (Lin et al., 2018) also strengthen the role of harder examples and put larger weights on them. In contrast, another series of methods, like self-paced learning (Kumar et al., 2010), focuses more on easy samples with smaller losses. Multiple extensions of self-paced learning were later presented that incorporate more knowledge of practical tasks (Jiang et al., 2014a, b). The iterative reweighting strategy (Fernando & Mkchael, 2003; Zhang & Sabuncu, 2018) also tends to emphasize easy samples while suppressing hard ones, which possibly carry corrupted labels. Other methods along this line include the prediction variance method (Chang et al., 2017), which emphasizes uncertain samples to improve minibatch SGD for classification, and the method given in (Wang et al., 2017), which infers the example weights as latent variables under an elaborately designed Bayesian framework. Although these methods more or less alleviate the corrupted label issue, they still need certain assumptions in their design, and naturally involve hyperparameters that must be manually preset or tuned by cross-validation. This makes them difficult to use readily in real applications.
Meta Learning Methods. Inspired by recent metalearning developments for few-shot learning (Lake et al., 2015; Shu et al., 2018; Ravi & Larochelle, 2017; Finn et al., 2017; Snell et al., 2017), where only a handful of examples are available for predicting classes, some metalearning regimes have recently been proposed for this issue to make hyperparameter learning more automatic and reliable. Typical methods along this line include the FWL (Dehghani et al., 2018), learning to teach (Fan et al., 2018; Wu et al., 2018) and MentorNet (Jiang et al., 2018) methods, whose metalearners are designed as a Bayesian function approximator, a deep neural network with an attention mechanism, and a bidirectional LSTM network, respectively. Another method, called L2RW (learning to reweight) (Ren et al., 2018), does not explicitly assume a metalearner and instead directly updates the weights from zeros in each metalearning iteration. For the former three methods, the metalearners have relatively complex structures and their insights are hard to interpret fully. In comparison, our method uses a much simpler metalearner, which makes its code easier to reproduce and understand. The L2RW method, however, does not use a metalearner to guide the weight learning, which tends to make its learned weights unstable across the learning process and its convergence slower. In comparison, our method learns weights in a more stable manner, which makes it converge in fewer iterations, as clearly shown in Fig. 3 and 4.
Other Methods for Corrupted Labels. Learning with corrupted labels has recently attracted increasing attention in both theory and practice (Natarajan et al., 2013; Van Rooyen & Williamson, 2018). For completeness, we also list some other methods proposed for this task. Multiple methods correct noisy labels to their true labels via a supplementary clean label inference step (Azadi et al., 2016; Vahdat, 2017; Veit et al., 2017; Li et al., 2017; Jiang et al., 2018; Ren et al., 2018; Hendrycks et al., 2018). A typical method is that of Li et al. (2017), which distills knowledge from clean labels and a knowledge graph that can be exploited to learn a better model from noisy labels. More recently, GLC (Hendrycks et al., 2018) proposed a loss correction approach to mitigate the effects of label noise on deep neural network classifiers. Some other methods are designed to learn directly from corrupted labels. Typical ones include the Reed (Reed et al., 2015), Coteaching (Han et al., 2018), D2L (Ma et al., 2018) and SModel (Goldberger & BenReuven, 2017) methods. We compare against these methods in our experiments to demonstrate the advantages of the proposed method.
3 The Proposed Metalearning Method
In this section, we introduce the methodology of the proposed progressive gradient correcting method by metalearner (MetaPGC in brief), and its algorithm as well as its convergence analysis. To make all steps of the algorithm easily understandable, we introduce its implementations along with their corresponding interpretations from the perspective of the education process of a student by a teacher.
3.1 The MetaPGC Method
Consider a classification problem with the training set $\{x_i, y_i\}_{i=1}^{N}$, where $x_i$ denotes the $i$-th sample, $y_i \in \{0,1\}^{c}$ is the noisy label vector over $c$ classes, and $N$ is the number of training samples. $f(x; w)$ denotes the classifier to be learned, where $w$ denotes the model parameters. In current applications, $f(x; w)$ is usually set as a network. We thus also adopt a network structure for the classifier, and call it the student network for convenience in the following.
Generally, the optimal classifier parameter $w^{*}$ can be extracted by minimizing the loss calculated on the training set. For notational convenience, we denote $L_i^{train}(w) = \ell(y_i, f(x_i; w))$.
In the presence of corrupted labels, sample reweighting methods enhance the robustness of training by imposing a weight $V(L_i^{train}(w); \Theta)$ on the $i$-th sample loss, where $\Theta$ represents the hyperparameters contained in the weighting function. The optimal parameter of the student network is calculated by the following weighted loss minimization:
$$w^{*}(\Theta) = \arg\min_{w} \frac{1}{N} \sum_{i=1}^{N} V(L_i^{train}(w); \Theta)\, L_i^{train}(w).$$ (1)
VNet: Instead of manual presetting, our method aims to learn the hyperparameter $\Theta$ automatically in a metalearning manner. To this aim, we formulate $V(L_i^{train}(w); \Theta)$ as a neural network, called VNet, which is an MLP with only one hidden layer containing 100 nodes (i.e., the network has a 1-100-1 structure), as shown in Fig. 1. Each hidden node uses the ReLU activation function, and the output uses the Sigmoid activation function to guarantee that the output lies in the interval $[0, 1]$. Compared with the metalearners designed in current metalearning methods (Dehghani et al., 2018; Fan et al., 2018; Wu et al., 2018; Jiang et al., 2018), VNet has a simpler form and fewer parameters.
Meta learning. The hyperparameters contained in VNet can be optimized by using the meta learning idea (Wu et al., 2018; Andrychowicz et al., 2016; Dehghani et al., 2017; Franceschi et al., 2018). Specifically, assume that we have a small unbiased metadata set (i.e., with clean labels) $\{x_i^{(meta)}, y_i^{(meta)}\}_{i=1}^{M}$, representing the metaknowledge of the ground-truth sample-label distribution, where $M$ is the number of meta-samples and $M \ll N$. The optimal hyperparameter $\Theta^{*}$ can then be obtained by minimizing the metaloss calculated on the metadata as follows:
$$\Theta^{*} = \arg\min_{\Theta} \frac{1}{M} \sum_{i=1}^{M} L_i^{meta}(w^{*}(\Theta)),$$ (2)
where $L_i^{meta}(w) = \ell(y_i^{(meta)}, f(x_i^{(meta)}; w))$.
By analogy with the education process, we can understand Eq. (1) as the learning manner of a student, the weight function (i.e., VNet) as the correction that a teacher imposes on this learning manner (putting more emphasis on valuable samples), and Eq. (2) as the self-rectifying process of the teacher (i.e., the metalearner) based on the teaching effect fed back from the student.
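As an illustration of the weighting function described above, the following minimal Python sketch evaluates an MLP with one hidden layer of 100 ReLU nodes and a Sigmoid output on a scalar loss value. The names `init_vnet` and `vnet_weight`, the dictionary layout and the random initialization are assumptions made for this example and are not taken from the authors' code.

```python
import math
import random

def init_vnet(hidden=100, seed=0):
    """Randomly initialize a 1-hidden-1 MLP (initialization scheme is an assumption)."""
    rng = random.Random(seed)
    return {
        "w1": [rng.uniform(-0.1, 0.1) for _ in range(hidden)],  # input -> hidden
        "b1": [0.0] * hidden,
        "w2": [rng.uniform(-0.1, 0.1) for _ in range(hidden)],  # hidden -> output
        "b2": 0.0,
    }

def vnet_weight(loss, params):
    """Map a scalar sample loss to a weight in (0, 1)."""
    hidden = [max(0.0, w * loss + b)                      # ReLU hidden nodes
              for w, b in zip(params["w1"], params["b1"])]
    z = sum(w * h for w, h in zip(params["w2"], hidden)) + params["b2"]
    return 1.0 / (1.0 + math.exp(-z))                     # Sigmoid output

params = init_vnet()
weights = [vnet_weight(loss, params) for loss in (0.05, 1.0, 4.0)]
print(weights)
```

With 100 hidden nodes this network has 100 + 100 + 100 + 1 = 301 trainable parameters, which gives a sense of how small such a metalearner is.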
Formulating learning manner of student network: To get the updating equation for the parameter of the student network, we need to calculate the gradient of the training loss objective in Eq. (1) so as to improve the student network by gradient descent. As is common in network training, we employ SGD for this task. Specifically, in each iteration of training, a minibatch of training samples $\{(x_i, y_i), 1 \le i \le n\}$ is sampled, where $n$ is the minibatch size and $n \ll N$. The updating equation of the student network parameter can then be formulated by moving the current $w^{(t)}$ along the descent direction of the objective loss in Eq. (1) on the minibatch of training data, i.e.,
$$\hat{w}^{(t)}(\Theta) = w^{(t)} - \alpha \frac{1}{n} \sum_{i=1}^{n} V(L_i^{train}(w^{(t)}); \Theta)\, \nabla_{w} L_i^{train}(w)\Big|_{w^{(t)}},$$ (3)
where $\alpha$ is the step size.
From the perspective of education, this step can be explained as the student attempting to explore the best learning manner for acquiring the expected knowledge purely from the training data, based on his/her previous learning experience $w^{(t)}$. Note that the learning manner is parameterized by the hyperparameter $\Theta$ of the teacher; it is fed back to the teacher and then further improved by the teacher.
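Updating parameters of VNet: After receiving the feedback of the model parameter updating formulation $\hat{w}^{(t)}(\Theta)$ from the last step, the parameter $\Theta$ of the VNet can be readily updated under the guidance of Eq. (2), i.e., by moving the current parameter $\Theta^{(t)}$ along the descent direction of the objective of Eq. (2) calculated on the metadata. Similar to the last step, the SGD scheme is adopted. The updating equation for $\Theta$ is then:
$$\Theta^{(t+1)} = \Theta^{(t)} - \beta \frac{1}{m} \sum_{i=1}^{m} \nabla_{\Theta} L_i^{meta}(\hat{w}^{(t)}(\Theta))\Big|_{\Theta^{(t)}},$$ (4)
where $\{(x_i^{(meta)}, y_i^{(meta)}), 1 \le i \le m\}$ denotes the minibatch of metadata sampled in the current step and $\beta$ is the step size.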
This step can be explained as that the teacher learns the right way for rectifying the learning manner of the student based on the metadata knowledge. Just as Dan said: the teacher tugs/pushes/leads the student to the next plateau, poking the student with a stick of truth.
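Updating parameters of student network: Then, the updated $\Theta^{(t+1)}$ can be readily passed to the updating equation (3) to correct the gradient used for improving the parameter of the student network, i.e.,
$$w^{(t+1)} = w^{(t)} - \alpha \frac{1}{n} \sum_{i=1}^{n} V(L_i^{train}(w^{(t)}); \Theta^{(t+1)})\, \nabla_{w} L_i^{train}(w)\Big|_{w^{(t)}}.$$ (5)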
This can be easily explained as the teacher helping to tug the student's learning toward a better direction, so that he/she achieves a better learning effect in the process.
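The three steps above can be sketched on a toy problem. The following pure-Python example is only an illustration under loud assumptions: the student is a two-parameter logistic regression rather than a deep network, the weight function is a two-parameter Sigmoid `sigmoid(theta[0] + theta[1] * loss)` standing in for VNet, and the teacher gradient is written out by hand with the chain rule instead of automatic differentiation. All function names, the data and the step sizes `alpha` and `beta` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grad(w, x, y):
    """Cross-entropy loss of a logistic-regression student and its gradient in w."""
    p = sigmoid(sum(wk * xk for wk, xk in zip(w, x)))
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() finite
    loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return loss, [(p - y) * xk for xk in x]

def weight_and_grad(loss, theta):
    """Stand-in weight function V(loss; theta) and its gradient in theta."""
    v = sigmoid(theta[0] + theta[1] * loss)
    return v, [v * (1.0 - v), v * (1.0 - v) * loss]

def weighted_step(w, grads, weights, alpha):
    """One SGD step on the weighted minibatch training loss."""
    n = len(grads)
    return [wk - alpha / n * sum(v * g[k] for v, g in zip(weights, grads))
            for k, wk in enumerate(w)]

def meta_pgc_step(w, theta, train_batch, meta_batch, alpha=0.5, beta=0.5):
    n, m = len(train_batch), len(meta_batch)
    stats = [loss_and_grad(w, x, y) for x, y in train_batch]
    losses = [s[0] for s in stats]
    grads = [s[1] for s in stats]
    v_stats = [weight_and_grad(l, theta) for l in losses]

    # Student look-ahead: weighted step with the current weight function.
    w_hat = weighted_step(w, grads, [v for v, _ in v_stats], alpha)

    # Teacher update: descend the meta loss at w_hat with respect to theta.
    meta_grads = [loss_and_grad(w_hat, x, y)[1] for x, y in meta_batch]
    g_meta = [sum(g[k] for g in meta_grads) / m for k in range(len(w))]
    coeffs = [sum(a * b for a, b in zip(g_meta, g)) for g in grads]
    theta_new = [t + alpha * beta / n
                 * sum(c * dv[k] for c, (_, dv) in zip(coeffs, v_stats))
                 for k, t in enumerate(theta)]

    # Student update: redo the weighted step with the corrected weights.
    new_weights = [weight_and_grad(l, theta_new)[0] for l in losses]
    return weighted_step(w, grads, new_weights, alpha), theta_new, new_weights

train_batch = [([1.0, 0.5], 1), ([-1.0, -0.3], 0), ([1.2, 0.4], 0)]  # third label is corrupted
meta_batch = [([0.9, 0.6], 1), ([-1.1, -0.2], 0)]                    # clean metadata
w, theta = [0.0, 0.0], [0.0, 0.0]
for _ in range(50):
    w, theta, weights = meta_pgc_step(w, theta, train_batch, meta_batch)
print(theta, weights)
```

In a real setting the hand-written teacher gradient would be replaced by automatic differentiation through the look-ahead step; the explicit form is used here only to keep the sketch free of dependencies.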
3.2 Analysis on the Weighting Scheme of VNet
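The computation of Eq. (3) and Eq. (5) can be easily implemented by automatic differentiation, and the computation of Eq. (4) can be tackled by the following derivation:
$$\frac{1}{m} \sum_{i=1}^{m} \nabla_{\Theta} L_i^{meta}(\hat{w}^{(t)}(\Theta))\Big|_{\Theta^{(t)}} = -\frac{\alpha}{n} \sum_{j=1}^{n} \left( \frac{1}{m} \sum_{i=1}^{m} G_{ij} \right) \frac{\partial V(L_j^{train}(w^{(t)}); \Theta)}{\partial \Theta}\Big|_{\Theta^{(t)}},$$ (6)
where
$$G_{ij} = \frac{\partial L_i^{meta}(\hat{w})}{\partial \hat{w}}\Big|_{\hat{w}^{(t)}}^{T} \frac{\partial L_j^{train}(w)}{\partial w}\Big|_{w^{(t)}}.$$ (7)
More details on calculating Eq. (6) are presented in the supplementary material.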
Some interesting observations can be drawn from Eq. (4), which help explain why our weighting scheme works on samples with biased labels. By substituting Eq. (6) and (7) into Eq. (4), we get:
$$\Theta^{(t+1)} = \Theta^{(t)} + \frac{\alpha \beta}{n} \sum_{j=1}^{n} \left( \frac{1}{m} \sum_{i=1}^{m} G_{ij} \right) \frac{\partial V(L_j^{train}(w^{(t)}); \Theta)}{\partial \Theta}\Big|_{\Theta^{(t)}}.$$ (8)
Neglecting the coefficient $\frac{1}{m} \sum_{i=1}^{m} G_{ij}$, it is easy to see that each term in the sum points along the ascent gradient of the weight function $V(L_j^{train}(w^{(t)}); \Theta)$. The coefficient $\frac{1}{m} \sum_{i=1}^{m} G_{ij}$ imposed on the $j$-th gradient term represents the similarity between the gradient of the $j$-th training sample computed on the training loss and the average gradient of the minibatch of metadata computed on the meta loss. That is to say, if the learning gradient of a training sample is similar to that of the meta samples, the sample is considered beneficial for getting the right results and its weight tends to be increased. Conversely, the weight of the sample tends to be suppressed.
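The role of this similarity coefficient can be illustrated numerically. The sketch below, with made-up gradient vectors and a hypothetical helper `similarity_coefficients`, computes for each training sample the inner product between its training-loss gradient and the average meta-loss gradient. A positive value means the term pushes the sample weight up, and a negative value pushes it down.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def similarity_coefficients(train_grads, meta_grads):
    """Inner product of each training gradient with the mean meta gradient."""
    m = len(meta_grads)
    dim = len(meta_grads[0])
    mean_meta = [sum(g[k] for g in meta_grads) / m for k in range(dim)]
    return [dot(mean_meta, g) for g in train_grads]

# Illustrative values only: the first training gradient is aligned with the
# meta gradients, the second points the opposite way (as a mislabeled sample might).
meta_grads = [[-0.4, -0.3], [-0.6, -0.1]]
train_grads = [[-0.5, -0.25], [0.6, 0.2]]
coeffs = similarity_coefficients(train_grads, meta_grads)
print(coeffs)  # approximately [0.30, -0.34]
```

Because the inner product is linear, averaging the per-meta-sample products gives the same value as taking the product with the averaged meta gradient, which is what the helper does.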
Table 1: Test accuracy (%, mean±std over 5 repetitions) on CIFAR10 and CIFAR100 under uniform noise.
Dataset  Noise rate  BaseModel  ReedHard  SModel  Selfpaced  Focal Loss  MentorNet  Coteaching  D2L  Finetuning  L2RW  GLC  Ours
CIFAR10  0%  95.60±0.22  94.38±0.14  83.79±0.11  90.81±0.34  95.70±0.15  94.35±0.42  88.67±0.25  94.64±0.33  95.65±0.15  92.38±0.10  94.30±0.19  94.52±0.25
CIFAR10  40%  68.07±1.23  81.26±0.51  79.58±0.33  86.41±0.29  75.96±1.31  87.33±0.22  74.81±0.34  85.60±0.13  80.47±0.25  86.92±0.19  88.28±0.03  88.79±0.28
CIFAR10  60%  53.12±3.03  73.53±1.54  -  53.10±1.78  51.87±1.19  82.80±1.35  68.02±0.41  78.75±2.40  73.06±0.25  82.24±0.36  83.49±0.24  84.07±0.33
CIFAR100  0%  79.95±1.26  64.45±1.02  52.86±0.99  59.79±0.46  81.04±0.24  73.26±1.23  61.80±0.25  66.17±1.42  80.88±0.21  72.99±0.58  73.75±0.51  77.79±0.24
CIFAR100  40%  51.11±0.42  51.27±1.18  42.12±0.99  46.31±2.45  51.19±0.46  61.39±3.99  46.20±0.15  52.10±0.97  52.49±0.74  60.79±0.91  61.31±0.22  66.52±0.26
CIFAR100  60%  30.92±0.33  26.95±0.98  -  19.08±0.57  27.70±3.77  36.87±1.47  36.87±1.47  41.11±0.30  38.16±0.38  48.15±0.34  50.81±1.00  58.35±0.11
Table 2: Test accuracy (%, mean±std over 5 repetitions) on CIFAR10 and CIFAR100 under flip noise.
Dataset  Noise rate  BaseModel  ReedHard  SModel  Selfpaced  Focal Loss  MentorNet  Coteaching  D2L  Finetuning  L2RW  GLC  Ours
CIFAR10  0%  92.89±0.32  92.31±0.25  83.61±0.13  88.52±0.21  93.03±0.16  92.13±0.30  89.87±0.10  92.02±0.14  93.23±0.23  89.25±0.37  91.02±0.20  92.04±0.15
CIFAR10  20%  76.83±2.30  88.28±0.36  79.25±0.30  87.03±0.34  86.45±0.19  86.36±0.31  82.83±0.85  87.66±0.40  82.47±3.64  87.86±0.36  89.68±0.33  90.13±0.61
CIFAR10  40%  70.77±2.31  81.06±0.76  75.73±0.32  81.63±0.52  80.45±0.97  81.76±0.28  75.41±0.21  83.89±0.46  74.07±1.56  85.66±0.51  88.92±0.24  87.54±0.23
CIFAR100  0%  70.50±0.12  69.02±0.32  51.46±0.20  67.55±0.27  70.02±0.53  70.24±0.21  63.31±0.05  68.11±0.26  70.72±0.22  61.84±1.09  65.42±0.23  69.13±0.33
CIFAR100  20%  50.86±0.27  60.27±0.76  45.45±0.25  63.63±0.30  61.87±0.30  61.97±0.47  54.13±0.55  63.48±0.53  56.98±0.50  57.47±1.16  63.07±0.53  63.84±0.28
CIFAR100  40%  43.01±1.16  50.40±1.01  43.81±0.15  53.51±0.53  54.13±0.40  52.66±0.56  44.85±0.81  51.83±0.33  46.37±0.25  50.98±1.55  62.22±0.62  58.64±0.47
3.3 The MetaPGC Algorithm
The MetaPGC algorithm can then be summarized in Algorithm 1. Fig. 1 illustrates its main implementation process (steps 5-7) to help readers easily understand the flow of the algorithm. All steps of the algorithm have closed-form and explicit expressions. All gradient computations are implemented by automatic differentiation and can be generalized to any deep learning architecture for the student network. The algorithm can be easily implemented using popular deep learning frameworks like PyTorch. Since both the student network and the VNet gradually improve their parameters, $w$ and $\Theta$, respectively, from the values calculated in the last step (as clearly shown in Fig. 1), the weights can generally be updated in a stable manner, as shown in Fig. 4.
The convergence of the training loss can be obtained for SGD-based optimization methods, as shown in (Reddi et al., 2016). Here, we further prove the convergence of our method from the perspective of the monotonic decrease of the meta loss objective. The proof is given in the supplementary material.
Theorem 1.
Suppose the loss function
is Lipschitzsmooth in with constant , and is differentiable. And the loss function have bounded gradients with respect to training/meta data. Let the learning rate , where is the training batch size. Then the meta loss in Algorithm 1 always monotonically decreases for any sequence of training minibatches, namely,(9) 
4 Experimental Results
We evaluate the capability of our proposed algorithm and compare its performance with other state-of-the-art methods in this section.
4.1 Robustness against Corrupted Labels
We first test the robustness of our algorithm on several benchmark datasets with corrupted labels with varying noise rates.
Datasets. Two benchmark datasets are employed: CIFAR10 and CIFAR100 (Krizhevsky, 2009). Both datasets are widely used for evaluating learning with noisy labels in the current literature (Ma et al., 2018; Han et al., 2018).
We study two settings of corrupted labels on training set:

Uniform noise. The label of each image is independently changed to a random class with probability $p$, following the same setting as in (Zhang et al., 2017).

Flip noise. The label of each image is independently flipped to similar classes with total probability $p$. In our experiments, we randomly select two classes as similar classes with equal probability.

1000 images with clean labels in the validation set are randomly selected as the metadata set for all metalearning methods.
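The two corruption settings above can be sketched as follows. This is a minimal illustration rather than the authors' data pipeline: the function names, the `seed` argument, and two interpretation choices are assumptions, namely that under uniform noise the random class is drawn from all classes (so it may coincide with the true one), and that under flip noise each class gets its own pair of randomly chosen "similar" classes.

```python
import random

def uniform_noise(labels, num_classes, p, seed=0):
    """With probability p, replace each label by a class drawn uniformly at random."""
    rng = random.Random(seed)
    return [rng.randrange(num_classes) if rng.random() < p else y for y in labels]

def flip_noise(labels, num_classes, p, seed=0):
    """With total probability p, flip each label to one of two 'similar' classes."""
    rng = random.Random(seed)
    # Assumption: two similar classes are picked at random for every class.
    similar = {c: rng.sample([k for k in range(num_classes) if k != c], 2)
               for c in range(num_classes)}
    noisy = []
    for y in labels:
        if rng.random() < p:
            noisy.append(rng.choice(similar[y]))  # each similar class with prob p/2
        else:
            noisy.append(y)
    return noisy, similar

labels = [i % 10 for i in range(200)]
noisy_uniform = uniform_noise(labels, num_classes=10, p=0.4)
noisy_flip, similar = flip_noise(labels, num_classes=10, p=0.4)
print(sum(a != b for a, b in zip(labels, noisy_uniform)),
      sum(a != b for a, b in zip(labels, noisy_flip)))
```

Under the first assumption, the fraction of labels that actually change under uniform noise is on average slightly below $p$, because a resampled label can equal the original one.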
Structure of student network. We adopt a Wide ResNet-28-10 (WRN-28-10) (Zagoruyko & Komodakis, 2016) for uniform noise and ResNet-32 (He et al., 2016) for flip noise as our student network models.
Comparison methods. We compare our algorithm with the following competing methods.

Reed (Reed et al., 2015) assumes a weighted combination of predicted and original labels as the correct labels, and then performs back propagation.

SModel (Goldberger & BenReuven, 2017) adds an additional softmax layer after the regular classification output layer to model the noise transition matrix.

MentorNet (Jiang et al., 2018) designs a meta learning model to train a teacher network with a sequence of loss and example weight pairs.

Coteaching (Han et al., 2018) trains two networks simultaneously in a 'coteaching' fashion to select small-loss data.

D2L (Ma et al., 2018) develops a dimensionality-driven learning strategy, which monitors the dimensionality of subspaces during training and adapts the loss function accordingly.

L2RW (Ren et al., 2018) leverages an additional metadataset to adaptively assign weights to training examples in every iteration.

GLC (Hendrycks et al., 2018) uses trusted examples in a data-efficient manner to mitigate the effects of label noise on deep CNNs.
In addition, we include two baseline variants for a more comprehensive comparison, following the setting of (Ren et al., 2018). BaseModel refers to the same kind of student network as used in our method, but trained directly on the given training data. For a fair comparison, we also evaluate a Finetuning method, which fine-tunes the result of BaseModel on the metadata with clean labels to further enhance its performance. We also trained the baseline networks on only the 1000 clean images; the performance is evidently worse than that of the proposed method, because the knowledge underlying the large amount of training samples is neglected. We thus do not include these results in the comparison.
Parameter Settings. All the baseline networks were trained using SGD with momentum 0.9, weight decay and an initial learning rate of 0.1. The learning rate is divided by 10 after 18K steps and 19K steps (for a total of 20K steps) in uniform noise, and after 20K steps and 25K steps (for a total of 30K steps) in flip noise. We repeated the experiments 5 times for CIFAR10 and CIFAR100 with different random seeds for network initialization and label noise generation.
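The step-wise learning rate schedule described above can be written as a small function. This is a sketch under assumptions: the function name `learning_rate` and its parameters are hypothetical, and it is assumed that a decay takes effect from the milestone step onward.

```python
def learning_rate(step, base_lr=0.1, milestones=(18_000, 19_000), factor=0.1):
    """Piecewise-constant schedule: multiply by `factor` at every milestone passed."""
    drops = sum(1 for m in milestones if step >= m)
    return base_lr * factor ** drops

# Uniform noise: decay after 18K and 19K steps (20K steps in total).
uniform_schedule = [learning_rate(s) for s in (0, 17_999, 18_000, 19_000)]

# Flip noise: decay after 20K and 25K steps (30K steps in total).
flip_schedule = [learning_rate(s, milestones=(20_000, 25_000))
                 for s in (0, 20_000, 25_000)]
print(uniform_schedule, flip_schedule)
```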
Results. We report accuracy over 5 repetitions for each series of experiments and each competing method in Tables 1 and 2. It can be observed that our method achieves the best performance on almost all datasets and noise rates; the exception is 40% flip noise, where it ranks second. In the 0% noise cases, our method performs only slightly worse than the BaseModel, which is trained directly on clean datasets and should be an upper limit for our method in such cases (without corrupted labels).
Moreover, the performance gap between our method and the competing methods increases as the noise rate rises from 40% to 60% under uniform noise. Even with 60% label noise, our method still obtains a relatively high classification accuracy: on CIFAR100 it reaches 58.35% versus 50.81% for the second-best result, a relative gain of about 15%. This indicates that our method is potentially an effective strategy for semi-/weakly supervised learning, which will be investigated in our future research.
4.2 More Evaluations on Our Algorithm
We then depict more experimental results to evaluate more interesting capabilities of the proposed method. Since our method is inspired by L2RW (Ren et al., 2018), we compare this method, as well as the BaseModel, in this section for better illustration.
Robustness towards label noise overfitting issue. Fig. 2 plots the curves of the minibatch training accuracy calculated on the noisy training set, as well as those calculated simultaneously on clean test data, over the learning iterations. From the figure, we can see that the BaseModel easily overfits to the noisy labels contained in the training set, and its test accuracy quickly degrades after the first learning rate decay. Our method and L2RW are less prone to such overfitting and retain a similar test accuracy until termination. In particular, throughout all our experiments, we find that our method converges significantly faster than the BaseModel and L2RW methods, as clearly shown in Fig. 2, and reaches its peak performance at around 20K steps, as compared with the 60K iterations required for the other two methods, as shown in Fig. 2(a). (The uniform noise experiment setting is the same as that of L2RW (Ren et al., 2018), and thus the iteration steps of BaseModel and L2RW follow the original setting.) We thus only report our results at 20K steps in the figure for the other experiments.
Understanding the weighting mechanism. It is useful to understand how our algorithm contributes to learning more robust models during training. By feeding the training samples to the trained student network and VNet, we obtain the loss and the corresponding weight value for every training sample. As shown in the upper right of the subplots in Fig. 3, the relation between loss and weight meets our expectation, i.e., VNet tends to impose smaller weights on samples with larger loss, which possibly carry corrupted labels and should be suppressed. This is consistent with the principle that we should emphasize high-confidence samples that are unbiased with respect to the underlying ground-truth label distribution, while reducing the influence of highly biased ones. Furthermore, we plot the weight distribution of clean and noisy training samples in Fig. 3. It can be seen that almost all large weights belong to clean samples, and the weights of the noisy samples all approach 0, which implies that the trained VNet can distinguish clean and noisy images.
Stability analysis of student network in iteration. The insight of our method is that we have one teacher who consistently supervises the student's learning process and gradually rectifies his/her learning manner. This is expected to make the weights be updated in a stable manner. To verify this point, Fig. 4 plots the weight variation (the difference between adjacent epochs) over the training epochs under 60% noise on the CIFAR10 dataset. It can be seen that the weight variation in our method changes continuously, gradually stabilizes over the iterations, and finally converges to a small value close to 0. In comparison, the weights learned by L2RW fluctuate much more strongly during learning and do not seem to converge. This could explain the consistently better performance of our method as compared with this competing method.
5 Conclusion
In this paper, we propose a novel metalearning method for training a classifier on corrupted labels. Compared with current metalearning methods, our method has few hyperparameters to tune and the structure of its metalearner is very simple. We propose an algorithm for jointly optimizing the student network and VNet online, which closely follows the real process of teaching a student. The working principle of our algorithm can be well explained and its procedure can be easily reproduced. Our empirical results substantiate the robustness of the new algorithm on corrupted training data, as well as the stability of its weight learning over iterations. This weight learning mechanism could be extended to other weight setting problems in machine learning, like ensemble methods and multi-view learning, which will be investigated in our future research.
References
 Andrychowicz et al. (2016) Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., Shillingford, B., and De Freitas, N. Learning to learn by gradient descent by gradient descent. In NeurIPS, 2016.
 Arpit et al. (2017) Arpit, D., Jastrzębski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M. S., Maharaj, T., Fischer, A., Courville, A., Bengio, Y., et al. A closer look at memorization in deep networks. In ICML, 2017.
 Azadi et al. (2016) Azadi, S., Feng, J., Jegelka, S., and Darrell, T. Auxiliary image regularization for deep cnns with noisy labels. In ICLR, 2016.
 Bi et al. (2014) Bi, W., Wang, L., Kwok, J. T., and Tu, Z. Learning to predict from crowdsourced data. In UAI, 2014.
 Chang et al. (2017) Chang, H.S., LearnedMiller, E., and McCallum, A. Active bias: Training more accurate neural networks by emphasizing high variance samples. In NeurIPS, 2017.

 Chawla et al. (2002) Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357, 2002.
 Dehghani et al. (2017) Dehghani, M., Severyn, A., Rothe, S., and Kamps, J. Learning to learn from weak supervision by full supervision. In NeurIPS Workshop, 2017.
 Dehghani et al. (2018) Dehghani, M., Mehrjou, A., Gouws, S., Kamps, J., and Schölkopf, B. Fidelityweighted learning. In ICLR, 2018.
 Fan et al. (2018) Fan, Y., Tian, F., Qin, T., Li, X.Y., and Liu, T.Y. Learning to teach. 2018.

 Fernando & Mkchael (2003) Fernando, D. l. T. and Mkchael, J. B. A framework for robust subspace learning. International Journal of Computer Vision, 54(1):117–142, 2003.
 Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.

Franceschi et al. (2018)
Franceschi, L., Frasconi, P., Salzo, S., and Pontil, M.
Bilevel programming for hyperparameter optimization and metalearning.
In ICML, 2018.  Freund & Schapire (1997) Freund, Y. and Schapire, R. E. A decisiontheoretic generalization of online learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
 Goldberger & BenReuven (2017) Goldberger, J. and BenReuven, E. Training deep neuralnetworks using a noise adaptation layer. In ICLR, 2017.
 Han et al. (2018) Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., and Sugiyama, M. Coteaching: robust training deep neural networks with extremely noisy labels. In NeurIPS, 2018.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.
 Hendrycks et al. (2018) Hendrycks, D., Mazeika, M., Wilson, D., and Gimpel, K. Using trusted data to train deep networks on labels corrupted by severe noise. In NeurIPS, 2018.
 Jiang et al. (2014a) Jiang, L., Meng, D., Mitamura, T., and Hauptmann, A. G. Easy samples first: Self-paced reranking for zero-example multimedia search. In ACM MM, 2014a.
 Jiang et al. (2014b) Jiang, L., Meng, D., Yu, S.-I., Lan, Z., Shan, S., and Hauptmann, A. Self-paced learning with diversity. In NeurIPS, 2014b.
 Jiang et al. (2018) Jiang, L., Zhou, Z., Leung, T., Li, L.-J., and Fei-Fei, L. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, 2018.
 Kawaguchi et al. (2017) Kawaguchi, K., Kaelbling, L. P., and Bengio, Y. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.
 Krizhevsky (2009) Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, 2009.
 Kumar et al. (2010) Kumar, M. P., Packer, B., and Koller, D. Self-paced learning for latent variable models. In NeurIPS, 2010.
 Lake et al. (2015) Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
 Li et al. (2017) Li, Y., Yang, J., Song, Y., Cao, L., Luo, J., and Li, L.-J. Learning from noisy labels with distillation. In ICCV, 2017.
 Liang et al. (2016) Liang, J., Jiang, L., Meng, D., and Hauptmann, A. Learning to detect concepts from webly-labeled video data. In IJCAI, 2016.
 Lin et al. (2018) Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. Focal loss for dense object detection. IEEE transactions on pattern analysis and machine intelligence, 2018.
 Ma et al. (2018) Ma, X., Wang, Y., Houle, M. E., Zhou, S., Erfani, S. M., Xia, S.-T., Wijewickrema, S., and Bailey, J. Dimensionality-driven learning with noisy labels. In ICML, 2018.
 Malisiewicz et al. (2011) Malisiewicz, T., Gupta, A., and Efros, A. A. Ensemble of exemplar-svms for object detection and beyond. In ICCV, 2011.
 Natarajan et al. (2013) Natarajan, N., Dhillon, I. S., Ravikumar, P. K., and Tewari, A. Learning with noisy labels. In NeurIPS, 2013.
 Neyshabur et al. (2017) Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. Exploring generalization in deep learning. In NeurIPS, 2017.
 Novak et al. (2018) Novak, R., Bahri, Y., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. Sensitivity and generalization in neural networks: an empirical study. In ICLR, 2018.
 Ravi & Larochelle (2017) Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In ICLR, 2017.
 Reddi et al. (2016) Reddi, S. J., Hefny, A., Sra, S., Poczos, B., and Smola, A. Stochastic variance reduction for nonconvex optimization. In ICML, 2016.
 Reed et al. (2015) Reed, S., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., and Rabinovich, A. Training deep neural networks on noisy labels with bootstrapping. In ICLR workshop, 2015.
 Ren et al. (2018) Ren, M., Zeng, W., Yang, B., and Urtasun, R. Learning to reweight examples for robust deep learning. In ICML, 2018.
 Shu et al. (2018) Shu, J., Xu, Z., and Meng, D. Small sample learning in big data era. arXiv preprint arXiv:1808.04572, 2018.
 Snell et al. (2017) Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In NeurIPS, 2017.
 Sukhbaatar & Fergus (2015) Sukhbaatar, S. and Fergus, R. Learning from noisy labels with deep neural networks. In ICLR workshop, 2015.
 Sun et al. (2007) Sun, Y., Kamel, M. S., Wong, A. K., and Wang, Y. Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 40(12):3358–3378, 2007.
 Vahdat (2017) Vahdat, A. Toward robustness against label noise in training deep discriminative neural networks. In NeurIPS, 2017.
 Van Rooyen & Williamson (2018) Van Rooyen, B. and Williamson, R. C. A theory of learning with corrupted labels. Journal of Machine Learning Research, 2018.
 Veit et al. (2017) Veit, A., Alldrin, N., Chechik, G., Krasin, I., Gupta, A., and Belongie, S. J. Learning from noisy large-scale datasets with minimal supervision. In CVPR, 2017.
 Wang et al. (2017) Wang, Y., Kucukelbir, A., and Blei, D. M. Robust probabilistic modeling with bayesian data reweighting. In ICML, 2017.
 Wu et al. (2018) Wu, L., Tian, F., Xia, Y., Fan, Y., Qin, T., JianHuang, L., and Liu, T.-Y. Learning to teach with dynamic loss functions. In NeurIPS, 2018.
 Zadrozny (2004) Zadrozny, B. Learning and evaluating classifiers under sample selection bias. In ICML, 2004.
 Zagoruyko & Komodakis (2016) Zagoruyko, S. and Komodakis, N. Wide residual networks. In BMVC, 2016.
 Zhang et al. (2017) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
 Zhang & Sabuncu (2018) Zhang, Z. and Sabuncu, M. R. Generalized cross entropy loss for training deep neural networks with noisy labels. In NeurIPS, 2018.
 Zhuang et al. (2017) Zhuang, B., Liu, L., Li, Y., Shen, C., and Reid, I. D. Attend in groups: a weakly-supervised deep learning framework for learning from web data. In CVPR, 2017.