1 Introduction
In recent years, machine teaching zhu2015machine ; zhu2016teachingdim ; liu2017iterative , also known as learning to teach fan2018dataTeach ; wu2018lossTeach , has become a popular topic in deep learning. Learning to teach is a meta-learning paradigm that involves a teacher model and a student model. The student model is the final target and is used for real tasks such as image classification he2016Resnet and object detection NIPS2015_5638 , while the teacher model guides the training of the student model by, for example, adjusting the weights of training data fan2018dataTeach ; metaweightnet ; jiang2018mentornet ; ren2018learning or generating better loss functions wu2018lossTeach . These approaches have demonstrated promising results in image classification (with both clean and noisy labels) jiang2018mentornet ; metaweightnet , machine translation wu2018lossTeach , and text classification fan2018dataTeach .
Previously, the teacher model utilized only surface information derived from the student model. In fan2018dataTeach ; wu2018lossTeach ; metaweightnet ; jiang2018mentornet , the inputs of the teacher model include the training iteration number, the training loss (as well as the margin schapire1998boosting
), the validation loss, the output of the student model, etc. In those algorithms, the teacher model does not leverage the internal states of the student model, e.g., the values of the second-to-last layer, or of even deeper layers far from the output layer, of a neural-network-based student model. We notice that the internal states of a model have been widely investigated and shown to be effective in many deep learning algorithms and tasks. In ELMo petersetal2018deep , a pre-trained LSTM provides its internal states, i.e., the values of each layer, to downstream tasks as feature representations. In image captioning xu2015show ; anderson2018bottom , a Faster R-CNN NIPS2015_5638 pre-trained on ImageNet provides its internal states (i.e., mean-pooled convolutional features) of the selected regions, serving as representations of images
anderson2018bottom . In knowledge distillation romero2015fitnets ; aguilar2019knowledge , a student model mimics the outputs of the internal layers of the teacher model so as to achieve performance comparable to that of the teacher model. Unfortunately, this kind of deep information is missing in learning to teach algorithms. The success of leveraging internal states in the above applications motivates us to investigate them in learning to teach, which leads to deep interactions between the teacher and the student model.
We propose a new data teaching algorithm in which the teacher model and the student model have deep interactions: the student model provides its internal states (i.e., the values of its second-to-last layer) and, optionally, the values of its output layer as inputs to the teacher model, and the teacher model outputs adaptive weights for training samples, which are used to enhance the training of the student model. Figure 1 illustrates the key difference between our algorithm (the right figure) and previous data teaching algorithms fan2018dataTeach ; wu2018lossTeach (the left figure). We decompose the student model into a feature extractor, which maps the input to an internal state, and a classifier (denoted as “cls”), a relatively shallow model such as a linear classifier that maps the internal state to the final prediction. In previous data teaching algorithms, the teacher model takes only surface information of the student model as input, such as the training and validation losses, which depend on the final prediction and the ground-truth label but not explicitly on the internal states. In contrast, the teacher model in our algorithm leverages both the final outputs and the internal states of the student model as inputs, so more information from the student model is accessible. In our algorithm, the teacher and the student models are jointly optimized in an alternating way: the teacher model is updated according to the validation signal via reverse-mode differentiation maclaurin2015gradient , and the student model tries to minimize the loss on the weighted data. Experimental results on CIFAR-10 and CIFAR-100 krizhevsky2009learning with both clean and noisy labels demonstrate the effectiveness of our algorithm, which achieves promising results compared with previous learning to teach methods.
2 Related work
Assigning weights to different data points has been widely investigated in the literature, where the weights can be either continuous friedman2000additive ; jiang2018mentornet or binary fan2018dataTeach ; bengio2009curriculum . The weights can be explicitly bundled with the data, as in boosting and AdaBoost methods freung1997decision ; hastie2009multi ; friedman2000additive , where the weights of incorrectly classified data are gradually increased, or implicitly realized by controlling the sampling probability, as in hard negative mining malisiewicz2011ensemble , where the harder examples from a previous round are sampled again in the next round. In comparison, in self-paced learning (SPL) Kumar2010SPL , the weights of hard examples are set to zero in the early stage of training, and the threshold is gradually increased during training so that the student model learns from easy to hard. An important motivation for data weighting is to increase the robustness of training, including addressing the problems of imbalanced data SUN20073358 ; dong2017class ; 8012579 , biased data zadrozny2004learning ; ren2018learning , and noisy data angluin1988learning ; reed2014training ; sukhbaatar2014learning ; koh2017understanding .
Besides manually designing weights for the data, another branch of work leverages a meta model to assign weights. Learning to teach fan2018dataTeach is a learning paradigm in which a student model is trained for the real task and a teacher model guides the training of the student model. Based on the collected information, the teacher model provides signals to the student model, which can be the weights of training data fan2018dataTeach , adaptive loss functions wu2018lossTeach , etc. The general scheme of machine teaching is discussed and summarized in zhu2015machine . The concept of teaching can also be found in label propagation gong2016label ; gong2016teaching , pedagogical teaching ho2016showing ; shafto2014rational , etc. liu2017iterative leverages teaching to speed up training, where the teacher model selects training data by balancing the trade-off between the difficulty and the usefulness of the data. metaweightnet ; ren2018learning ; jiang2018mentornet mainly focus on the setting where the data is biased or imbalanced. To the best of our knowledge, our work is the first in which the teacher and the student can deeply interact. In previous work, the teacher model only accesses the surface information of the student model, while our teacher model can access the internal states of the student model. We will empirically verify the benefits of our proposals.
3 Our method
We focus on data teaching in this work, where the teacher model assigns an adaptive weight to each sample. We first introduce the notations used in this work, then describe our algorithm, and finally provide some discussions.
3.1 Notations
We want to learn a mapping from the source domain to the target domain, i.e., the student model. Without loss of generality, we decompose the student model into a feature extractor followed by a decision maker: the feature extractor maps an input from the source domain to a feature vector of fixed dimension, and the decision maker maps this feature vector to the target domain. That is, for any input, the output of the student model is the decision maker applied to the extracted feature. In this work, we mainly consider image classification. Given a classification network, the default decomposition is that the feature extractor produces the output of the second-to-last layer, and the decision maker is a linear classifier taking that output as input.
The teacher model has its own parameters and takes two kinds of input: the internal states of the student model, and surface information such as the training iteration, the training loss, and the labels of the samples. It maps an input sample to a nonnegative scalar, which represents the weight of that sample. The training loss is defined on each sample pair, i.e., an input and its label, and is complemented by a regularization term that is independent of the training samples.
The training and validation sets each contain a finite number of labeled samples. The validation metric takes the ground-truth label and the predicted label as its two inputs, and we require it to be differentiable w.r.t. its second input. It can be specialized as the expected accuracy wu2018lossTeach or the log-likelihood on the validation set. The validation performance of a student model is measured by this metric on the validation set.
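To make the decomposition concrete, the following minimal sketch builds a toy student as a feature extractor followed by a linear classifier. The layer sizes, the single ReLU layer used as feature extractor, and all function names are illustrative assumptions, not the networks used in this paper.

```python
import random


def feature_extractor(x, hidden_weights):
    """Toy feature extractor: one ReLU layer mapping the input to an internal state."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in hidden_weights]


def classifier(h, out_weights):
    """Linear classifier ("cls") mapping the internal state to class logits."""
    return [sum(w * hi for w, hi in zip(row, h)) for row in out_weights]


def student(x, hidden_weights, out_weights):
    """Student = classifier applied to the extracted feature; returns both outputs."""
    h = feature_extractor(x, hidden_weights)
    return h, classifier(h, out_weights)


rng = random.Random(0)
in_dim, feat_dim, num_classes = 4, 3, 2
hidden_weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(feat_dim)]
out_weights = [[rng.uniform(-1, 1) for _ in range(feat_dim)] for _ in range(num_classes)]

x = [0.5, -1.0, 2.0, 0.1]
# The internal state is what a teacher with deep interactions would read,
# in addition to (or instead of) the logits.
internal_state, logits = student(x, hidden_weights, out_weights)
```

Returning the internal state alongside the logits is the only change a student needs in order to expose deep information to a teacher.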
3.2 Algorithm
The teacher model outputs a weight for any input data. When facing a real-world machine learning problem, we need to fit a student model on the training data, select the best model according to validation performance, and apply it to the test set. Since the test set is not accessible during training and model selection, we instead maximize the validation performance of the student model. This can be formulated as a bilevel optimization problem:
(1)  
s.t.  
where the coefficient of the regularization term is a hyperparameter and each training sample is assigned a weight by the teacher model. The task of the student model is to minimize the loss on the weighted data, as shown in the second line of Eqn.(1). Without a teacher, all weights are fixed to one. In a learning-to-teach framework, the parameters of the teacher model and those of the student model are jointly optimized. Eqn.(1) is optimized iteratively: we first compute the student parameters for a given teacher, and then update the teacher based on the obtained student. We therefore need to work out how to obtain the student parameters and how to calculate the gradient of the validation metric w.r.t. the teacher parameters.
Obtaining the student parameters: Since a deep neural network is highly non-convex, we generally cannot get a closed-form solution to the inner minimization in Eqn.(1). We choose stochastic gradient descent (SGD) with momentum polyak1964some , which is an iterative algorithm, for optimization. We use a subscript to denote the step in optimization, and each step processes one minibatch of training data. For ease of reference, we collect the weights of the samples in a minibatch into a column vector, and the corresponding per-sample loss terms into another column vector; both are defined in Eqn.(1). Following the implementation of PyTorch NEURIPS2019_9015 , the update rule of momentum SGD is:
(2)
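Here, each step has its own learning rate, and the momentum coefficient is shared across steps. Assume we update the student model for a fixed number of steps. The parameters obtained at the last step then serve as a proxy for the minimizer in Eqn.(1). To stabilize training, we do not backpropagate gradients to the student model through the data weights.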
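As an illustration of the student update, the sketch below runs PyTorch-style momentum SGD (the velocity accumulates the gradient, then the parameter moves against the velocity) on a weighted least-squares loss. The scalar model, the fixed sample weights standing in for the teacher's output, and the hyperparameter values are assumptions made only for this example.

```python
def weighted_loss_grad(theta, batch, weights):
    """Gradient w.r.t. scalar theta of sum_i w_i * 0.5 * (theta * x_i - y_i)^2."""
    return sum(w * (theta * x - y) * x for (x, y), w in zip(batch, weights))


def momentum_sgd_step(theta, velocity, grad, lr, momentum):
    """One momentum SGD step in the PyTorch convention (dampening 0, no Nesterov)."""
    velocity = momentum * velocity + grad
    theta = theta - lr * velocity
    return theta, velocity


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.5)]
weights = [0.5, 0.3, 0.2]  # stand-in for teacher outputs; held fixed here

theta, velocity = 0.0, 0.0
for _ in range(300):
    grad = weighted_loss_grad(theta, batch, weights)
    theta, velocity = momentum_sgd_step(theta, velocity, grad, lr=0.05, momentum=0.9)
```

Because the weights enter the gradient linearly, changing them shifts the minimizer the student converges to, which is the lever the teacher uses.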
Calculating : Motivated by reversemode differentiation maclaurin2015gradient , we use a recursive way to calculate gradients. For ease of reference, let and denote and
respectively. According to the chain rule to compute derivative, for any
, we have(3)  
According to Eqn.(3), we can design Algorithm 1 to calculate the gradients of the teacher model.
In Algorithm 1, we can see that we need a backpropagation interval as an input, indicating how many internal ’s are used to calculate the gradients of the teacher. When , all student models on the optimization trajectory will be leveraged. balances the tradeoff between efficiency and accuracy. To use Algorithm 1, we require .
As shown in step 2, we first calculate , with which we can initialize , and . We then recover the , and the gradients at the previous step (see step 4). Based on Eqn.(3), we recover the corresponding , and . We repeat step 4 and step 5 until getting the eventual , which is the gradient of the validation metric w.r.t. the parameters of the teacher model. Finally, we can leverage any gradientbased algorithm to update the teacher model. With the new teacher model, we can iteratively update and until reaching the stopping criteria.
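The recovery of earlier student parameters can be illustrated with a small reversible-learning sketch in the spirit of maclaurin2015gradient . Assuming the momentum SGD convention in which the velocity is updated before the parameter, both updates can be inverted starting from the final parameter and velocity. The toy quadratic loss and all names below are hypothetical, and the sketch covers only the recovery of the trajectory, not the gradient accumulation of Algorithm 1.

```python
def grad(theta):
    """Gradient of the toy loss 0.5 * (theta - 3)^2."""
    return theta - 3.0


def forward(theta, velocity, lr, momentum, steps):
    """Run momentum SGD and return only the final (theta, velocity)."""
    for _ in range(steps):
        velocity = momentum * velocity + grad(theta)
        theta = theta - lr * velocity
    return theta, velocity


def backward(theta, velocity, lr, momentum, steps):
    """Undo `steps` updates, returning the earlier (theta, velocity) pairs in reverse order."""
    trajectory = []
    for _ in range(steps):
        theta = theta + lr * velocity                    # undo the parameter update
        velocity = (velocity - grad(theta)) / momentum   # undo the velocity update
        trajectory.append((theta, velocity))
    return trajectory


theta0, v0 = 0.0, 0.0
theta_K, v_K = forward(theta0, v0, lr=0.1, momentum=0.9, steps=20)
recovered = backward(theta_K, v_K, lr=0.1, momentum=0.9, steps=20)
theta_rec, v_rec = recovered[-1]  # should match the initial state up to rounding
```

In floating-point arithmetic, the division by the momentum coefficient amplifies rounding error at every reversed step, so long trajectories need extra care.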
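To avoid explicitly calculating the Hessian matrix, whose storage grows quadratically with the number of student parameters, we leverage the property that the product of the Hessian and a vector equals the gradient of the inner product between the loss gradient and that vector, where the vector has the same size as the student parameters and is treated as a constant. With this trick, the required GPU memory grows only linearly with the number of student parameters.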
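The benefit of never materializing the Hessian can be checked on a toy loss. The sketch below is a stand-in, not the autodiff identity used in the paper: it approximates the Hessian-vector product by a central finite difference of the gradient, which likewise needs only gradient evaluations and memory linear in the number of parameters. The loss function, the step size `eps`, and all names are assumptions for illustration.

```python
def grad(theta):
    """Gradient of the toy loss sum_i theta_i^4 / 4 + theta_0 * theta_1."""
    g = [t ** 3 for t in theta]
    g[0] += theta[1]
    g[1] += theta[0]
    return g


def hvp(theta, v, eps=1e-5):
    """Hessian-vector product from two gradient calls (central finite difference)."""
    plus = grad([t + eps * vi for t, vi in zip(theta, v)])
    minus = grad([t - eps * vi for t, vi in zip(theta, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]


def explicit_hessian(theta):
    """Full Hessian of the same loss, built only to check the result."""
    n = len(theta)
    hessian = [[0.0] * n for _ in range(n)]
    for i, t in enumerate(theta):
        hessian[i][i] = 3.0 * t * t
    hessian[0][1] = hessian[1][0] = 1.0
    return hessian


theta = [1.0, -2.0, 0.5]
v = [0.3, 0.1, -1.0]
approx = hvp(theta, v)
H = explicit_hessian(theta)
exact = [sum(H[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
```

With an automatic differentiation library, the same product is typically obtained exactly by differentiating the inner product of the gradient and the vector.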
Discussions: Compared to previous work jiang2018mentornet ; metaweightnet ; fan2018dataTeach ; wu2018lossTeach , besides the key difference that we use internal states as features, there are also differences in optimization. In fan2018dataTeach , the teacher is learned with reinforcement learning, which is relatively hard to optimize. In wu2018lossTeach , the student model is optimized with vanilla SGD, which requires storing all the intermediate student parameters. In our algorithm, we use momentum SGD, for which we only need to store the final student parameters and momentum, from which all intermediate parameters can be recovered. We will study how to effectively apply our derivations to more optimizers and more applications in the future.
3.3 Teacher model
We introduce the default network architecture of the teacher model used in the experiments. We use a linear model with sigmoid activation. Given a training pair, we first use the feature extractor of the student model to obtain the output of the second-to-last layer. The surface feature we choose is the one-hot representation of the label. The weight of the sample is then obtained by applying the sigmoid function to a learned linear function of these two inputs. The parameters applied to the one-hot label can be regarded as an embedding matrix, which enriches the representations of the labels. One can easily extend the teacher model to a multilayer feed-forward network by replacing the linear model with a deeper network.
We need to normalize the weights within a minibatch. When a minibatch arrives, the weight of each sample is first calculated and then normalized over the minibatch, which ensures that the sum of the weights within a batch is always the same.
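A minimal sketch of such a teacher is given below. The exact parameterization is an assumption: the score is taken as a linear function of the internal state plus a linear function of the label embedding plus a bias, and the normalization divides by the batch sum so that the weights sum to one. All names (`w_h`, `w_e`, `embedding`, `bias`) and values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def teacher_weight(h, label, w_h, embedding, w_e, bias):
    """Sigmoid of a linear score over the internal state and the label embedding."""
    label_vec = embedding[label]  # row lookup equals embedding matrix times one-hot label
    score = sum(a * b for a, b in zip(w_h, h))
    score += sum(a * b for a, b in zip(w_e, label_vec))
    return sigmoid(score + bias)


def normalize(weights):
    """Rescale the weights of one minibatch so that they sum to one (assumed constant)."""
    total = sum(weights)
    return [w / total for w in weights]


# Toy teacher parameters: 3-dimensional internal state, 2 classes, 2-dimensional embedding.
w_h = [0.4, -0.2, 0.1]
embedding = [[1.0, 0.0], [0.5, -0.5]]
w_e = [0.3, 0.7]
bias = -0.1

# A minibatch of (internal state, label) pairs as produced by the student.
batch = [([0.2, 1.0, 0.0], 0), ([1.5, 0.1, 0.3], 1), ([0.0, 0.0, 2.0], 1)]
raw_weights = [teacher_weight(h, y, w_h, embedding, w_e, bias) for h, y in batch]
batch_weights = normalize(raw_weights)
```

The sigmoid keeps every raw weight strictly positive, so the normalization is always well defined.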
4 Experiments on CIFAR-10/100 with clean labels
In this section, we conduct experiments on CIFAR-10 and CIFAR-100 image classification with clean labels. We first show the overall results and then provide several analyses.
4.1 Settings
There are 50,000 and 10,000 images in the training and test sets, respectively. CIFAR-10 and CIFAR-100 are 10-class and 100-class classification tasks, respectively. We hold out part of the training dataset as the validation set and use the remaining samples as the training set. Following he2016Resnet , we use momentum SGD and divide the learning rate by a constant factor at two scheduled epochs. The momentum coefficient and the minibatch size are kept fixed. The model update interval and the backpropagation interval in Algorithm 1 are fixed hyperparameters. We train the models long enough to ensure convergence. We conduct experiments on ResNet-32, ResNet-110 and Wide ResNet-28-10 (WRN-28-10) BMVC2016_87 . All models are trained on a single P40 GPU.
We compare our results with the following baselines: (1) Data teaching fan2018dataTeach and loss function teaching wu2018lossTeach , denoted as L2T-data and L2T-loss respectively. (2) Focal loss lin2017focal , where each sample is weighted by a power of one minus the probability that the sample is correctly classified, with the exponent as a hyperparameter. We search the exponent over the values suggested by lin2017focal . (3) Self-paced learning (SPL) Kumar2010SPL , where we start from easy samples and then move to harder ones.
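For reference, the focal-loss weighting used as a baseline can be written in a few lines. The sketch follows the modulating factor of lin2017focal , i.e., one minus the probability of the correct class raised to a focusing exponent; the probabilities and the exponent value below are illustrative, not the values searched in our experiments.

```python
def focal_weight(p_correct, gamma):
    """Focal-loss modulating factor: (1 - p)^gamma for correct-class probability p."""
    return (1.0 - p_correct) ** gamma


# Easy, medium, and hard samples, ordered by decreasing confidence in the true class.
probs = [0.9, 0.5, 0.1]
weights = [focal_weight(p, gamma=2.0) for p in probs]

# With gamma = 0 the factor is constant, which recovers unweighted training.
uniform = [focal_weight(p, gamma=0.0) for p in probs]
```

The weight depends on the sample only through the predicted probability, which is why this baseline counts as using surface information only.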
4.2 Results
The test error rates of different settings are reported in Table 1. For CIFAR-10, with ResNet-32, ResNet-110 and WRN-28-10 as student models, our method obtains the lowest test error rates among all listed algorithms. For CIFAR-100, our approach also improves over the baseline for all three architectures. These consistent improvements demonstrate the effectiveness of our method. We have the following observations: (1) L2T-data is proposed to speed up training; accordingly, its error rates are almost the same as those of the baselines. (2) L2T-loss improves over the baseline on both CIFAR-10 and CIFAR-100, but its improvements are far behind those of our method, which shows the advantage of our method over previous learning to teach algorithms. (3) Focal loss sets the weights of the data according to hardness only and does not leverage internal states either; there is a non-negligible gap between focal loss and our method. (4) For SPL, the results are similar to, or even worse than, the baseline, which shows the importance of a learning-based scheme for data selection.
CIFAR  Baseline  L2Tdata fan2018dataTeach  L2Tloss wu2018lossTeach  Focal loss  SPL Kumar2010SPL  Ours 

ResNet32  
ResNet110  N/A  N/A  
WRN2810  N/A  N/A  
CIFAR  Baseline  L2Tdata fan2018dataTeach  L2Tloss wu2018lossTeach  Focal loss  SPL Kumar2010SPL  Ours 
ResNet32  
ResNet110  N/A  N/A  
WRN2810  N/A  N/A 
4.3 Analysis
To further verify how our method works, we conduct several ablation studies. All experiments are conducted on CIFAR-10 with ResNet-32.
Comparison with surface information: The features of the teacher model used in Table 1 are the output of the second-to-last layer of the network and the label embedding. Based on metaweightnet ; ren2018learning ; wu2018lossTeach ; fan2018dataTeach , we define another group of features that carry surface information. It has five components: the training iteration (normalized by the total number of iterations), the average training loss up to the current iteration, the best validation accuracy up to the current iteration, the predicted label of the current input, and the margin values.
For the teacher model, we try different combinations of the internal states and surface features. The settings and results are shown in Table 2.
Setting  

Error rate 
As shown in Table 2, we can see that the results of using surface features only (i.e., the settings without ) cannot catch up with those with internal states of the network (i.e., the settings with ). This shows the effectiveness of the internal states for learning to teach. We do not observe significant differences among the settings , and . Using only can result in less improvement than using . Combining , and also slightly hurts the result. Therefore, we choose as the default setting.
Internal states from different levels: By default, we use the output of secondtolast layer as the features of internal states. We also try several other variants, naming , and , which are the outputs of the last convolutional layer with size , and . A larger subscript represents that the corresponding features are more similar to the raw input. We explore the setting , . Results are reported in Table 4. We can see that leveraging internal states (i.e., ) can achieve lower test error rates than those without such features. Currently, there is not significant difference on where the internal states are from. Therefore, by default, we recommend to use the states from the secondtolast layer.
Architectures of the teacher models: We explore teacher networks with different numbers of hidden layers, each followed by a ReLU activation (denoted as MLP#layer). The dimension of the hidden states is the same as that of the input. Results are in Table 4.
Using a more complex teacher model does not bring improvement over the simplest one used in the default setting. Our conjecture is that more complex models are harder to optimize and therefore cannot provide accurate signals for the student models.
Analysis on the weights: We compare the weights output by a teacher model leveraging surface features only (the surface teacher) with those output by our teacher leveraging internal features (the internal teacher). The results are shown in Figure 2, where the top row shows the results of the surface teacher and the bottom row those of the internal teacher. In Figure 2(a), (b), (d), (e), data points of the same category are painted with the same color. The first column shows the correlation between the output data weight and the training loss; the second column visualizes the internal states through t-SNE maaten2008visualizing ; the third column plots heatmaps of the output weights of all data points (red means a larger weight and blue a smaller one), laid out in accordance with the second column. We have the following observations:
(1) As shown in the first column, the surface teacher tries to assign lower weights to the data with higher loss, regardless of the category the image belongs to. In contrast, the weights set by the internal teacher heavily rely on the category information. For example, the data points of one class have the highest weights regardless of the training loss, followed by those of a second class; these two classes correspond to “cat” and “dog” in CIFAR-10, respectively.
(2) To further investigate why the data of the cat class and the dog class are assigned larger weights by the internal teacher, we turn to Figure 2(e), from which we can see that the internal states of dog and cat overlap heavily. We therefore hypothesize that, since dogs and cats are somewhat similar to each other, the internal teacher learns to separate these two classes by assigning large weights to them. This phenomenon is not observed for the surface teacher.
Preliminary exploration on deeper interactions: To stabilize training, we do not backpropagate the gradients to the student model via the weights, i.e., the corresponding gradient term is set to zero. If we enable it, the teacher model has another path to pass the supervision signal to the student model, which has great potential to improve performance. We quickly verify this variant on CIFAR-10 using ResNet-32.
With this technique, we can further lower the test error rate compared to the current method. We will further explore this direction in the future.
5 Experiments on CIFAR-10/100 with noisy labels
To verify the ability of our proposed method to deal with noisy data, we conduct several experiments on the CIFAR-10/100 datasets with noisy labels.
5.1 Settings
We derive most of the settings from metaweightnet . The images remain the same as those in standard CIFAR10/100, but then we introduce noise to the labels. This is to verify the effectiveness of an algorithm under the noisy setting. Two types of noise, the uniform noise and flip noise, will be introduced. For the validation and test sets, both the images and the labels are clean.

Uniform noise: We follow a common setting from zhang2017understanding . With a given probability, the label of each image is replaced by a class drawn uniformly at random, and we experiment with two such noise probabilities. Following metaweightnet , the student network is WRN-28-10. We use momentum SGD and divide the learning rate by a constant factor at two scheduled epochs.

Flip noise: We follow metaweightnet to set flip noise. With a given probability, the label of each image is independently flipped to one of two similar classes. The two similar classes are randomly chosen, and we flip labels to them with equal probability. We experiment with two flip probabilities and adopt ResNet-32 as the student model. We use momentum SGD and divide the learning rate by a constant factor at two scheduled epochs.
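The two corruption schemes can be sketched as follows. This is a hedged illustration rather than the exact protocol of metaweightnet : it assumes that the uniform replacement may return the original class, that the two "similar" classes are drawn once per class and exclude the class itself, and the probabilities, seed, and function names are made up for the example.

```python
import random


def uniform_noise(labels, num_classes, prob, rng):
    """With probability `prob`, replace a label by a class drawn uniformly at random."""
    return [rng.randrange(num_classes) if rng.random() < prob else y for y in labels]


def make_flip_targets(num_classes, rng):
    """For every class, randomly pick two other classes to act as its similar classes."""
    return {
        c: rng.sample([k for k in range(num_classes) if k != c], 2)
        for c in range(num_classes)
    }


def flip_noise(labels, flip_targets, prob, rng):
    """With probability `prob`, flip a label to one of its two similar classes (50/50)."""
    return [rng.choice(flip_targets[y]) if rng.random() < prob else y for y in labels]


rng = random.Random(0)
num_classes = 10
clean = [i % num_classes for i in range(1000)]

noisy_uniform = uniform_noise(clean, num_classes, 0.4, rng)
targets = make_flip_targets(num_classes, rng)
noisy_flip = flip_noise(clean, targets, 0.2, rng)
```

Only the training labels would be corrupted this way; the validation and test labels stay clean, as stated above.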
For the teacher model, we follow settings in Section 4. We compare the results with MentorNet jiang2018mentornet and MetaWeightNet metaweightnet .
5.2 Results
The results are shown in Table 5 and Table 6. Our results are better than those of previous baselines such as MentorNet and MetaWeightNet, regardless of the type and magnitude of the noise. With uniform noise, we improve over MetaWeightNet. With flip noise and the ResNet-32 network, the improvement is more significant: in most cases, we improve over the baseline by more than one point.
The experiment results demonstrate that leveraging internal states is also useful for the datasets with noisy labels. This shows the generality of our proposed method.
CIFAR  CIFAR  

Method  
Baseline  
MentorNet jiang2018mentornet  
MetaWeightNet metaweightnet  
Ours 
CIFAR  CIFAR  

Method  
Baseline  
MentorNet jiang2018mentornet  
MetaWeightNet metaweightnet  
Ours 
6 Conclusion and future work
We propose a new data teaching paradigm in which the teacher and the student model have deep interactions. The internal states are fed into the teacher model to calculate the weights of the data, and we propose an algorithm to jointly optimize the two models. Experiments on CIFAR-10 and CIFAR-100 with clean and noisy labels demonstrate the effectiveness of our approach, and extensive ablation studies are provided.
For future work, first, we will study how to apply deeper interactions to the learning to teach framework (see the preliminary results in Section 4.3). Second, we want the teacher model to be transferable across different tasks, which the current teacher lacks (see Appendix B for an exploration). Third, we will carry out a theoretical analysis of the convergence of the optimization algorithm.
Appendix
Appendix A Ablation study on different and
In this section, we conduct an ablation study to explore the impact of different model update interval and backpropagation interval on our algorithm. We adopt CIFAR and ResNet32 as the base dataset and student model respectively. We choose in our ablation study. In Table 1 in the main paper, we use and as the default setting.
Test error  

The ablation study results are reported in Table 7. We observe that (1) the setting that runs backpropagation at every step incurs a high computational cost and makes the teacher model hard to optimize, and (2) our default setting reaches the lowest test error rate among all settings.
Appendix B Transferability of the teacher across different tasks
In this section, we conduct some experiments to explore the transferability of our teacher models across different tasks.
We choose our best setting (the dataset is CIFAR, and the architecture of the student model is ResNet32) in Table 1 as the original teacher model, and adopt two transfer settings.
(1) Transfer to different dataset: We transfer our original teacher model from CIFAR to CIFAR dataset. The network architecture of the student model remains unchanged.
(2) Transfer to different student model: We change the student model architecture from ResNet32 to ResNet110. The dataset remains unchanged.
In the above two settings, we train the student models from scratch and fix the parameters of the teacher models. The teacher models provide weights for the input data.
The test error rates of Transfer to different dataset and Transfer to different student model are and respectively. Our teacher models lack transferability due to deep interactions between the teacher and the student models. We will improve our algorithm to enhance the transferability in future.
References
 (1) Gustavo Aguilar, Yuan Ling, Yu Zhang, Benjamin Yao, Xing Fan, and Edward Guo. Knowledge distillation from internal representations. AAAI, 2020.

 (2) Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6077–6086, 2018.
 (3) Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning, 2(4):343–370, 1988.
 (4) Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48, 2009.
 (5) Qi Dong, Shaogang Gong, and Xiatian Zhu. Class rectification hard mining for imbalanced deep learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 1851–1860, 2017.
 (6) Yang Fan, Fei Tian, Tao Qin, Xiang-Yang Li, and Tie-Yan Liu. Learning to teach. In Sixth International Conference on Learning Representations, 2018.
 (7) Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

 (8) Jerome Friedman, Trevor Hastie, Robert Tibshirani, et al. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The annals of statistics, 28(2):337–407, 2000.
 (9) Chen Gong, Dacheng Tao, Wei Liu, Liu Liu, and Jie Yang. Label propagation via teaching-to-learn and learning-to-teach. IEEE transactions on neural networks and learning systems, 28(6):1452–1465, 2016.
 (10) Chen Gong, Dacheng Tao, Jie Yang, and Wei Liu. Teaching-to-learn and learning-to-teach for multi-label propagation. In Thirtieth AAAI conference on artificial intelligence, 2016.
 (11) Trevor Hastie, Saharon Rosset, Ji Zhu, and Hui Zou. Multi-class AdaBoost. Statistics and its Interface, 2(3):349–360, 2009.
 (12) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
 (13) Mark K Ho, Michael Littman, James MacGlashan, Fiery Cushman, and Joseph L Austerweil. Showing versus doing: Teaching by demonstration. In Advances in neural information processing systems, pages 3027–3035, 2016.
 (14) Lu Jiang, Zhengyuan Zhou, Thomas Leung, LiJia Li, and Li FeiFei. Mentornet: Learning datadriven curriculum for very deep neural networks on corrupted labels. In Thirtyfifth International Conference on Machine Learning, 2018.

 (15) S. H. Khan, M. Hayat, M. Bennamoun, F. A. Sohel, and R. Togneri. Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Transactions on Neural Networks and Learning Systems, 29(8):3573–3587, 2018.
 (16) Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1885–1894. JMLR.org, 2017.
 (17) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
 (18) M. P. Kumar, Benjamin Packer, and Daphne Koller. Selfpaced learning for latent variable models. In J. D. Lafferty, C. K. I. Williams, J. ShaweTaylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1189–1197. Curran Associates, Inc., 2010.
 (19) TsungYi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
 (20) Ji Liu and Xiaojin Zhu. The teaching dimension of linear learners. Journal of Machine Learning Research, 17(162):1–25, 2016.
 (21) Weiyang Liu, Bo Dai, Ahmad Humayun, Charlene Tay, Chen Yu, Linda B Smith, James M Rehg, and Le Song. Iterative machine teaching. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 2149–2158. JMLR. org, 2017.
 (22) Laurens van der Maaten and Geoffrey Hinton. Visualizing data using tsne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
 (23) Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradientbased hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pages 2113–2122, 2015.
 (24) Tomasz Malisiewicz, Abhinav Gupta, and Alexei A Efros. Ensemble of exemplarsvms for object detection and beyond. In 2011 International conference on computer vision, pages 89–96. IEEE, 2011.
 (25) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, highperformance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. 2019.
 (26) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana, June 2018. Association for Computational Linguistics.
 (27) Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
 (28) Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596, 2014.
 (29) Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In Thirtyfifth International Conference on Machine Learning, 2018.
 (30) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 91–99. Curran Associates, Inc., 2015.
 (31) Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. ICLR, 2015.
 (32) Robert E Schapire, Yoav Freund, Peter Bartlett, Wee Sun Lee, et al. Boosting the margin: A new explanation for the effectiveness of voting methods. The annals of statistics, 26(5):1651–1686, 1998.
 (33) Patrick Shafto, Noah D Goodman, and Thomas L Griffiths. A rational account of pedagogical reasoning: Teaching by, and learning from, examples. Cognitive psychology, 71:55–89, 2014.
 (34) Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, and Deyu Meng. Metaweightnet: Learning an explicit mapping for sample weighting. In Advances in Neural Information Processing Systems 32, pages 1919–1930. Curran Associates, Inc., 2019.
 (35) Sainbayar Sukhbaatar and Rob Fergus. Learning from noisy labels with deep neural networks. arXiv preprint arXiv:1406.2080, 2(3):4, 2014.
 (36) Yanmin Sun, Mohamed S. Kamel, Andrew K.C. Wong, and Yang Wang. Costsensitive boosting for classification of imbalanced data. Pattern Recognition, 40(12):3358 – 3378, 2007.
 (37) Lijun Wu, Fei Tian, Yingce Xia, Yang Fan, Tao Qin, Lai JianHuang, and TieYan Liu. Learning to teach with dynamic loss functions. In Advances in Neural Information Processing Systems 31, pages 6466–6477, 2018.
 (38) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015.
 (39) Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the twentyfirst international conference on Machine learning, page 114, 2004.
 (40) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock Richard C. Wilson and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.
 (41) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
 (42) Xiaojin Zhu. Machine teaching: An inverse problem to machine learning and an approach toward optimal education. In TwentyNinth AAAI Conference on Artificial Intelligence, 2015.