Learning to Teach with Deep Interactions

07/09/2020 ∙ by Yang Fan, et al. ∙ Microsoft, USTC

Machine teaching uses a meta/teacher model to guide the training of a student model (which will be used in real tasks) through training data selection, loss function design, etc. In previous work, the teacher model takes only shallow/surface information as input (e.g., training iteration number, loss and accuracy from training/validation sets) and ignores the internal states of the student model, which limits the potential of learning to teach. In this work, we propose an improved data teaching algorithm, where the teacher model deeply interacts with the student model by accessing its internal states. The teacher model is jointly trained with the student model using meta gradients propagated from a validation set. We conduct experiments on image classification with clean/noisy labels and empirically demonstrate that our algorithm significantly improves over previous data teaching methods.




1 Introduction

In recent years, machine teaching zhu2015machine ; zhu2016teachingdim ; liu2017iterative , also known as learning to teach fan2018dataTeach ; wu2018lossTeach , has become a popular topic in deep learning. Learning to teach is a meta-learning paradigm that involves a teacher model and a student model. The student model is our final target and is used for real tasks, like image classification he2016Resnet , object detection NIPS2015_5638 , etc., while the teacher model guides the training of the student model by adjusting the weights of training data fan2018dataTeach ; metaweightnet ; jiang2018mentornet ; ren2018learning , generating better loss functions wu2018lossTeach , etc. These approaches have demonstrated promising results in image classification (with both clean and noisy labels) jiang2018mentornet ; metaweightnet , machine translation wu2018lossTeach , and text classification fan2018dataTeach .

In previous work, the teacher model only utilizes surface information derived from the student model. In fan2018dataTeach ; wu2018lossTeach ; metaweightnet ; jiang2018mentornet , the inputs of the teacher model include the training iteration number, training loss (as well as the margin schapire1998boosting ), validation loss, the output of the student model, etc. In those algorithms, the teacher model does not leverage the internal states of the student model, e.g., the values of the second-to-last layer and even deeper layers far from the output layer of a neural network based student model. We notice that the internal states of a model have been widely investigated and shown to be effective in many deep learning algorithms and tasks. In ELMo, a pre-trained LSTM provides its internal states, the values of each layer, for downstream tasks as feature representations. In image captioning tasks xu2015show ; anderson2018bottom , a Faster R-CNN NIPS2015_5638 pre-trained on ImageNet provides its internal states (i.e., mean-pooled convolutional features) of the selected regions, serving as representations of images anderson2018bottom . In knowledge distillation romero2015fitnets ; aguilar2019knowledge , a student model mimics the output of the internal layers of the teacher model so as to achieve comparable performance with the teacher model. Unfortunately, this kind of deep information is missing in learning to teach algorithms. The success of leveraging internal states in the above applications motivates us to investigate them in learning to teach, which leads to deep interactions between the teacher and student model.

We propose a new data teaching algorithm, where the teacher model and the student model have deep interactions: the student model provides its internal states (i.e., the values of its second-to-last layer) and optionally the values of its output layer as the inputs of the teacher model, and the teacher model outputs adaptive weights of training samples, which are used to enhance the training of the student model. Figure 1 illustrates the key difference between our algorithm (the right figure) and the previous data teaching algorithm fan2018dataTeach ; wu2018lossTeach (the left figure). We decompose the student model into a feature extractor, which maps the input to an internal state, and a classifier (denoted as “cls”), which is a relatively shallow model like a linear classifier that maps the internal state to the final prediction. In the previous data teaching algorithm, the teacher model only takes the surface information of the student model as inputs, like training and validation loss, which are related to the final prediction and the ground truth label but not explicitly related to the internal states. In contrast, the teacher model in our algorithm leverages both the final outputs and the internal states of the student model as inputs. In this way, more information from the student model is accessible. In our algorithm, the teacher and the student models are jointly optimized in an alternating way, where the teacher model is updated according to the validation signal via reverse-mode differentiation maclaurin2015gradient , and the student model tries to minimize the loss on weighted data. Experimental results on CIFAR-10 and CIFAR-100 krizhevsky2009learning with both clean labels and noisy labels demonstrate the effectiveness of our algorithm. We achieve promising results over previous methods of learning to teach.

Figure 1: Comparison between the previous teacher model fan2018dataTeach (left) and ours (right).

The remaining part is organized as follows. Related work is discussed in Section 2. Our algorithm is introduced in Section 3. The experiments with clean labels and noisy labels are reported in Section 4 and Section 5 respectively. Section 6 concludes this paper and discusses future directions.

2 Related work

Assigning weights to different data points has been widely investigated in the literature, where the weights can be either continuous friedman2000additive ; jiang2018mentornet or binary fan2018dataTeach ; bengio2009curriculum . The weights can be explicitly bundled with data, as in Boosting and AdaBoost methods freung1997decision ; hastie2009multi ; friedman2000additive where the weights of incorrectly classified data are gradually increased, or implicitly achieved by controlling the sampling probability, as in hard negative mining malisiewicz2011ensemble where the harder examples in a previous round will be sampled again in the next round. In comparison, in self-paced learning (SPL) Kumar2010SPL , the weights of hard examples are set to zero in the early stage of training, and the threshold is gradually increased during the training process so that the student model learns from easy to hard. An important motivation of data weighting is to increase the robustness of training, including addressing the problems of imbalanced data SUN20073358 ; dong2017class ; 8012579 , biased data zadrozny2004learning ; ren2018learning , and noisy data angluin1988learning ; reed2014training ; sukhbaatar2014learning ; koh2017understanding .

Besides manually designing weights for the data, there is another branch of work that leverages a meta model to assign weights. Learning to teach fan2018dataTeach is a learning paradigm where there is a student model for the real task, and a teacher model to guide the training of the student model. Based on the collected information, the teacher model provides signals to the student model, which can be the weights of training data fan2018dataTeach , adaptive loss functions wu2018lossTeach , etc. The general scheme of machine teaching is discussed and summarized in zhu2015machine . The concept of teaching can be found in label propagation gong2016label ; gong2016teaching , pedagogical teaching ho2016showing ; shafto2014rational , etc. liu2017iterative leverages teaching to speed up the training, where the teacher model selects the training data by balancing the trade-off between the difficulty and usefulness of the data. metaweightnet ; ren2018learning ; jiang2018mentornet mainly focus on the setting where the data is biased or imbalanced. To the best of our knowledge, our work is the first in which the teacher and student can deeply interact. In previous work, the teacher model only accesses the surface information of the student model, while our teacher model can access the internal states of the student model. We will empirically verify the benefits of our proposals.

3 Our method

We focus on data teaching in this work, where the teacher model assigns an adaptive weight to each sample. We first introduce the notations used in this work, then describe our algorithm, and finally provide some discussions.

3.1 Notations

Let $\mathcal{X}$ and $\mathcal{Y}$ denote the source domain and the target domain respectively. We want to learn a mapping $f$, i.e., the student model, from $\mathcal{X}$ to $\mathcal{Y}$. W.l.o.g., we can decompose $f$ into a feature extractor and a decision maker, denoted as $\varphi_f$ and $\varphi_c$ respectively, where $\varphi_f: \mathcal{X} \to \mathbb{R}^d$, $\varphi_c: \mathbb{R}^d \to \mathcal{Y}$, and $d$ is the dimension of the extracted feature. That is, for any $x \in \mathcal{X}$, $f(x) = \varphi_c(\varphi_f(x))$. We denote the parameters of $f$ as $\omega$. In our work, we mainly work on the image classification problem. Given a classification network $f$, the default segmentation method is that $\varphi_f(x)$ is the output of the second-to-last layer, and $\varphi_c$ is a linear classifier taking $\varphi_f(x)$ as input.

Let $\varphi_t(\mathcal{I}, \mathcal{S}; \theta)$ denote the teacher model parameterized by $\theta$, where $\mathcal{I}$ is the internal states of a student model and $\mathcal{S}$ is the surface information like training iteration, training loss, labels of the samples, etc. $\varphi_t$ maps an input sample $(x, y)$ to a non-negative scalar, representing the weight of the sample. Let $\ell(f(x), y; \omega)$ denote the training loss on sample pair $(x, y)$, and $R(\omega)$ is a regularization term on $\omega$, independent of the training samples.

Let $D_{train}$ and $D_{valid}$ denote the training and validation sets respectively, both of which are subsets of $\mathcal{X} \times \mathcal{Y}$ with $N_{train}$ and $N_{valid}$ samples. Denote the validation metric as $m(y, \hat{y})$, where $y$ and $\hat{y}$ are the ground truth label and predicted label respectively. We require that $m$ should be a differentiable function w.r.t. the second input. $m$ can be specialized as the expected accuracy wu2018lossTeach or the log-likelihood on the validation set. Define $M(D_{valid}; \omega)$ as $\frac{1}{N_{valid}} \sum_{(x, y) \in D_{valid}} m(y, f(x))$.
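The decomposition of the student into a feature extractor and a decision maker, and the validation metric as an average over the validation set, can be sketched in a few lines. This is a minimal sketch in plain Python, not the networks used in the paper: the tiny tanh feature extractor, the softmax classifier, and all function names are assumptions made for illustration, and the metric is specialized to the log-likelihood.

```python
import math

def feature_extractor(x, W_f):
    """Hypothetical feature extractor: internal state h = tanh(W_f x)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_f]

def classifier(h, W_c):
    """Hypothetical decision maker: linear layer followed by a softmax."""
    logits = [sum(w * hi for w, hi in zip(row, h)) for row in W_c]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def student(x, W_f, W_c):
    """Student = classifier(feature_extractor(x)); also return the internal state."""
    h = feature_extractor(x, W_f)
    return classifier(h, W_c), h

def validation_metric(data, W_f, W_c):
    """Average log-likelihood of the true labels over a validation set."""
    return sum(math.log(student(x, W_f, W_c)[0][y]) for x, y in data) / len(data)

W_f = [[0.1, -0.2], [0.3, 0.4], [0.0, 0.5]]   # d = 3 features from 2 inputs
W_c = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]     # 2 classes
probs, h = student([1.0, 2.0], W_f, W_c)
score = validation_metric([([1.0, 2.0], 0), ([0.0, 1.0], 1)], W_f, W_c)
```

The internal state `h` returned alongside the prediction is what a teacher with deep interactions would consume, in addition to surface information such as the label.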

3.2 Algorithm

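The teacher model outputs a weight for any input data. When facing a real-world machine learning problem, we need to fit a student model on the training data, select the best model according to validation performance, and apply it to the test set. Since the test set is not accessible during training and model selection, we need to maximize the validation performance of the student model. This can be formulated as a bi-level optimization problem:

$$
\begin{aligned}
&\max_{\theta}\; M(D_{valid}; \omega^*(\theta)) \\
&\text{s.t.}\;\; \omega^*(\theta) = \arg\min_{\omega}\; \frac{1}{N_{train}} \sum_{i=1}^{N_{train}} w(x_i, y_i; \theta)\, \ell(f(x_i), y_i; \omega) + \lambda R(\omega), \\
&\phantom{\text{s.t.}\;\;} w(x_i, y_i; \theta) = \varphi_t(\varphi_f(x_i), \mathcal{S}_i; \theta),
\end{aligned}
\tag{1}
$$

where $M$ is the validation metric, $\ell$ the training loss, $R$ the regularization term, $\varphi_t$ the teacher model, $\varphi_f$ the feature extractor of the student and $\mathcal{S}_i$ the surface information of the $i$-th sample (see Section 3.1). $\lambda$ is a hyperparameter, and $w(x_i, y_i; \theta)$ represents the weight of data $(x_i, y_i)$. The task of the student model is to minimize the loss on weighted data, as shown in the second line of Eqn.(1). Without a teacher, all $w(x_i, y_i; \theta)$'s are fixed as one. In a learning-to-teach framework, the parameters of the teacher model (i.e., $\theta$) and the student model (i.e., $\omega$) are jointly optimized. Eqn.(1) is optimized in an iterative way, where we calculate $\omega^*(\theta)$ based on a given $\theta$, then we update $\theta$ based on the obtained $\omega^*(\theta)$. We need to figure out how to obtain $\omega^*(\theta)$, and how to calculate $\partial M(D_{valid}; \omega^*(\theta)) / \partial \theta$.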
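The student's objective, the loss on weighted data plus a regularization term, can be written as a short function. This is a minimal sketch assuming per-sample losses have already been computed; the function name and the example numbers are hypothetical, and the averaging over the number of samples is an assumption about the normalization.

```python
def weighted_objective(losses, weights, reg_value, lam):
    """Average of weight * loss over the samples, plus lam times the regularizer."""
    n = len(losses)
    return sum(w * l for w, l in zip(weights, losses)) / n + lam * reg_value

losses = [1.0, 3.0]
no_teacher = weighted_objective(losses, [1.0, 1.0], reg_value=0.0, lam=0.1)
with_teacher = weighted_objective(losses, [2.0, 0.0], reg_value=0.0, lam=0.1)
```

With all weights fixed to one the objective reduces to the ordinary average loss, while a teacher that shifts weight toward the first sample changes which sample dominates the student's update.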

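Obtaining $\omega^*(\theta)$: Since a deep neural network is highly non-convex, we are generally not able to get a closed-form solution of the $\omega^*(\theta)$ in Eqn.(1). We choose stochastic gradient descent (briefly, SGD) with momentum for optimization polyak1964some , which is an iterative algorithm. We use a subscript $t$ to denote the $t$-th step in optimization. $D_t$ is the data of the $t$-th minibatch, with $(x_{t,k}, y_{t,k})$ the $k$-th sample in it. For ease of reference, denote $w(D_t; \theta)$ as a column vector, where the $k$-th element is the weight for sample $(x_{t,k}, y_{t,k})$, and $\ell(D_t; \omega_t)$ is another column vector with the $k$-th element $\ell(f(x_{t,k}), y_{t,k}; \omega_t)$, both of which are defined in Eqn.(1). Following the implementation of PyTorch NEURIPS2019_9015 , the update rule of momentum SGD is:

$$
v_{t+1} = \mu v_t + g_t, \qquad \omega_{t+1} = \omega_t - \eta_t v_{t+1},
\tag{2}
$$

where $g_t = \frac{\partial}{\partial \omega_t} \Big[ \frac{1}{|D_t|} w(D_t; \theta)^\top \ell(D_t; \omega_t) + \lambda R(\omega_t) \Big]$. $\eta_t$ is the learning rate at the $t$-th step, and $\mu$ is the momentum coefficient. Assume we update the model for $K$ steps. We can eventually obtain $\omega_K$, which serves as the proxy for $\omega^*(\theta)$. To stabilize the training, we will set $\partial w(D_t; \theta) / \partial \omega_t = 0$.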
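The momentum SGD update in the PyTorch convention (velocity accumulates the raw gradient, the learning rate is applied when updating the parameters) can be sketched as follows. This is a plain-Python illustration on list-valued parameters, not the training code of the paper; the function name and the toy quadratic objective are assumptions.

```python
def momentum_sgd_step(omega, v, grad, lr, mu):
    """One step: v <- mu * v + grad, then omega <- omega - lr * v."""
    v_next = [mu * vi + gi for vi, gi in zip(v, grad)]
    omega_next = [wi - lr * vi for wi, vi in zip(omega, v_next)]
    return omega_next, v_next

# Toy usage: minimize 0.5 * omega^2, whose gradient is omega itself.
first_step = momentum_sgd_step([1.0], [0.0], [1.0], lr=0.1, mu=0.9)
omega, v = [1.0], [0.0]
for _ in range(200):
    omega, v = momentum_sgd_step(omega, v, omega, lr=0.1, mu=0.9)
```

Because each step is an invertible map of the pair (parameters, velocity) whenever the momentum coefficient is nonzero, the trajectory can later be walked backwards from the final state, which is what the reverse-mode computation relies on.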

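Calculating $\partial M(D_{valid}; \omega_K) / \partial \theta$: Motivated by reverse-mode differentiation maclaurin2015gradient , we use a recursive way to calculate gradients. For ease of reference, let $d\omega_t$ and $dv_t$ denote $\partial M(D_{valid}; \omega_K) / \partial \omega_t$ and $\partial M(D_{valid}; \omega_K) / \partial v_t$ respectively, where $\omega_t$ and $v_t$ are the student parameters and momentum at step $t$ of the momentum SGD update $v_{t+1} = \mu v_t + g_t$, $\omega_{t+1} = \omega_t - \eta_t v_{t+1}$ in Eqn.(2), and $g_t$ is the gradient of the weighted minibatch loss w.r.t. $\omega_t$. According to the chain rule, for any $t \in \{0, 1, \dots, K-1\}$, we have

$$
\begin{aligned}
d\omega_t &= d\omega_{t+1} + \Big(\frac{\partial g_t}{\partial \omega_t}\Big)^{\top} \big(dv_{t+1} - \eta_t\, d\omega_{t+1}\big), \\
dv_t &= \mu \big(dv_{t+1} - \eta_t\, d\omega_{t+1}\big), \\
\frac{\partial M(D_{valid}; \omega_K)}{\partial \theta} &= \sum_{t=0}^{K-1} \Big(\frac{\partial g_t}{\partial \theta}\Big)^{\top} \big(dv_{t+1} - \eta_t\, d\omega_{t+1}\big),
\end{aligned}
\tag{3}
$$

with $d\omega_K = \partial M(D_{valid}; \omega_K) / \partial \omega_K$ and $dv_K = 0$.

According to Eqn.(3), we can design Algorithm 1 to calculate the gradients of the teacher model.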

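1 Input: Teacher model backpropagation interval $B$; parameters and momentum of the student model $\omega_K$ and $v_K$; learning rates $\{\eta_t\}$; momentum coefficient $\mu$ ($\mu > 0$); minibatches of data $\{D_t\}$;
2 Initialization: $d\theta = 0$; $d\omega_K = \partial M(D_{valid}; \omega_K) / \partial \omega_K$; $dv_K = 0$;
3 for $t = K-1, K-2, \dots, K-B$ do
4       $\omega_t = \omega_{t+1} + \eta_t v_{t+1}$; $g_t = \frac{\partial}{\partial \omega_t} \big[ \frac{1}{|D_t|} w(D_t; \theta)^\top \ell(D_t; \omega_t) + \lambda R(\omega_t) \big]$; $v_t = (v_{t+1} - g_t) / \mu$;
5       $d\theta = d\theta + (\partial g_t / \partial \theta)^\top (dv_{t+1} - \eta_t\, d\omega_{t+1})$; $d\omega_t = d\omega_{t+1} + (\partial g_t / \partial \omega_t)^\top (dv_{t+1} - \eta_t\, d\omega_{t+1})$; $dv_t = \mu (dv_{t+1} - \eta_t\, d\omega_{t+1})$;
Return $d\theta$.
Algorithm 1 The gradients of the validation metric w.r.t. the parameters of the teacher.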
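The backward recursion through momentum SGD can be checked on a toy problem where everything is a scalar. The sketch below is an illustration under stated assumptions, not the authors' implementation: the student is a single parameter with squared loss, the teacher weight is a sigmoid of a hypothetical fixed feature `c` scaled by the teacher parameter (so the weights do not depend on the student parameter), the learning rate is constant, and the metric is the negative squared validation error. The result is compared with a central finite difference.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_terms(omega, theta, batch, lam):
    """Return g and its partial derivatives w.r.t. omega and theta (all scalars)."""
    n = len(batch)
    g, dg_domega, dg_dtheta = lam * omega, lam, 0.0
    for c, y in batch:
        w = sigmoid(theta * c)                      # teacher weight of the sample
        g += w * (omega - y) / n                    # gradient of w * 0.5 * (omega - y)^2
        dg_domega += w / n
        dg_dtheta += w * (1.0 - w) * c * (omega - y) / n
    return g, dg_domega, dg_dtheta

def train(theta, batches, omega, lr, mu, lam):
    """Forward pass: momentum SGD on the weighted loss; returns final omega and v."""
    v = 0.0
    for batch in batches:
        g, _, _ = grad_terms(omega, theta, batch, lam)
        v = mu * v + g
        omega = omega - lr * v
    return omega, v

def metric(omega, y_valid):
    return -0.5 * (omega - y_valid) ** 2

def teacher_gradient(theta, batches, omega, v, lr, mu, lam, y_valid, interval):
    """Walk the trajectory backwards from (omega_K, v_K), accumulating d_theta."""
    d_theta, d_omega, d_v = 0.0, -(omega - y_valid), 0.0
    last = len(batches)
    for t in range(last - 1, last - 1 - interval, -1):
        omega = omega + lr * v                      # recover omega_t
        g, dg_domega, dg_dtheta = grad_terms(omega, theta, batches[t], lam)
        v = (v - g) / mu                            # recover v_t (needs mu != 0)
        u = d_v - lr * d_omega
        d_theta += dg_dtheta * u
        d_omega += dg_domega * u
        d_v = mu * u
    return d_theta

batches = [[(1.0, 2.0), (-1.0, 0.0)], [(0.5, 1.0), (2.0, 3.0)], [(-0.5, 1.5), (1.5, 2.5)]]
theta, lr, mu, lam, y_valid = 0.3, 0.1, 0.9, 0.01, 2.0
omega_K, v_K = train(theta, batches, 0.0, lr, mu, lam)
analytic = teacher_gradient(theta, batches, omega_K, v_K, lr, mu, lam, y_valid, len(batches))
eps = 1e-5
numeric = (metric(train(theta + eps, batches, 0.0, lr, mu, lam)[0], y_valid)
           - metric(train(theta - eps, batches, 0.0, lr, mu, lam)[0], y_valid)) / (2 * eps)
```

Only the final parameters and momentum are passed to `teacher_gradient`; earlier states are recomputed on the way back. Passing an `interval` smaller than the number of steps truncates the recursion and yields an approximation rather than the exact gradient.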

In Algorithm 1, we can see that we need a backpropagation interval $B$ as an input, indicating how many internal $\omega_t$'s are used to calculate the gradients of the teacher. When $B = K$, all student models on the optimization trajectory will be leveraged. $B$ balances the tradeoff between efficiency and accuracy. To use Algorithm 1, we require $\mu > 0$.

As shown in step 2, we first calculate $\partial M(D_{valid}; \omega_K) / \partial \omega_K$, with which we can initialize $d\theta$, $d\omega_K$ and $dv_K$. We then recover $\omega_t$, $v_t$ and the gradients at the previous step (see step 4). Based on Eqn.(3), we recover the corresponding $d\theta$, $d\omega_t$ and $dv_t$. We repeat step 4 and step 5 until getting the eventual $d\theta$, which is the gradient of the validation metric w.r.t. the parameters of the teacher model. Finally, we can leverage any gradient-based algorithm to update the teacher model. With the new teacher model, we can iteratively update the student parameters $\omega$ and the teacher parameters $\theta$ until reaching the stopping criteria.

In order to avoid calculating the Hessian matrix, which needs to store $O(|\omega|^2)$ parameters, we leverage the property that $\frac{\partial^2 \ell}{\partial \omega^2} u = \frac{\partial}{\partial \omega} \Big( \big(\frac{\partial \ell}{\partial \omega}\big)^{\top} u \Big)$, where $\ell$ is the loss function related to $\omega$, $u$ is a vector with size $|\omega| \times 1$, and $\partial u / \partial \omega = 0$. With this trick, we only require $O(|\omega|)$ GPU memory.
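The point of the Hessian-vector trick is that the product of the Hessian with a vector can be obtained without ever materializing the Hessian. In an autodiff framework this is done by differentiating the inner product of the gradient with the vector; the sketch below swaps in a different technique, a central finite difference of the gradient, purely to illustrate the same memory argument in plain Python. The toy loss and all names are assumptions.

```python
def grad(omega):
    """Gradient of the toy loss 0.25 * (w0^4 + w1^4) + w0 * w1."""
    w0, w1 = omega
    return [w0 ** 3 + w1, w1 ** 3 + w0]

def hvp(grad_fn, omega, u, eps=1e-4):
    """Approximate (Hessian at omega) times u using two gradient evaluations."""
    plus = grad_fn([w + eps * ui for w, ui in zip(omega, u)])
    minus = grad_fn([w - eps * ui for w, ui in zip(omega, u)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]

omega, u = [1.0, 2.0], [0.5, -1.0]
approx = hvp(grad, omega, u)
# Explicit Hessian of the toy loss, used only for checking: [[3*w0^2, 1], [1, 3*w1^2]].
exact = [3 * omega[0] ** 2 * u[0] + u[1], u[0] + 3 * omega[1] ** 2 * u[1]]
```

Both routes only ever hold vectors of the same size as the parameters, which is why the memory cost stays linear in the number of parameters.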

Discussions: Compared to previous work jiang2018mentornet ; metaweightnet ; fan2018dataTeach ; wu2018lossTeach , besides the key difference that we use internal states as features, there are some differences in optimization. In fan2018dataTeach , the teacher is learned in a reinforcement learning manner, which is relatively hard to optimize. In wu2018lossTeach , the student model is optimized with vanilla SGD, for which all the intermediate student parameters have to be stored. In our algorithm, we use momentum SGD, where we only need to store the final student parameters and momentum, from which we can recover all intermediate parameters. We will study how to effectively apply our derivations to more optimizers and more applications in the future.

3.3 Teacher model

We introduce the default network architecture of the teacher model used in experiments. We use a linear model with sigmoid activation. Given a pair of input and label, we first use the feature extractor of the student to extract the output of the second-to-last layer. The surface feature we choose is the one-hot representation of the label. The weight of the data is then the sigmoid of a learned linear function of the extracted feature and the one-hot label, where the linear weights on the internal states, the matrix applied to the one-hot label, and the bias are the parameters to be learned. The matrix applied to the one-hot label can be regarded as an embedding matrix, which enriches the representations of the labels. One can easily extend the teacher model to a multi-layer feed-forward network by replacing the linear map with a deeper network.

We need to normalize the weights within a minibatch. When a minibatch comes, after calculating the weight for each sample in it, the weight is normalized by the sum of all weights in the minibatch. This ensures that the sum of weights within a batch is always the same constant.
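A linear teacher with a sigmoid activation and per-minibatch normalization can be sketched as below. The exact parameterization is an assumption made for illustration: `W` is a hypothetical weight vector on the internal state, `E` holds one scalar per class (the one-hot label simply selects an entry), `b` is a bias, and the normalization divides by the batch sum so that the weights add up to one.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def teacher_weight(h, label, W, E, b):
    """Unnormalized weight: sigmoid(W . h + E[label] + b)."""
    score = sum(w * hi for w, hi in zip(W, h)) + E[label] + b
    return sigmoid(score)

def normalize(weights):
    """Divide every weight by the sum over the minibatch."""
    total = sum(weights)
    return [w / total for w in weights]

h_batch = [[0.2, -0.1], [1.0, 0.5], [-0.3, 0.8]]   # internal states of 3 samples
labels = [0, 2, 1]
W, E, b = [0.5, -0.25], [0.1, -0.2, 0.3], 0.05
raw = [teacher_weight(h, y, W, E, b) for h, y in zip(h_batch, labels)]
batch_weights = normalize(raw)
```

The sigmoid keeps every raw weight strictly positive, so the division in `normalize` is always well defined.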

4 Experiments on CIFAR-10/100 with clean labels

In this section, we conduct experiments on CIFAR-10 and CIFAR-100 image classification with clean labels. We first show the overall results and then provide several analyses.

4.1 Settings

There are 50,000 and 10,000 images in the training and test sets. CIFAR-10 and CIFAR-100 are a 10-class and a 100-class classification task respectively. We hold out a subset of the training dataset as the validation set and use the remaining samples as the training set. Following he2016Resnet , we use momentum SGD and divide the learning rate by a constant factor at two fixed epochs. The two interval hyperparameters of Algorithm 1 are set to the default values studied in Appendix A. We train the models long enough to ensure convergence. We conduct experiments on ResNet-32, ResNet-110 and Wide ResNet-28-10 (WRN-28-10) BMVC2016_87 . All the models are trained on a single P40 GPU.

We compare the results with the following baselines: (1) Data teaching fan2018dataTeach and loss function teaching wu2018lossTeach , denoted as L2T-data and L2T-loss respectively. (2) Focal loss lin2017focal , where each sample is weighted by $(1 - p)^{\gamma}$, $p$ is the probability that the sample is correctly classified, and $\gamma$ is a hyperparameter. We search $\gamma$ over the values suggested by lin2017focal . (3) Self-paced learning (SPL) Kumar2010SPL , where we start from easy samples first and then move to harder examples.
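The two hand-designed weighting baselines can be contrasted in a few lines. This is a minimal sketch: the focal weight follows the $(1 - p)^{\gamma}$ form, while the hard-threshold rule for self-paced learning (weight one below the threshold, zero otherwise) and the function names are assumptions made for illustration.

```python
def focal_weight(p_correct, gamma):
    """Focal-loss style weight: small for confidently correct samples."""
    return (1.0 - p_correct) ** gamma

def spl_weight(loss, threshold):
    """Self-paced style binary weight: keep only samples easier than the threshold."""
    return 1.0 if loss < threshold else 0.0

easy_focal = focal_weight(0.9, 2.0)
hard_focal = focal_weight(0.1, 2.0)
early_stage = [spl_weight(l, 0.5) for l in [0.3, 0.7, 1.2]]
late_stage = [spl_weight(l, 1.0) for l in [0.3, 0.7, 1.2]]
```

Focal weighting emphasizes hard samples, while self-paced learning starts from easy ones and admits harder samples as the threshold grows; both depend only on the loss or predicted probability, not on the internal states of the student.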

4.2 Results

The test error rates of different settings are reported in Table 1. On CIFAR-10, our method obtains the lowest test error rates among all listed algorithms with ResNet-32, ResNet-110 and WRN-28-10. On CIFAR-100, our approach also improves the baseline for all three architectures. These consistent improvements demonstrate the effectiveness of our method. We have the following observations: (1) L2T-data is proposed to speed up the training, so its error rates are almost the same as the baselines. (2) L2T-loss improves over the baseline on both CIFAR-10 and CIFAR-100, but its improvements are far behind those of our method. This shows the advantage of our method over previous learning to teach algorithms. (3) Focal loss sets the weights of the data according to hardness only and does not leverage internal states either. There is a non-negligible gap between focal loss and our method. (4) For SPL, the results are similar to (or even worse than) the baseline. This shows the importance of a learning-based scheme for data selection.

Table 1: Results (test error rates) on CIFAR-10/CIFAR-100. The labels are clean. Columns: Baseline, L2T-data fan2018dataTeach , L2T-loss wu2018lossTeach , Focal loss, SPL Kumar2010SPL , Ours. Rows for each dataset: ResNet-32, ResNet-110, WRN-28-10; two entries in each of the ResNet-110 and WRN-28-10 rows are N/A.

4.3 Analysis

To further verify how our method works, we conduct several ablation studies. All experiments are conducted on CIFAR-10 with ResNet-32.

Comparison with surface information: The features of the teacher model used in Table 1 are the output of the second-to-last layer of the network (the internal states) and the label embedding. Based on metaweightnet ; ren2018learning ; wu2018lossTeach ; fan2018dataTeach , we define another group of features about surface information. Five components are included: the training iteration (normalized by the total number of iterations), the average training loss until the current iteration, the best validation accuracy until the current iteration, the predicted label of the current input, and the margin values. We refer to these as the surface features.

For the teacher model, we try different combinations of the internal states and surface features. The settings and results are shown in Table 2.

Table 2: Ablation study on the usage of features (error rate of each feature combination).

As shown in Table 2, the results of using surface features only (i.e., the settings without internal states) cannot catch up with those using the internal states of the network. This shows the effectiveness of the internal states for learning to teach. Three of the settings show no significant differences from one another. Using only one of the feature groups can yield less improvement than a combined setting. Combining all three groups of features also slightly hurts the result. Therefore, we choose the internal states together with the label embedding as the default setting.

Internal states from different levels: By default, we use the output of the second-to-last layer as the features of internal states. We also try several other variants, namely the outputs of the last convolutional layers at three different feature-map sizes, where a larger index means that the corresponding features are closer to the raw input. Results are reported in Table 3. We can see that leveraging internal states achieves lower test error rates than not using such features. Currently, there is no significant difference regarding where the internal states come from. Therefore, by default, we recommend using the states from the second-to-last layer.

Table 3: Features from different levels (error rate per setting).

Architectures of the teacher models: We explore teacher networks with different numbers of hidden layers. Each hidden layer is followed by a ReLU activation (denoted as MLP-#layer). The dimension of the hidden states is the same as the input. Results are in Table 4. Using a more complex teacher model does not bring improvement over the simplest one used in the default setting. Our conjecture is that more complex models are harder to optimize and therefore cannot provide accurate signals for the student models.

Table 4: Teacher with various hidden layers (error rates for MLP-0, MLP-1 and MLP-2).

Figure 2: Visualization of weights and loss values. Each row has three panels: weight-loss curve ((a), (d)), internal features ((b), (e)), and weight w.r.t. classes ((c), (f)).

Analysis on the weights: We compare the weights output by the teacher model leveraging surface features only with those output by our teacher leveraging internal features. The results are shown in Figure 2, where the top row shows the results of the surface-feature teacher and the bottom row those of our teacher. In Figure 2(a), (b), (d), (e), the data points of the same category are painted with the same color. The first column shows the correlation between the output data weight and the training loss; the second column visualizes the internal states through t-SNE maaten2008visualizing ; the third column plots heatmaps of the output weights of all data points (red means a larger weight and blue a smaller one), in accordance with those in the second column. We have the following observations:

(1) As shown in the first column, the surface-feature teacher tries to assign lower weights to the data with higher loss, regardless of the category the image belongs to. In contrast, the weights set by our teacher rely heavily on the category information. For example, the data points of the “cat” and “dog” classes in CIFAR-10 have the highest weights regardless of the training loss.

(2) To further investigate why the data of the cat class and the dog class are assigned larger weights by our teacher, we turn to Figure 2(e), from which we can find that the internal states of dog and cat overlap heavily. We therefore hypothesize that, since dogs and cats are somewhat similar to each other, our teacher learns to separate these two classes by assigning large weights to them. This phenomenon is not observed for the surface-feature teacher.

Preliminary exploration on deeper interactions: To stabilize training, we do not backpropagate the gradients to the student model via the weights, i.e., the derivative of the teacher's output weights w.r.t. the student parameters is set as zero. If we enable this gradient, the teacher model will have another path to pass the supervision signal to the student model, which has great potential to improve the performance. We quickly verify this variant with ResNet-32. We find that with this technique, we can further lower the test error rate compared to the current methods. We will further explore this direction in the future.

5 Experiments on CIFAR-10/100 with noisy labels

To verify the ability of our proposed method to deal with the noisy data, we conduct several experiments on CIFAR-10/100 datasets with noisy labels.

5.1 Settings

We derive most of the settings from metaweightnet . The images remain the same as those in standard CIFAR-10/100, but then we introduce noise to the labels. This is to verify the effectiveness of an algorithm under the noisy setting. Two types of noise, the uniform noise and flip noise, will be introduced. For the validation and test sets, both the images and the labels are clean.

  1. Uniform noise: We follow a common setting from zhang2017understanding . The label of each image is uniformly mapped to a random class with probability $p$. In our experiments, we evaluate two values of $p$. Following metaweightnet , the network architecture of the student network is WRN-28-10. We use momentum SGD and divide the learning rate by a constant factor at two fixed epochs.

  2. Flip noise: We follow metaweightnet to set flip noise. The label of each image is independently flipped to one of two similar classes with probability $p$. The two similar classes are randomly chosen, and we flip labels to them with equal probability. In our experiments, we evaluate two values of $p$ and adopt ResNet-32 as the student model. We use momentum SGD and divide the learning rate by a constant factor at two fixed epochs.
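The two label-noise protocols described above can be sketched as follows. This is a minimal illustration, not the data pipeline used in the experiments: the helper `make_similar_classes` is hypothetical, the uniform variant may redraw the true class (an assumption), and the noise probabilities in the usage lines are arbitrary.

```python
import random

def uniform_noise(labels, num_classes, p, rng):
    """With probability p, replace a label by a class drawn uniformly at random."""
    return [rng.randrange(num_classes) if rng.random() < p else y for y in labels]

def make_similar_classes(num_classes, rng):
    """Hypothetical helper: pick two random other classes as 'similar' for each class."""
    return {c: rng.sample([k for k in range(num_classes) if k != c], 2)
            for c in range(num_classes)}

def flip_noise(labels, similar, p, rng):
    """With probability p, flip a label to one of its two similar classes (equal odds)."""
    return [rng.choice(similar[y]) if rng.random() < p else y for y in labels]

rng = random.Random(0)
labels = [i % 10 for i in range(100)]
similar = make_similar_classes(10, rng)
clean_copy = uniform_noise(labels, 10, 0.0, rng)
uniform_half = uniform_noise(labels, 10, 0.5, rng)
all_flipped = flip_noise(labels, similar, 1.0, rng)
```

Only the training labels would be corrupted this way; the validation and test labels stay clean, so the validation signal that drives the teacher remains reliable.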

For the teacher model, we follow settings in Section 4. We compare the results with MentorNet jiang2018mentornet and Meta-Weight-Net metaweightnet .

5.2 Results

The results are shown in Table 5 and Table 6. We can see that our results are better than the previous baselines like MentorNet and Meta-Weight-Net, regardless of the type and magnitude of the noise. When the noise type is uniform, we improve over Meta-Weight-Net. On flip noise with the ResNet-32 network, the improvement is more significant: in most cases, we improve the baseline by more than one point.

The experiment results demonstrate that leveraging internal states is also useful for the datasets with noisy labels. This shows the generality of our proposed method.

Table 5: Results of WRN-28-10 on CIFAR-10/100 with uniform label noise, compared with MentorNet jiang2018mentornet and Meta-Weight-Net metaweightnet .
Table 6: Results of ResNet-32 on CIFAR-10/100 with flip label noise, compared with MentorNet jiang2018mentornet and Meta-Weight-Net metaweightnet .

6 Conclusion and future work

We propose a new data teaching paradigm, where the teacher and student model have deep interactions. The internal states are fed into the teacher model to calculate the weights of the data, and we propose an algorithm to jointly optimize the two models. Experiments on CIFAR-10 and CIFAR-100 with clean and noisy labels demonstrate the effectiveness of our approach. Rich ablation studies are conducted in this work.

For future work, we first plan to study how to apply deeper interactions to the learning to teach framework (preliminary results are in Section 4.3). Second, we want the teacher model to be transferable across different tasks, which the current teacher lacks (see Appendix B for the exploration). Third, we will carry out a theoretical analysis of the convergence of the optimization algorithm.


Appendix A Ablation study on the update and backpropagation intervals

In this section, we conduct an ablation study to explore the impact of different model update intervals and backpropagation intervals on our algorithm. We adopt ResNet-32 as the student model. Table 1 in the main paper uses our default setting of the two intervals.

Table 7: Test error rates for different model update intervals and backpropagation intervals with student model ResNet-32.

The ablation study results are reported in Table 7. We can observe that 1) the setting that runs backpropagation at each step has a high computational cost and makes the teacher model hard to optimize; 2) our default setting reaches the lowest test error rate among all settings.

Appendix B Transferability of the teacher across different tasks

In this section, we conduct some experiments to explore the transferability of our teacher models across different tasks.

We choose our best ResNet-32 setting in Table 1 as the original teacher model, and adopt two transfer settings.

(1) Transfer to different dataset: We transfer our original teacher model from one CIFAR dataset to the other. The network architecture of the student model remains unchanged.

(2) Transfer to different student model: We change the student model architecture from ResNet-32 to ResNet-110. The dataset remains unchanged.

In the above two settings, we train the student models from scratch and fix the parameters of the teacher models. The teacher models provide weights for the input data.

The test error rates of Transfer to different dataset and Transfer to different student model show that our teacher models lack transferability, due to the deep interactions between the teacher and the student models. We will improve our algorithm to enhance the transferability in the future.


  • (1) Gustavo Aguilar, Yuan Ling, Yu Zhang, Benjamin Yao, Xing Fan, and Edward Guo. Knowledge distillation from internal representations. AAAI, 2020.
  • (2) Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6077–6086, 2018.
  • (3) Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning, 2(4):343–370, 1988.
  • (4) Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48, 2009.
  • (5) Qi Dong, Shaogang Gong, and Xiatian Zhu. Class rectification hard mining for imbalanced deep learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 1851–1860, 2017.
  • (6) Yang Fan, Fei Tian, Tao Qin, Xiang-Yang Li, and Tie-Yan Liu. Learning to teach. In Sixth International Conference on Learning Representations, 2018.
  • (7) Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci, 55:119–139, 1997.
  • (8) Jerome Friedman, Trevor Hastie, Robert Tibshirani, et al. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The annals of statistics, 28(2):337–407, 2000.
  • (9) Chen Gong, Dacheng Tao, Wei Liu, Liu Liu, and Jie Yang. Label propagation via teaching-to-learn and learning-to-teach. IEEE transactions on neural networks and learning systems, 28(6):1452–1465, 2016.
  • (10) Chen Gong, Dacheng Tao, Jie Yang, and Wei Liu. Teaching-to-learn and learning-to-teach for multi-label propagation. In Thirtieth AAAI conference on artificial intelligence, 2016.
  • (11) Trevor Hastie, Saharon Rosset, Ji Zhu, and Hui Zou. Multi-class adaboost. Statistics and its Interface, 2(3):349–360, 2009.
  • (12) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • (13) Mark K Ho, Michael Littman, James MacGlashan, Fiery Cushman, and Joseph L Austerweil. Showing versus doing: Teaching by demonstration. In Advances in neural information processing systems, pages 3027–3035, 2016.
  • (14) Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In Thirty-fifth International Conference on Machine Learning, 2018.
  • (15) S. H. Khan, M. Hayat, M. Bennamoun, F. A. Sohel, and R. Togneri. Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Transactions on Neural Networks and Learning Systems, 29(8):3573–3587, 2018.
  • (16) Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1885–1894. JMLR. org, 2017.
  • (17) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • (18) M. P. Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1189–1197. Curran Associates, Inc., 2010.
  • (19) Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
  • (20) Ji Liu and Xiaojin Zhu. The teaching dimension of linear learners. Journal of Machine Learning Research, 17(162):1–25, 2016.
  • (21) Weiyang Liu, Bo Dai, Ahmad Humayun, Charlene Tay, Chen Yu, Linda B Smith, James M Rehg, and Le Song. Iterative machine teaching. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2149–2158. JMLR.org, 2017.
  • (22) Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
  • (23) Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pages 2113–2122, 2015.
  • (24) Tomasz Malisiewicz, Abhinav Gupta, and Alexei A Efros. Ensemble of exemplar-svms for object detection and beyond. In 2011 International conference on computer vision, pages 89–96. IEEE, 2011.
  • (25) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. 2019.
  • (26) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana, June 2018. Association for Computational Linguistics.
  • (27) Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
  • (28) Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596, 2014.
  • (29) Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In Thirty-fifth International Conference on Machine Learning, 2018.
  • (30) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 91–99. Curran Associates, Inc., 2015.
  • (31) Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. ICLR, 2015.
  • (32) Robert E Schapire, Yoav Freund, Peter Bartlett, Wee Sun Lee, et al. Boosting the margin: A new explanation for the effectiveness of voting methods. The annals of statistics, 26(5):1651–1686, 1998.
  • (33) Patrick Shafto, Noah D Goodman, and Thomas L Griffiths. A rational account of pedagogical reasoning: Teaching by, and learning from, examples. Cognitive psychology, 71:55–89, 2014.
  • (34) Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, and Deyu Meng. Meta-weight-net: Learning an explicit mapping for sample weighting. In Advances in Neural Information Processing Systems 32, pages 1919–1930. Curran Associates, Inc., 2019.
  • (35) Sainbayar Sukhbaatar and Rob Fergus. Learning from noisy labels with deep neural networks. arXiv preprint arXiv:1406.2080, 2014.
  • (36) Yanmin Sun, Mohamed S. Kamel, Andrew K.C. Wong, and Yang Wang. Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 40(12):3358–3378, 2007.
  • (37) Lijun Wu, Fei Tian, Yingce Xia, Yang Fan, Tao Qin, Jian-Huang Lai, and Tie-Yan Liu. Learning to teach with dynamic loss functions. In Advances in Neural Information Processing Systems 31, pages 6466–6477, 2018.
  • (38) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015.
  • (39) Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the twenty-first international conference on Machine learning, page 114, 2004.
  • (40) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Richard C. Wilson, Edwin R. Hancock, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.
  • (41) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
  • (42) Xiaojin Zhu. Machine teaching: An inverse problem to machine learning and an approach toward optimal education. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.