1 Introduction
Our ability to train increasingly deep and increasingly large neural networks has led to substantial progress in AI over the past decade, and a number of techniques have been proposed to address challenges such as overfitting and the vanishing gradient problem. In recent years, several works have considered the Teacher–Student training paradigm, based on the idea of distilling knowledge from teacher models to guide the optimization of a student model [2, 1, 8, 3, 19]. The original motivation for this framework was to teach a small model to mimic the behavior of a larger model so as to speed up inference and reduce the model size, all while retaining the result quality of the original model. Subsequent work adopted this framework to improve the effectiveness of a student model with the same architecture as the teacher model [17, 5]. This is achieved by first training a teacher model and then training a student model with identical architecture but differently initialized parameters, supervised by both the ground truth and the teacher’s knowledge. Beyond learning from a single teacher, some studies have shown that learning from multiple teachers yields a better student [18, 12]. Instead of this costly two-stage process, recent work has considered Teacher–Student optimization in a single generation [11, 9, 16]. The core idea is to use information from previous training updates of the current model as teacher signals for later training steps of the same neural network within a single generation.
It has been shown that both teacher–student differences and the quality of the teacher are very important in Teacher–Student optimization [16]. If student and teacher are very similar, it is impossible for the former to learn from the latter. If the teacher exhibits poor performance, it may introduce noise that confuses the student. However, it is difficult to guarantee both teacher–student differences and the quality of the teacher in a single generation. During the course of training, the predictive quality of the model is expected to improve steadily, and thus a high-quality teacher ought to be a fairly recent one, while a dissimilar teacher should rather be far from the student. Previous works rely on just a single teacher, making it hard to simultaneously satisfy these two opposing principles.
In this paper, we propose a novel training regime named Long Short-Term Sample Distillation (LSTSD), which instead draws on numerous teachers and better leverages knowledge from previous training. In particular, the method decomposes the history of training updates into long-term knowledge and short-term knowledge to guarantee teacher–student differences while simultaneously ensuring a high-quality teacher. LSTSD divides the training process into several mini-generations, each of which consists of several training epochs, and each training sample is always guided by two teachers: a long-term teacher and a short-term one. The long-term teacher signal comes from the previous mini-generation and remains fixed during the course of a mini-generation, so as to provide a steadfast teacher signal and guarantee teacher–student differences. The short-term teacher, in contrast, comes from the previous epoch and changes at every epoch, so as to provide more up-to-date signals that are likely to be of higher quality. Additionally, motivated by You et al. [18], we conjecture that learning from numerous snapshots of the previous training process leads to a better model. In our method, teacher signals for each sample come from different snapshots in the previous training process, and thus the model learns from a very diverse set of teachers at the same time.
Specifically, in each epoch, we save the probability distribution produced by the corresponding snapshot for each sample when it is selected as training data to update the neural network. This serves as the short-term teacher in the next epoch and is refreshed at every epoch. In addition, in the last epoch of a mini-generation, we save the probability distribution produced by the corresponding snapshot for each sample when it is selected to update the neural network. This serves as the long-term teacher for the same sample when it is selected to update the model in the next mini-generation, and remains fixed within that mini-generation.
We conducted experiments across a range of different vision and NLP tasks with a diverse set of neural network architectures to verify the effectiveness and generalization ability of LSTSD. The experimental results demonstrate that LSTSD can improve the performance significantly and can generalize to many different tasks.
2 Related Work
In recent years, important advances in artificial intelligence have arisen simply from our ability to train models with more layers and parameters. To address the computational overhead of larger models, techniques such as deep compression [6] have been proposed. To address the optimization challenges of training increasingly deep neural networks, a number of techniques have been proposed as well. For instance, residual networks [7] were proposed to alleviate the problem of vanishing gradients, and dropout [13] was proposed to reduce overfitting.
In recent years, the Teacher–Student framework has shown great potential for accelerating inference and improving the performance of neural networks. In this framework, the target model is supervised not only by the ground truth, but also by signals from a teacher model, which aims to help optimize the target model. The Teacher–Student framework was originally proposed to distill knowledge from a large teacher model and guide the training of a small student model, such that the small student model can approximate the result quality of the large model while allowing for inference on resource-constrained devices such as cellphones. In their pioneering work, Buciluă et al. [2] proposed to distill an ensemble of neural networks into a small neural network to accelerate the model. In many subsequent works, the student model was taught to mimic the behavior of the teacher model by approximating the output or the internal state of the pretrained teacher model. For instance, in Hinton et al. [8], the student model was trained not only to predict the ground truth label accurately, but also to produce a softmax distribution matching that of the teacher model as closely as possible. Instead of mimicking the output of the teacher model, Romero et al. (FitNets) proposed a method in which the student mimics the hidden layers of the teacher model.
Besides distilling a large teacher into a small student for accelerated inference, subsequent studies have found that distilling a teacher into a student model of identical architecture also shows promise. In Yim et al. [17], a student model achieved faster convergence and greater accuracy by matching its hidden layers with those of a teacher model with identical architecture. Furlanello et al. [5] proposed born-again networks, in which a reinitialized student learns from a pretrained teacher of identical architecture, achieving better performance. Beyond learning from a single teacher, You et al. [18] showed that learning from multiple teachers leads to a better student. In their work, multiple teachers are combined via a voting strategy, and the student is required to mimic both the internal layers and the outputs of multiple teachers.
All of the aforementioned Teacher–Student methods divide the overall training process into multiple generations: the teacher and the student generations. In the teacher generation, a teacher model is pretrained, while in the student generation, a student model is trained, supervised by the pretrained teacher model. This training regime, however, entails an additional computational burden, because a series of models needs to be optimized one by one. To reduce the extra computational overhead, several methods have been proposed to implement Teacher–Student optimization in one single generation. In these methods, information distilled from the previous training process serves as a teacher signal for subsequent training of the same generation. Tarvainen and Valpola [14] proposed the Mean Teacher approach, in which the moving average of the parameters of all snapshots in the previous training process is used as a teacher for later training of the same generation. Yang et al. [16] proposed Snapshot Distillation, in which a training generation is divided into several mini-generations. During the training of each mini-generation, the parameters of the last snapshot model in the previous mini-generation serve as a teacher model. In Temporal Ensembles [11], for each sample, the teacher signal is the moving average of the probabilities produced by the snapshots when the sample was selected as training data in all previous epochs.
In this work, we propose Long Short-Term Sample Distillation to obtain better sample-level Teacher–Student optimization in one generation. With Long Short-Term Sample Distillation, the teacher signal comes from two teachers: a long-term teacher and a short-term one. The long-term teacher comes from the previous mini-generation and remains fixed throughout the next mini-generation, aiming to provide a stable teacher signal and guarantee teacher–student differences. The short-term teacher comes from the previous epoch and remains fixed only within the next epoch, aiming to provide a more up-to-date teacher signal that guarantees teacher quality. It is worth mentioning that the teacher signals of each sample are produced by the snapshot at the time the sample was selected as training data. Thus, each sample has unique teachers, enabling the model to learn from numerous teachers at the same time.
3 Method
In this section, we introduce our proposed Long Short-Term Sample Distillation (LSTSD) in detail. For background, we shall first briefly review Mini-Batch SGD Optimization, Teacher–Student Optimization, and One-Generation Teacher–Student Optimization. Subsequently, we will describe our novel Long Short-Term Sample Distillation approach.
3.1 Mini-Batch SGD Optimization
Consider a classification problem optimized with mini-batch SGD. We have a training dataset $\mathcal{D}$ consisting of samples $x$ and labels $y$. Our goal is to find a function $f(x;\theta)$ that generalizes well to unseen data, where $f$ is often a deep neural network parameterized by $\theta$. One of the most widely used methods to learn $\theta$ is to minimize the cross-entropy between the predicted probability distribution and the ground truth using mini-batch SGD.
Specifically, given a dataset $\mathcal{D}$ with $N$ samples $\{(x_n, y_n)\}_{n=1}^{N}$, we define the objective to optimize for $\theta$ as the cross-entropy, i.e.,
$$\mathcal{L}(\theta;\mathcal{D}) = -\frac{1}{N}\sum_{n=1}^{N}\log f_{y_n}(x_n;\theta),$$
where $f_{y_n}(x_n;\theta)$ denotes the probability of the label $y_n$ predicted by the neural network.
To find good local optima of $\mathcal{L}(\theta;\mathcal{D})$ that generalize well to unseen data, mini-batch SGD is usually invoked to minimize the objective. Specifically, in the $t$-th iteration, a mini-batch $\mathcal{B}_t$ is randomly sampled to train the model $f(x;\theta_t)$. First, we determine the objective of $\theta_t$ on $\mathcal{B}_t$,
$$\mathcal{L}(\theta_t;\mathcal{B}_t) = -\frac{1}{|\mathcal{B}_t|}\sum_{(x,y)\in\mathcal{B}_t}\log f_{y}(x;\theta_t).$$
Then, we compute the gradient of $\mathcal{L}(\theta_t;\mathcal{B}_t)$ with respect to $\theta_t$, and adjust each parameter in $\theta_t$ in the direction opposite to the gradient,
$$\theta_{t+1} = \theta_t - \gamma_t\,\nabla_{\theta_t}\mathcal{L}(\theta_t;\mathcal{B}_t),$$
where $\gamma_t$ denotes the learning rate. At this point, the $t$-th iteration of optimization is completed. We simply repeat this procedure until some predefined stopping criteria are fulfilled, in order to obtain the sought parameters $\theta$.
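As a minimal illustration of the procedure above, the following sketch runs mini-batch SGD with a cross-entropy objective on a softmax-regression model in plain Python. The linear model, the toy data, and all names (`predict`, `sgd_step`, `train`, `TOY_DATA`) are assumptions made for this example; they stand in for a deep network and are not the models used in the experiments.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, x):
    """Class probabilities of a linear model; weights[c] holds the weights of class c."""
    return softmax([sum(w * v for w, v in zip(row, x)) for row in weights])

def cross_entropy(weights, data):
    """Mean negative log-probability assigned to the true labels."""
    return -sum(math.log(predict(weights, x)[y]) for x, y in data) / len(data)

def sgd_step(weights, batch, lr):
    """One update: move the parameters against the gradient of the batch cross-entropy."""
    grads = [[0.0] * len(row) for row in weights]
    for x, y in batch:
        probs = predict(weights, x)
        for c, p in enumerate(probs):
            err = p - (1.0 if c == y else 0.0)  # derivative of the loss w.r.t. logit c
            for d, v in enumerate(x):
                grads[c][d] += err * v / len(batch)
    return [[w - lr * g for w, g in zip(row, grow)]
            for row, grow in zip(weights, grads)]

def train(data, num_classes, epochs=50, batch_size=2, lr=0.5, seed=0):
    """Shuffle the data every epoch and apply one SGD step per mini-batch."""
    rng = random.Random(seed)
    weights = [[0.0] * len(data[0][0]) for _ in range(num_classes)]
    for _ in range(epochs):
        order = data[:]
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            weights = sgd_step(weights, order[start:start + batch_size], lr)
    return weights

# Hypothetical toy data: (features, label); the last feature acts as a bias term.
TOY_DATA = [([1.0, 0.0, 1.0], 0), ([0.9, 0.1, 1.0], 0),
            ([0.0, 1.0, 1.0], 1), ([0.1, 0.9, 1.0], 1)]
```

Calling `train(TOY_DATA, num_classes=2)` returns the learned weights, and `cross_entropy` can be evaluated before and after training to confirm that the objective decreases.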
3.2 Teacher–Student Optimization
The process of SGD searches over the parameter space to find a $\theta$ that best fits the given dataset $\mathcal{D}$. However, as the depth of neural networks and the number of parameters increase, $f(x;\theta)$ often overfits $\mathcal{D}$. Guo et al. (2017) found that this may stem from the fact that the supervision is provided as one-hot vectors, which forces the network to overwhelmingly prefer the true class over all other classes. This is often not an optimal choice because rich information about class-level similarity is simply discarded.
One of the methods to address this issue is the Teacher–Student framework, where a teacher model provides complementary information to help the training of the student model. Specifically, the objective of the student model is now not only to predict the ground truth label of each sample correctly, but also to mimic the behavior of the teacher model. One way to mimic the teacher is to approximate the probability distribution produced by the teacher. This is usually achieved by adding an extra term to minimize the divergence between the probability distributions of the teacher model and student model. The loss function of the student model can be formulated as
$$\mathcal{L}(\theta^{S};\mathcal{D}) = \frac{1}{N}\sum_{n=1}^{N}\Big[-\log f_{y_n}(x_n;\theta^{S}) + \lambda\,\mathrm{KL}\big(f(x_n;\theta^{T})\,\|\,f(x_n;\theta^{S})\big)\Big],$$
where $f(x;\theta^{T})$ denotes the teacher network parameterized by $\theta^{T}$, $f(x;\theta^{S})$ denotes the student network parameterized by $\theta^{S}$, $\lambda$ weights the teacher term, and $\mathrm{KL}(\cdot\,\|\,\cdot)$ denotes the KL-divergence function measuring the divergence between the probability distributions of the teacher and student models. In the Teacher–Student framework, besides the one-hot vector of the ground truth label, the student model receives the probability distribution of the teacher model as an additional form of supervision that is much smoother than a one-hot vector and may mitigate the problem of overfitting.
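The student objective described above can be sketched for a single sample as follows. This is an illustrative sketch under stated assumptions: the direction of the divergence (teacher relative to student), the single weight `lam`, and the function names are choices made here for the example.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def teacher_student_loss(student_probs, teacher_probs, label, lam):
    """Cross-entropy on the true label plus a weighted divergence from the teacher."""
    ce = -math.log(student_probs[label])
    return ce + lam * kl_divergence(teacher_probs, student_probs)

# A student that already matches its teacher pays no extra penalty.
student = [0.7, 0.2, 0.1]
matching = teacher_student_loss(student, student, label=0, lam=1.0)
# A teacher that spreads more mass over the other classes adds a positive term.
smoother = teacher_student_loss(student, [0.5, 0.3, 0.2], label=0, lam=1.0)
```

Here `matching` equals the plain cross-entropy, while `smoother` is strictly larger, which is what pulls the student towards the teacher's softer distribution during training.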
In the flow chart of the Teacher–Student framework, the overall training process is usually divided into two generations: the teacher generation and the student generation, which train the teacher model and the student model, respectively. However, this adds computational cost to the training process. To alleviate this issue, several methods have been proposed to implement Teacher–Student Optimization in one generation, which we shall refer to as One-Generation Teacher–Student Optimization.
3.3 One-Generation Teacher–Student Optimization
In One-Generation Teacher–Student Optimization, there is no need to pretrain a distinct teacher model, as the teacher signals come from the previous training process of the same generation instead. Specifically, suppose that at the $t$-th step, a mini-batch $\mathcal{B}_t$ is sampled to train the model. For each sample $x$ in the mini-batch $\mathcal{B}_t$, the supervision signal contains the ground truth label $y$ and the probability distribution of $x$ produced by the teacher snapshot $\theta^{T}_{x}$,
$$\mathcal{L}(\theta_t;\mathcal{B}_t) = \frac{1}{|\mathcal{B}_t|}\sum_{(x,y)\in\mathcal{B}_t}\Big[-\log f_{y}(x;\theta_t) + \lambda\,\mathrm{KL}\big(f(x;\theta^{T}_{x})\,\|\,f(x;\theta_t)\big)\Big],$$
where $\theta_t$ denotes the parameters of the neural network at the $t$-th time step, $\lambda$ weights the teacher term, and $\theta^{T}_{x}$ denotes the parameters of the teacher snapshot of sample $x$, which is a snapshot model at some time step in the previous training process.
The key question for One-Generation Teacher–Student Optimization is how to choose the teacher snapshot for each sample: Should we use one teacher for all samples or unique teachers for each sample? Should we use a snapshot far in the past or near the present? In this work, we investigate these two problems and propose Long Short-Term Sample Distillation to obtain better One-Generation Teacher–Student Optimization.
3.4 Long Short-Term Sample Distillation
In our proposed LSTSD method, each sample has two unique teachers: a long-term teacher and a short-term one. The long-term teacher for a sample is the snapshot model at the time the sample was selected as training data in the last epoch of the previous mini-generation, and remains fixed in the next mini-generation. The short-term teacher for a sample is the snapshot model at the time the sample was selected as training data in the previous epoch, and is updated at every epoch.
Short-Term Teacher.
As illustrated in Figure 1, in the $j$-th epoch of the $k$-th mini-generation, the dataset is shuffled to ensure that samples are ordered randomly, which is denoted by $\mathcal{D}^{k,j}$, and the model is trained with mini-batches sampled from $\mathcal{D}^{k,j}$ sequentially. Suppose at the $t$-th step, data $x$ is selected as training data to update the parameters of the corresponding snapshot model $\theta^{k,j}_{t}$. Then, $\theta^{k,j}_{t}$ is used as the short-term teacher for $x$ in the $(j+1)$-th epoch. Instead of saving $\theta^{k,j}_{t}$, we maintain a short-term teacher vector to retain the probability distribution of $x$,
$$s^{k,j}_{x} = f(x;\theta^{k,j}_{t}),$$
where we use $s^{k,j}_{x}$ to denote the short-term teacher vector of $x$ for clarity. Storing the probability distribution instead of the parameters eliminates the extra computational cost entailed by calculating the probability repeatedly in the $(j+1)$-th epoch. After the $j$-th epoch of training has completed, the short-term teacher vector $s^{k,j}$, which contains knowledge of all snapshots in the $j$-th epoch, will be used as the short-term teacher in the $(j+1)$-th epoch. The short-term teacher is updated at every epoch to remain up-to-date.
Long-Term Teacher.
At the beginning of the last epoch in the $k$-th mini-generation (see Figure 1), the training dataset is shuffled again into $\mathcal{D}^{k,m}$, where $m$ denotes the number of epochs per mini-generation. Suppose that at the $t$-th step, data $x$ is selected as training data to update the parameters of the corresponding snapshot model $\theta^{k,m}_{t}$. Then, $\theta^{k,m}_{t}$ will serve as the long-term teacher for $x$ and remain fixed in the $(k+1)$-th mini-generation. Instead of storing $\theta^{k,m}_{t}$, we maintain a long-term teacher vector to capture the probability distribution of $x$,
$$l^{k}_{x} = f(x;\theta^{k,m}_{t}),$$
where we use $l^{k}_{x}$ to denote the long-term teacher vector of $x$ for clarity. After the $k$-th mini-generation of training has completed, the long-term teacher vector $l^{k}$, which contains knowledge of all snapshots in the last epoch of the $k$-th mini-generation, will be used as long-term teachers in the $(k+1)$-th mini-generation. The long-term teacher is updated only in the last epoch of every mini-generation, and remains unchanged in other epochs.
Long Short-Term Teacher–Student Optimization.
In the $k$-th mini-generation, besides the ground truth supervision, each sample is provided with a long-term teacher from the previous mini-generation and a short-term teacher from the previous epoch as described above. Therefore, the model is required not only to correctly predict the ground truth label, but also to simultaneously approximate the probability distributions of the long-term teacher and the short-term teacher. Without loss of generality, let us consider the second epoch in the $k$-th mini-generation (see Figure 1). The short-term teacher comes from the first epoch of the $k$-th mini-generation, and the long-term teacher comes from the last epoch of the $(k-1)$-th mini-generation. At the beginning of the second epoch, the dataset is shuffled into $\mathcal{D}^{k,2}$. Suppose at the $t$-th step, a mini-batch $\mathcal{B}_t$ was sampled from $\mathcal{D}^{k,2}$ to update the parameters. The supervision signal of each sample $x$ consists of three components: the ground truth label $y$, the long-term teacher signal $l^{k-1}_{x}$, and the short-term teacher signal $s^{k,1}_{x}$. The training objective of $\theta^{k,2}_{t}$ can be formulated as
$$\mathcal{L}(\theta^{k,2}_{t};\mathcal{B}_t) = \frac{1}{|\mathcal{B}_t|}\sum_{(x,y)\in\mathcal{B}_t}\Big[-\log f_{y}(x;\theta^{k,2}_{t}) + \lambda_{l}\,\mathrm{KL}\big(l^{k-1}_{x}\,\|\,f(x;\theta^{k,2}_{t})\big) + \lambda_{s}\,\mathrm{KL}\big(s^{k,1}_{x}\,\|\,f(x;\theta^{k,2}_{t})\big)\Big]. \tag{1}$$
Here, $\lambda_{l}$ and $\lambda_{s}$ denote the weights of the long-term teacher signal and the short-term teacher signal, respectively, $\theta^{k,2}_{t}$ denotes the parameters of the corresponding snapshot model at the $t$-th step in the second epoch of the $k$-th mini-generation, and $l^{k-1}_{x}$ and $s^{k,1}_{x}$ represent the long-term teacher signal and the short-term teacher signal for sample $x$, respectively. The LSTSD procedure is given more formally as Algorithm 1.
As indicated by Equation 1, each sample in $\mathcal{B}_t$ has two teacher snapshots from the previous training process. The long-term teacher provides a stable signal establishing teacher–student differences, and the short-term teacher provides a more up-to-date signal guaranteeing the quality of the teacher. Since the teachers for each sample are unique, and $\mathcal{D}^{k-1,m}$, $\mathcal{D}^{k,1}$, and $\mathcal{D}^{k,2}$ contain the same samples but in different orders, the $t$-th mini-batch in $\mathcal{D}^{k,2}$ contains samples from different batches in $\mathcal{D}^{k-1,m}$ and $\mathcal{D}^{k,1}$. Thus, the model may learn from numerous long-term teachers and short-term teachers at the same time. Furthermore, since we always save the probability distributions of the samples rather than the parameters of the teacher snapshots, there is no need to calculate the probabilities of the teacher snapshots repeatedly. LSTSD brings almost no extra computational cost, making it more widely applicable in a variety of settings.
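The bookkeeping described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not a reproduction of Algorithm 1: the model is abstracted behind two hypothetical callbacks, `forward(i)` (class probabilities of sample `i` under the current parameters) and `update(batch)` (one optimization step), and the names `lstsd_loss` and `lstsd_train` are introduced only for this example. Samples without a stored teacher, as in the first epoch or the first mini-generation, are trained on the ground truth alone.

```python
import math
import random

def kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def lstsd_loss(student, label, long_t, short_t, lam_long, lam_short):
    """Cross-entropy plus weighted teacher terms; a missing teacher (None) adds nothing."""
    loss = -math.log(student[label])
    if long_t is not None:
        loss += lam_long * kl(long_t, student)
    if short_t is not None:
        loss += lam_short * kl(short_t, student)
    return loss

def lstsd_train(num_samples, num_epochs, epochs_per_minigen, batch_size,
                forward, update, seed=0):
    """Run the loop and maintain one long-term and one short-term vector per sample."""
    rng = random.Random(seed)
    short_teacher = [None] * num_samples  # refreshed in every epoch
    long_teacher = [None] * num_samples   # refreshed only in the last epoch of a mini-generation
    for epoch in range(num_epochs):
        last_of_minigen = (epoch + 1) % epochs_per_minigen == 0
        order = list(range(num_samples))
        rng.shuffle(order)
        for start in range(0, num_samples, batch_size):
            idx = order[start:start + batch_size]
            # Probabilities of the current snapshot, computed before the update.
            probs = {i: forward(i) for i in idx}
            # Each sample is paired with the teachers stored when it was last visited.
            update([(i, long_teacher[i], short_teacher[i]) for i in idx])
            for i in idx:
                short_teacher[i] = probs[i]
                if last_of_minigen:
                    long_teacher[i] = probs[i]
    return long_teacher, short_teacher
```

Because every sample is visited exactly once per epoch, each stored vector is read before it is overwritten, so the long-term vectors recorded in the last epoch of one mini-generation stay fixed until the last epoch of the next one. Only two probability vectors per sample are kept, and no teacher forward pass is needed.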
4 Experiments
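Table 1: Classification accuracy (%) on CIFAR-100.

| Method | ResNet-20 | ResNet-32 | ResNet-56 | ResNet-110 | DenseNet-100 |
| --- | --- | --- | --- | --- | --- |
| Vanilla | 66.43 | 68.39 | 70.06 | 71.47 | 78.00 |
| Mean Teacher | 68.37 | 70.26 | 72.00 | 72.57 | 76.80 |
| Snapshot Ensembles | 67.46 | 69.49 | 70.45 | 71.91 | 78.00 |
| Temporal Ensembles | 67.90 | 69.79 | 71.20 | 71.99 | 77.13 |
| Snapshot Distillation | 68.24 | 69.84 | 70.78 | 72.48 | 78.83 |
| LSTSD | 69.42 | 71.51 | 73.17 | 73.83 | 79.35 |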
To verify the effectiveness and generalization ability of our proposed Long Short-Term Sample Distillation technique, we conducted a comprehensive series of experiments with different neural network architectures on both vision and NLP tasks. In this section, we introduce the baselines and experimental settings, and analyze the experimental results.
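Table 2: Results on four GLUE tasks (accuracy in % for RTE, MRPC, and SST-2; Matthews correlation for CoLA).

| Network | Method | RTE | MRPC | SST-2 | CoLA |
| --- | --- | --- | --- | --- | --- |
| BERT | Vanilla | 72.20 | 86.03 | 93.00 | 58.54 |
| BERT | Mean Teacher | 70.39 | 85.29 | 92.89 | 61.75 |
| BERT | Snapshot Ensembles | 73.29 | 86.76 | 92.32 | 59.53 |
| BERT | Temporal Ensembles | 71.50 | 85.78 | 93.11 | 60.56 |
| BERT | Snapshot Distillation | 74.01 | 87.25 | 93.12 | 60.09 |
| BERT | LSTSD | 74.73 | 89.22 | 93.35 | 61.59 |
| CNN | Vanilla | 53.79 | 70.83 | 70.99 | 9.70 |
| CNN | Mean Teacher | 54.87 | 71.81 | 70.41 | 9.32 |
| CNN | Snapshot Ensembles | 55.60 | 70.83 | 71.67 | 10.51 |
| CNN | Temporal Ensembles | 54.87 | 72.06 | 70.53 | 11.27 |
| CNN | Snapshot Distillation | 56.68 | 73.77 | 71.67 | 12.81 |
| CNN | LSTSD | 57.40 | 73.28 | 72.36 | 14.50 |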
4.1 Baselines
To evaluate our proposed LSTSD, we compared it with Mean Teacher [14], Temporal Ensembles [11], Snapshot Ensembles [9] and Snapshot Distillation [16].
The Mean Teacher approach generates the teacher model by calculating the exponential moving average of the parameters over all training steps, aiming to produce a more accurate teacher model than using the final weights directly and allowing the model to learn from all snapshots in previous training steps. Specifically, the parameters of the teacher model are computed as $\theta'_{t} = \alpha\,\theta'_{t-1} + (1-\alpha)\,\theta_{t}$ at the $t$-th iteration. As suggested by the original paper, we set $\alpha = 0.999$.
Temporal Ensembles saves, for each sample, the moving average of the probabilities produced by the neural network when the sample was selected as training data in the previous training process, rather than saving the parameters of the neural network. Specifically, the moving average probability is computed as $Z \leftarrow \alpha Z + (1-\alpha)\,z$ at every epoch, where $Z$ denotes the moving average probability and $z$ denotes the probability at the current time step. As suggested by the original paper, we set $\alpha = 0.6$.
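Both baselines rely on the same exponential-moving-average update, applied to parameters once per iteration in Mean Teacher and to per-sample probabilities once per epoch in Temporal Ensembles. A minimal sketch of that shared update is given below; the function name `ema_update` and the example values are assumptions for illustration, and any start-up bias correction is omitted.

```python
def ema_update(average, current, alpha):
    """Blend the running average with the newest value, element by element."""
    return [alpha * a + (1.0 - alpha) * c for a, c in zip(average, current)]

# Temporal-Ensembles-style use: per-sample probabilities, refreshed once per epoch.
running_probs = [1.0, 0.0]
running_probs = ema_update(running_probs, [0.0, 1.0], alpha=0.6)

# Mean-Teacher-style use: model parameters, refreshed once per iteration.
teacher_params = [0.0, 0.0]
teacher_params = ema_update(teacher_params, [1.0, -1.0], alpha=0.999)
```

With a decay close to 1 the average changes only slightly per call, which is why the per-iteration and per-epoch variants use such different values.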
Snapshot Ensembles divides the training process into several mini-generations, in each of which the model is trained with a cyclic learning rate to force the model to converge to different well-performing local minima. After training, the last snapshots of each mini-generation are ensembled to boost the performance.
Similar to Snapshot Ensembles, Snapshot Distillation also divides the overall training process into several mini-generations. In each mini-generation, the last snapshot of the previous mini-generation is used as a teacher. To ensure a difference between student and teacher, a cyclic learning rate is applied in each mini-generation.
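To make the notion of a cyclic learning rate concrete, the sketch below implements a cosine schedule that restarts at the beginning of every mini-generation. The exact schedule is not specified in this section, so the cosine form, the function name `cyclic_cosine_lr`, and the example values are assumptions chosen for illustration.

```python
import math

def cyclic_cosine_lr(step, steps_per_cycle, base_lr):
    """Learning rate that decays from base_lr towards zero and restarts every cycle."""
    position = (step % steps_per_cycle) / steps_per_cycle  # progress within the cycle, in [0, 1)
    return 0.5 * base_lr * (math.cos(math.pi * position) + 1.0)

# Hypothetical example: two cycles of ten steps each with a base rate of 0.1.
schedule = [cyclic_cosine_lr(step, steps_per_cycle=10, base_lr=0.1) for step in range(20)]
```

The rate falls smoothly within each cycle and jumps back to `base_lr` at the restart, which pushes the model away from the minimum reached at the end of the previous cycle.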
4.2 Experimental Setup
We applied all methods to ResNets and DenseNets for vision tasks, and to CNNs and BERT for NLP tasks. For all baselines, we used the hyperparameters mentioned above, and for LSTSD, we set each mini-generation to a fixed number of epochs. To better understand LSTSD, besides comparing it with these baselines, we also conducted experiments on several variants of LSTSD to measure the influence of the long-term teacher, the short-term teacher, and numerous teachers separately. In addition, we conducted a sensitivity analysis to investigate how the length of the mini-generation influences performance.

Table 3: Accuracy (%) of LSTSD variants on CIFAR-100. Values in parentheses give the drop relative to the full LSTSD.

| Method | ResNet-20 | ResNet-32 | ResNet-56 | ResNet-110 |
| --- | --- | --- | --- | --- |
| LSTSD | 69.42 (0.00) | 71.51 (0.00) | 73.17 (0.00) | 73.83 (0.00) |
| LSTSD (w/o Long) | 69.09 (0.34) | 71.16 (0.35) | 73.15 (0.02) | 73.35 (0.48) |
| LSTSD (w/o Short) | 68.82 (0.60) | 70.71 (0.80) | 72.79 (0.38) | 73.23 (0.60) |
| LSTSD (single) | 67.85 (1.57) | 69.88 (1.63) | 70.66 (2.51) | 72.25 (1.58) |
Computer Vision.
For vision, we evaluate LSTSD on the CIFAR-100 dataset, which contains 60,000 RGB images of size 32×32, split into a training set of 50,000 images and a testing set of 10,000 images. The images are uniformly distributed over all 100 labels, examples of which include bottle, bed, clock, and apple. We investigate two groups of baseline models. The first group contains ResNets with different numbers of layers (20, 32, 56, 110) as baseline backbones, with architectures matching those of He et al. [7]. The second group contains DenseNets with 100 layers, in which the base feature length and growth rate are 24 and 80, respectively [10]. ResNets are trained for 164 epochs with a batch size of , while DenseNets are trained for 300 epochs with a batch size of . We trained both ResNets and DenseNets using SGD with a weight decay of , a Nesterov momentum of 0.9 and a base learning rate of , which was divided by at the , , of the training process.
Standard data augmentation was applied in the training process, i.e., each image was symmetrically padded with a 4-pixel margin on each of the four sides. In the enlarged 40×40 image, a subregion with 32×32 pixels is randomly cropped and flipped with a probability of 0.5.
We set the length of each mini-generation to 40 epochs for Snapshot Ensembles and Snapshot Distillation following Yang et al. [16]. The best weights of the teacher loss for all baselines were determined by grid search. For LSTSD, we determined the best teacher weights and a mini-generation length of 6 epochs by grid search using a residual network with 20 layers, and used the same setting for the other network backbones. Following Hinton et al. [8], we divided the teacher and student signals (in logits, the neural responses before the softmax) by a temperature coefficient when calculating the distillation losses, which has been shown to be effective for softening the teacher signal and student signal in Teacher–Student Optimization.
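The effect of the temperature coefficient can be illustrated with a short sketch: dividing the logits by a temperature greater than 1 before the softmax flattens the resulting distribution. The function name `softmax_with_temperature` and the example logits and temperatures are assumptions for illustration and do not reflect the values used in the experiments.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; higher values give softer outputs."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=4.0)
```

The ranking of the classes is unchanged, but `soft` assigns visibly more probability to the non-maximal classes than `sharp`, exposing the class-level similarity that the student is meant to learn.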
Natural Language Processing.
For NLP, we used the well-known GLUE benchmark data [15], which is a collection of diverse natural language understanding tasks, including question answering, sentiment analysis, text similarity, and textual entailment. Among all datasets in GLUE, we selected several classification datasets to conduct experiments, including RTE, MRPC, CoLA, and SST-2. We used BERT [4] and CNNs as baseline backbones. BERT has 12 layers, each of which has 12 self-attention heads with the hidden layer size set to 768. We initialized BERT with the parameters provided by Devlin et al. [4], which were trained with a Masked Language Model (MLM) objective on a large unannotated corpus. We optimized BERT using Adam for 50 epochs, with the base learning rate set to and batch size set to 64. We initialized CNNs randomly and optimized them using SGD for 50 epochs with a learning rate of and a batch size of 32. We set the temperature to 1: since there are only a few classes in the GLUE datasets, the probability distributions are much smoother than in datasets with a large number of classes, and no further softening is needed.
4.3 Experimental Results
Computer Vision.
On vision tasks, as shown in Table 1, LSTSD brings consistent accuracy gains for all models, regardless of the network backbone. Specifically, LSTSD achieves accuracies of 69.42%, 71.51%, 73.17%, and 73.83% for residual networks with 20, 32, 56, and 110 layers, respectively, and 79.35% for DenseNet-100.
On the ResNet backbones, all methods outperform the vanilla networks at every depth, which demonstrates the effectiveness of introducing either long-term knowledge or short-term knowledge from the previous training process of the same generation to help the optimization of neural networks. In Temporal Ensembles, the teacher signal quickly decays by 0.6 per epoch, which makes it more like a short-term signal guaranteeing the quality of the teacher. In the Mean Teacher approach, the teacher signal decays by 0.999 at every iteration, which amounts to about 0.6 per epoch with 500 iterations per epoch. Thus, Mean Teacher is also more like a short-term teacher. Moreover, the teacher signal in Snapshot Distillation remains fixed in each mini-generation, which makes it more like a long-term signal guaranteeing teacher–student differences. The fact that LSTSD outperforms all of these demonstrates the advantage of decomposing the teacher signal into long-term and short-term signals and leveraging both simultaneously.
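The per-epoch figure quoted for Mean Teacher follows from compounding the per-iteration decay over one epoch, as the short calculation below shows. The variable names are illustrative.

```python
per_iteration_decay = 0.999
iterations_per_epoch = 500

# Weight that remains on an old snapshot after one full epoch of updates.
per_epoch_decay = per_iteration_decay ** iterations_per_epoch  # roughly 0.606
```

The result is close to the per-epoch decay of 0.6 used by Temporal Ensembles, which is why both baselines behave like short-term teachers.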
Natural Language Processing.
On NLP tasks, as shown in Table 2, LSTSD also outperforms or is comparable to the other methods on the four datasets. Specifically, when applied to BERT, LSTSD achieves accuracies of 74.73%, 89.22%, and 93.35% on RTE, MRPC, and SST-2, respectively, outperforming all other baselines and vanilla BERT. It achieves a Matthews correlation of 61.59 on CoLA, which is comparable with Mean Teacher (61.75). Similarly, when applied to CNNs, LSTSD outperforms all baselines and vanilla CNNs on RTE, SST-2, and CoLA, and is comparable to Snapshot Distillation on MRPC. It is worth mentioning that BERT and CNNs are substantially different architectures, since the core of BERT is an attention mechanism, while the core of a CNN is convolution. Despite the great difference between BERT and CNNs, LSTSD achieves consistent gains with both of them, which further establishes the generalization ability of LSTSD.
Analysis of Model Variants.
To better understand Long Short-Term Sample Distillation, we conducted experiments on CIFAR-100 using ResNet-20 to evaluate the importance of the long-term teacher and the short-term teacher separately. Specifically, to evaluate the importance of the long-term teacher signal, we set its weight to zero while retaining the short-term teacher signal, denoted by LSTSD (w/o Long). Similarly, we evaluate the importance of the short-term teacher signal by setting its weight to zero while retaining the long-term teacher signal, denoted by LSTSD (w/o Short). As shown in Table 3, eliminating long-term knowledge or short-term knowledge degrades the performance, suggesting that it is necessary to leverage both long-term and short-term knowledge jointly.
In Long Short-Term Sample Distillation, each sample has unique teachers, enabling the model to learn from numerous teachers. To validate whether the model benefits from numerous teachers, we compare LSTSD with a variant in which all samples learn from a single teacher. Specifically, rather than taking the snapshot at the time a sample was selected as training data as the teacher model for that sample, we use the last snapshot of the previous mini-generation as the long-term teacher and the last snapshot of the previous epoch as the short-term teacher, such that all samples share the same long-term teacher and short-term teacher in every epoch (denoted as LSTSD (single) in Table 3). The comparison between LSTSD (single) and LSTSD shows that replacing numerous teachers with one teacher degrades the performance by 1.57 to 2.51 points, which shows the advantage of learning from numerous teachers at the same time.
Sensitivity Analysis.
Teacher–student differences are closely related to the length of each mini-generation. Thus, it is necessary to investigate the best choice for the length of each mini-generation. We conducted a sensitivity analysis on the length of each mini-generation on CIFAR-100 using ResNet-20. As shown in Figure 2, the performance of LSTSD first improves as the length of the mini-generation increases and then gradually declines as the length increases further. This is because a mini-generation that is too short cannot guarantee teacher–student differences, while one that is too long may introduce teachers of too low quality, which might mislead the training process.
5 Conclusions
In this paper, we propose a novel training policy called Long Short-Term Sample Distillation to train neural networks while relying on previous training updates for improved supervision. Our method decomposes the teacher signal for each sample from the previous training process into a long-term signal and a short-term one. The long-term teacher signal provides a stable teacher signal and guarantees teacher–student differences, while the short-term one ensures high-quality teaching. Additionally, each sample has unique teachers, enabling the model to learn from numerous teachers over the course of training. The experimental results demonstrate the effectiveness of leveraging a long-term teacher and a short-term teacher simultaneously, and of learning from numerous teachers at the same time.
References
[1] (2014) Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pp. 2654–2662.
[2] (2006) Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 535–541.
[3] (2017) Sobolev training for neural networks. In Advances in Neural Information Processing Systems, pp. 4278–4287.
[4] (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[5] (2018) Born again neural networks. arXiv preprint arXiv:1805.04770.
[6] (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
[7] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[8] (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
[9] (2017) Snapshot ensembles: train 1, get M for free. arXiv preprint arXiv:1704.00109.
[10] (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
[11] (2016) Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.
[12] (2018) Knowledge distillation from multiple teachers using visual explanations. Ph.D. Thesis, Indian Institute of Technology Hyderabad.
[13] (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
[14] (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, pp. 1195–1204.
[15] (2019) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 7th International Conference on Learning Representations.
[16] (2019) Snapshot distillation: teacher-student optimization in one generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2859–2868.
[17] (2017) A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4133–4141.
[18] (2017) Learning from multiple teacher networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1285–1294.
[19] (2016) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928.