1 Introduction
Modern neural networks are often trained to minimize a cross-entropy loss. We can interpret this cross-entropy loss as the KL divergence from a target distribution over all possible classes to the distribution predicted by the network. This interpretation raises a natural question: what should this target distribution be?
We argue that many, if not all, existing training algorithms for neural networks construct this target distribution based on heuristics. Specifically, in supervised learning, where neural networks are trained with labeled data, the target distribution is often a one-hot vector, or a smoothed version of the one-hot vector,
i.e., label smoothing (inception; label_smoothing_investigation). In semi-supervised learning, the target distributions, also known as pseudo labels, are often generated on unlabeled data by a sharpened or dampened teacher model trained on labeled data, e.g., uda; mixmatch. All such constructions for target distributions are heuristics designed prior to training, and thus they share an inherent weakness: they cannot adapt to the learning state of the neural networks being trained.

Figure 1: Conceptual behaviors of three methods on the TwoMoons dataset. There are 1,000 red points and 1,000 green points distributed on two semicircles, of which only 3 red points and 3 green points are labeled (the stars). A model can rely on both labeled and unlabeled points to find a classifier that best fits the data. The resulting classifiers are shown by the red and green regions.
Left: Supervised learning with these 6 points leads to a wrong classifier. Middle: Pseudo label performs even worse than supervised learning because it relies on supervised learning to label the unlabeled data; in this case, supervised learning makes mistakes in the top-left and bottom-right corners of the figure. Right: Our method, Meta Pseudo Labels (MPL), uses meta learning to train the pseudo labels throughout the course of the model's learning so that the student model performs well on the 6 labeled examples. MPL finds a better classifier.

We propose to meta-learn the target distributions. In particular, we design a teacher model that assigns distributions to input examples to train the main model, which we henceforth refer to as the student model. Throughout the course of the student's training, the teacher observes the student's performance on a held-out validation set, and learns to generate target distributions so that, if the student learns from such distributions, the student will achieve good validation performance. Since the meta-learned target distributions play a similar role to pseudo labels (pseudo_label; yarowsky1995unsupervised; Riloff1996), we name our method Meta Pseudo Labels (MPL). MPL has an apparent advantage: the teacher can adapt to the student's learning state and improve the student's learning accordingly. Figure 1 demonstrates the behavior of MPL on the TwoMoons dataset. By adapting the target distributions to the student's learning state, MPL learns a better classifier than supervised learning and pseudo label.
Our experiments demonstrate substantial improvements over strong baselines and establish state-of-the-art performance on CIFAR-10, SVHN, and ImageNet. For instance, with ResNets on small datasets, we achieve 96.1% accuracy on CIFAR-10 with 4,000 labeled examples and 73.9% top-1 accuracy on ImageNet with 10% labeled examples. Meanwhile, with EfficientNet on full datasets plus extra unlabeled data, we achieve 98.6% accuracy on CIFAR-10 and 86.9% top-1 accuracy on ImageNet.
2 Motivations
In this work, we focus on training a K-way classification model p_θ(y|x), parameterized by θ, such as a neural network. Despite the wide spectrum of algorithms for training classification models, many of them can be summarized as minimizing the cross-entropy between a target distribution q(x) and the model distribution p_θ(y|x), i.e.,

min_θ E_x [ CE(q(x), p_θ(x)) ] = min_θ E_x [ − Σ_y q(y|x) log p_θ(y|x) ]
Under this formulation, different algorithms simply correspond to specific instantiations of the target distribution:


In fully supervised training, the target distribution is defined as the one-hot vector (a single-point distribution) on the observed, annotated ground-truth class, i.e., for a labeled example (x, y*), q(x) = onehot(y*).

In knowledge distillation (KD; knowledge_distillation), to compress the “dark knowledge” of a well-trained larger model into a smaller one, for each data point x, the predicted distribution of the large model is directly taken as the target distribution, i.e., q(x) = p_large(·|x).

In semi-supervised learning (SSL), a typical solution first employs an existing model (trained on limited labeled data) to predict the class for each data point from an unlabeled set, and utilizes the prediction to construct the target distribution. There are two common versions: a hard version, where the target is the one-hot vector of the predicted class, and a soft version, where the target is the existing model's full predicted distribution.
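A minimal sketch of these two constructions, assuming a matrix of teacher-predicted class probabilities (names and shapes are illustrative):

```python
import numpy as np

def hard_pseudo_label(teacher_probs):
    """One-hot target on the teacher's argmax class (hard version)."""
    k = teacher_probs.shape[-1]
    return np.eye(k)[np.argmax(teacher_probs, axis=-1)]

def soft_pseudo_label(teacher_probs):
    """The teacher's full predicted distribution as the target (soft version)."""
    return teacher_probs

probs = np.array([[0.7, 0.2, 0.1],
                  [0.4, 0.5, 0.1]])
hard = hard_pseudo_label(probs)   # rows are one-hot on classes 0 and 1
soft = soft_pseudo_label(probs)
```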
While these classic target distributions generally work well, recent works find that they are often not optimal. Instead, heuristic methods have been exploited to slightly adjust the target distribution, leading to improved performance. Here, we review two notable examples.
Label smoothing
It has been found that using the one-hot vector as the target distribution above in fully supervised machine translation and large-scale image classification, such as ImageNet, can lead to overfitting. To combat this phenomenon, label smoothing is proposed to smooth the one-hot distribution by allocating a small amount of uniform weight to all classes, i.e., for a labeled example (x, y*), the target distribution is redefined as

q(y|x) = (1 − ε) · onehot(y*) + ε / K

where ε is a small smoothing constant and K is the number of classes.
However, while label smoothing often helps at convergence, it also results in slower training.
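A minimal sketch of the smoothing rule above, with ε the smoothing constant and K the number of classes:

```python
import numpy as np

def smooth_one_hot(y, num_classes, eps=0.1):
    """Allocate eps uniformly over all classes, (1 - eps) extra on the true class."""
    q = np.full(num_classes, eps / num_classes)
    q[y] += 1.0 - eps
    return q

q = smooth_one_hot(2, num_classes=4, eps=0.1)
# true class gets 0.9 + 0.025 = 0.925; every other class gets 0.025
```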
Temperature Tuning
For both KD and soft-label SSL, it has been found that explicitly introducing a temperature hyperparameter to modulate the target distribution can be very helpful. Specifically, let z_i be the i-th logit predicted by the teacher model, e.g., the large model in KD or the existing model in SSL. The target distribution is then defined as

q_i(x) = exp(z_i / τ) / Σ_j exp(z_j / τ)

where τ is the temperature, which can be used to smooth (τ > 1) or sharpen (τ < 1) the distribution.¹ Intuitively, a smoother distribution can help prevent overfitting, or early mistakes in SSL. On the other hand, a sharper target can potentially speed up training, provided it is correct.

¹ Note that as τ → 0, we recover the hard-label case.
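A minimal sketch of temperature modulation (logit values illustrative):

```python
import numpy as np

def softmax_with_temperature(logits, tau):
    """Softmax over logits scaled by a temperature tau."""
    z = logits / tau
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])
p = softmax_with_temperature(logits, 1.0)       # unmodified distribution
sharp = softmax_with_temperature(logits, 0.5)   # tau < 1: sharper
smooth = softmax_with_temperature(logits, 2.0)  # tau > 1: smoother
```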
From the success of these heuristic tricks, it is clear that the construction of the target distribution plays an important role in algorithm design, and a proper method can lead to a sizable gain. Motivated by this observation, we focus in this work on the construction of target distributions. In particular, instead of designing target distributions from scratch, we ask the question: is there a generic and systematic method that can modify the target distribution of an existing algorithm, yielding an improved target distribution and thus better performance?
As the first step towards this goal, we identify two intrinsic limits of many existing constructions:


The target distribution q(x) is either chosen prior to training and kept fixed afterwards, or annealed/updated during training with an ad-hoc procedure;

The modulation (smoothing or sharpening) of q(x) does not depend on the specific data point x in consideration.
Ideally, q(x) should adapt to the learning state of the model p_θ. For example, when the model is already confident enough about a data point at some step of training, the target distribution may need to be smoothed to avoid overfitting this specific training instance. Figure 3 illustrates such an overfitting scenario from the perspective of train-validation discrepancy, where the gradient computed using the target distribution could push the student into a bad local minimum, which could be prevented by an alternative, noisier direction. With this motivation and intuition in mind, we next turn to our proposed method.
3 Meta Pseudo Labels
Our solution to the shortcomings of manually constructing the target distribution is to learn q(x) throughout the course of training p_θ. In particular, we parameterize the target distribution as q_ψ(x) and train ψ using gradient descent. In Section 3.2, we describe two different parameterizations of q_ψ. For now, it is sufficient to treat q_ψ as a classification model, which assigns conditional probabilities to the different classes of each input example x. We train ψ based on the following principle: if θ follows the gradients of the cross-entropy against q_ψ on training data, the resulting student parameters should achieve a small loss on a held-out validation set.

Clearly, q_ψ serves the same role as the target distributions of supervised learning, KD, and SSL (Section 2), as q_ψ provides the pseudo labels for θ to learn from. Due to this similarity, we follow the existing literature and call q_ψ the teacher model, and call p_θ the student model. Furthermore, since the stated principle for optimizing ψ is essentially a meta-learning problem (see Appendix A), we name our method Meta Pseudo Labels (MPL).
3.1 MPL’s Update Rules for Teacher and Student
As illustrated in Figure 2, each training step of MPL consists of two phases:
Phase 1: The Student Learns from the Teacher.
In this phase, given a single input example x, the teacher produces the conditional class distribution q_ψ(x) to train the student. Note that x does not need to come with any human-annotated label, as the teacher computes its class distribution q_ψ(x). The pair (x, q_ψ(x)) is then shown to the student, which updates its parameters by backpropagating through the cross-entropy loss. For instance, if θ is trained with SGD with a learning rate η, then we have:

θ' = θ − η ∇_θ CE(q_ψ(x), p_θ(x))    (1)
Phase 2: The Teacher Learns from the Student’s Validation Loss.
After the student updates its parameters as in Equation 1, its new parameters θ' are evaluated on an example (x_val, y_val) from the held-out validation dataset, using the cross-entropy loss CE(y_val, p_θ'(x_val)). Since θ' depends on ψ via Equation 1, this validation cross-entropy loss is a function of ψ. Specifically, dropping x and x_val from the equations for readability, we can write:

L(ψ) = CE(y_val, p_θ')  where  θ' = θ − η ∇_θ CE(q_ψ, p_θ)    (2)
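To make the two phases concrete, here is a minimal toy sketch with a one-parameter logistic student and teacher, where the meta-gradient of the updated student's validation loss with respect to the teacher can be written in closed form via the chain rule through the student's SGD step. All names, shapes, and constants are illustrative, not the paper's implementation; in practice this gradient-of-gradient is produced by an autodiff framework rather than derived by hand.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(q, p):
    """Binary cross-entropy between a target probability q and a prediction p."""
    return -(q * np.log(p) + (1.0 - q) * np.log(1.0 - p))

def student_step(theta, psi, x, eta):
    """Phase 1: one SGD step of the student on the teacher-provided target."""
    q = sigmoid(psi * x)              # teacher's target distribution q_psi(x)
    p = sigmoid(theta * x)            # student's prediction p_theta(x)
    grad_theta = (p - q) * x          # d/dtheta of bce(q, sigmoid(theta * x))
    return theta - eta * grad_theta   # Equation (1)

def teacher_meta_grad(theta, psi, x, eta, x_val, y_val):
    """Phase 2: d/dpsi of the updated student's validation loss (Equation 2)."""
    theta_new = student_step(theta, psi, x, eta)
    p_val = sigmoid(theta_new * x_val)
    dL_dtheta_new = (p_val - y_val) * x_val          # validation-loss gradient
    q = sigmoid(psi * x)
    dtheta_new_dpsi = eta * x * x * q * (1.0 - q)    # chain rule through Phase 1
    return dL_dtheta_new * dtheta_new_dpsi

meta_grad = teacher_meta_grad(theta=0.1, psi=0.3, x=1.5, eta=0.5,
                              x_val=2.0, y_val=1.0)
```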
This dependency allows us to compute the gradient of the validation loss with respect to ψ, and thereby update ψ to minimize it. This differentiation requires computing the gradient of a gradient, which can be implemented by modern automatic differentiation frameworks such as TensorFlow (tensorflow).

3.2 Instantiating the Teacher
While the student’s performance allows the teacher to adjust and adapt to the student’s learning state, this signal alone is not
sufficient to train the teacher. In essence, the teacher observing the student’s validation loss to improve itself is similar to an agent in reinforcement learning (RL) performing onpolicy sampling and learning from its own rewards. Due to the potentially high sampling complexity, when the teacher has observed enough evidence to produce meaningful target distributions to teach the student, the student might have already entered a bad region of parameters.
A similar shortcoming has been observed when training neural machine translation (NMT) models with RL (dad_nmt; mixer_nmt). Similar to MPL, RL training leads to better self-adaptive behaviors of NMT models. However, training with RL requires on-policy sampling from the NMT model, and hence fails if the NMT model is not sufficiently trained a priori to produce reasonably correct samples. For this reason, NMT models must be trained in a supervised manner prior to being trained with RL (dad_nmt), or must be trained with a mixed signal from both RL and supervised learning throughout their course of learning (mixer_nmt).

Here, we follow mixer_nmt and add a supervised signal to MPL's teacher. In particular, at each training step, apart from the MPL update in Equation 2, the teacher also computes a gradient on a pair of labeled data. This gradient is then added to the MPL gradient from Equation 2 to update the teacher's parameters ψ. In practice, we use the student's validation data to supervise the teacher, as illustrated in Figure 4. While the MPL algorithm interacts extensively with this so-called validation data, the student never directly learns from the validation set, which effectively avoids overfitting. In fact, we observe no sign of overfitting in our experiments.
Adding the supervised signal to MPL introduces an implementation difficulty: we need to keep two classification models, the teacher and the student, in memory. While it is possible to train the teacher-student pair with small architectures such as ResNets, for architectures with large memory footprints, e.g., EfficientNet (efficient_net), keeping two models limits the training batch size and slows down training. To allow training large models on large datasets, we design a more economical alternative for instantiating the teacher, termed Reduced MPL.
In Reduced MPL, as shown in Figure 5, we first train a large teacher model to convergence. Next, we use it to pre-compute all target distributions for the student's training data. Importantly, up to this step, the student model has not been loaded into memory, which avoids the large memory footprint of MPL. Then, we parameterize a reduced teacher as a small and efficient network, such as a multi-layer perceptron (MLP), to be trained along with the student. This reduced teacher takes as input the distribution predicted by the large teacher and outputs a calibrated distribution for the student to learn. Intuitively, Reduced MPL works reasonably well because the large teacher is reasonably accurate, so many actions of the reduced teacher are close to an identity map, which an MLP can handle. Meanwhile, Reduced MPL retains the benefit of MPL, as the teacher can still adapt to the learning state of the student p_θ.
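The reduced teacher can be sketched as a tiny MLP over the large teacher's distribution. This is only a hedged sketch: the residual connection in logit space, which keeps the initial mapping close to identity, is our illustrative choice, not a detail taken from the paper, and the sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

class ReducedTeacher:
    """Tiny MLP mapping the large teacher's distribution to a calibrated one.
    Sizes are illustrative; the paper uses a 5-layer MLP with 128/512 units."""
    def __init__(self, num_classes, hidden=16):
        self.w1 = rng.normal(0.0, 0.01, (num_classes, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.01, (hidden, num_classes))
        self.b2 = np.zeros(num_classes)

    def __call__(self, teacher_probs):
        h = np.maximum(0.0, teacher_probs @ self.w1 + self.b1)   # ReLU layer
        # residual connection in log-probability space (an assumption of this
        # sketch) keeps the initial mapping close to the identity
        logits = h @ self.w2 + self.b2 + np.log(teacher_probs + 1e-8)
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

rt = ReducedTeacher(num_classes=3)
p = np.array([[0.7, 0.2, 0.1]])
calibrated = rt(p)   # near p at initialization; adapts as rt is trained
```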
4 Experiments
We demonstrate the effectiveness of MPL in two scenarios: 1) reduced datasets (Section 4.1), where only limited labeled data is available; and 2) full datasets (Section 4.2), where all labeled data is used. In both scenarios, we experiment on CIFAR-10 (cifar10), SVHN (svhn), and ImageNet (imagenet). For the full-dataset experiments, we use Reduced MPL due to the large memory footprint of MPL. Our goal is to experimentally confirm the benefit of MPL, which we re-emphasize as follows:
A teacher model is trained along with a student model to set the student’s target distributions and adapt to the student’s learning state.
4.1 Experiments with MPL on Reduced Datasets
We first compare MPL with existing semi-supervised learning algorithms on standard benchmarks with reduced labeled sets: CIFAR-10 with 4,000 labeled examples, SVHN with 1,000 labeled examples, and ImageNet-10%.
Experiment Details.
For CIFAR-10 and SVHN, we use a pre-activated WideResNet-28-2 (WRN-28-2), which has 1.5 million parameters (wide_res_net). For ImageNet, we use a ResNet-50, which has 25.6 million parameters (res_net). We use 4,000 labeled examples from CIFAR-10, 1,000 labeled examples from SVHN, and roughly 128,000 labeled examples from ImageNet, which is approximately 10% of the whole ImageNet dataset. These images and their labels play two roles in MPL training. First, they serve as the validation data on which the teacher measures the student's performance (Equation 2). Second, they are also the labeled data for the teacher (Figure 4).
Baselines.
Our main baseline is Unsupervised Data Augmentation (UDA; uda). We choose UDA for its state-of-the-art performance on the datasets and models in this section. UDA is a consistency-regularization technique, which belongs to the category of semi-supervised learning (Section 2). In addition to UDA, we consider three other baselines: supervised learning, label smoothing, and RandAugment (rand_augment). Our goal is to show that MPL can improve the performance of all these methods, further confirming the advantage of the adaptive teacher in MPL. We re-implement all baselines in our environment and allocate the same amount of resources to tune hyperparameters for every baseline. For each baseline, we compare its accuracy with the accuracy of MPL's student, where the student learns from a teacher trained with the baseline algorithm plus the MPL signal. Further details are in Appendix C.
Methods                               | CIFAR-10 (4,000) | SVHN (1,000)
--------------------------------------|------------------|--------------
Temporal Ensemble (temporal_ensemble) |                  |
Mean Teacher (mean_teacher)           |                  |
VAT + EntMin (vat)                    |                  |
LGA + VAT (lga)                       |                  |
ICT (ict)                             |                  |
MixMatch (mixmatch)                   |                  |
--------------------------------------|------------------|--------------
Supervised                            |                  |
Label Smoothing                       |                  |
Supervised + MPL                      | 83.71 ± 0.21     | 91.89 ± 0.14
RandAugment (rand_augment)            |                  |
RandAugment + MPL                     | 87.55 ± 0.14     | 94.02 ± 0.05
UDA (uda)                             |                  |
UDA + MPL                             | 96.11 ± 0.07     | 98.01 ± 0.07

Table 1: Accuracy (%) of MPL and baselines on CIFAR-10 (4,000 labels) and SVHN (1,000 labels).
Results on CIFAR-10 and SVHN.
In Table 1, we present our results with MPL on CIFAR-10 and SVHN, showing that MPL improves the accuracy of all baseline methods. For reference, we also include the results of a few other semi-supervised learning methods in the first block of Table 1. However, since these methods do not share our controlled environment, the comparison to them is not direct and should be contextualized (realistic_eval).
We observe that with 4,000 labeled examples for CIFAR-10 and 1,000 for SVHN, supervised training is prone to severe overfitting. Label smoothing and data augmentation, two of our baselines, are often used to reduce overfitting. From Table 1, we see that label smoothing improves the accuracy on SVHN but fails to improve the accuracy on CIFAR-10. In contrast, MPL outperforms label smoothing on both datasets by about 1.5%. Meanwhile, RandAugment (rand_augment) significantly improves the accuracy on both CIFAR-10 and SVHN, but MPL further boosts the accuracy by 2% on CIFAR-10 and by 0.4% on SVHN. Finally, MPL improves over UDA by 1.5% on CIFAR-10 and by 0.9% on SVHN. This improvement, along with the previous results, confirms our hypothesis about the benefit of MPL.
To our surprise, MPL even outperforms a WRN-28-2 trained on all labeled examples of CIFAR-10 and SVHN. Specifically, on average, our WRN-28-2 achieves 94.9% accuracy on full CIFAR-10 and 97.4% on full SVHN, both lower than UDA+MPL's accuracy as reported in the last row of Table 1. This means that UDA+MPL can be more than 10x more data-efficient.
Results on ImageNet-10%.
The gain of MPL here is even more significant than on CIFAR-10 and SVHN. As shown in Figure 6, MPL outperforms UDA by almost 6% in top-1 accuracy, going from 68.07% to 73.89%. MPL also surpasses the best published top-1 accuracy of 73.21%, achieved by self-supervised semi-supervised learning with a 4x-wider ResNet-50 (s4l).
MPL also continues to improve as more labeled data becomes available. In Figure 7, we further compare MPL to supervised learning and RandAugment on 20%, 40%, 80%, and 100% of the labeled examples in ImageNet. The figure shows that MPL delivers substantial gains when labeled data is scarce, but the gains dwindle as more labels become available.
4.2 Results with Reduced MPL on Full Datasets
To evaluate whether MPL can scale to problems with a large number of labeled examples, we now turn to the full labeled sets of CIFAR-10, SVHN, and ImageNet, using out-of-domain unlabeled data for CIFAR-10 and ImageNet. We experiment with Reduced MPL, whose memory footprint allows these large-scale experiments. We show that the benefit of MPL, i.e., having a teacher that adapts to the student's learning state throughout the student's training, still extends to large datasets with more advanced architectures and out-of-domain unlabeled data.
Model Architectures.
For our student model, we use EfficientNet-B0 for CIFAR-10 and SVHN, and EfficientNet-B7 for ImageNet. Meanwhile, our teacher model is a small 5-layer perceptron with ReLU activations, with a hidden size of 128 units for CIFAR-10 and SVHN, and of 512 units for ImageNet.
Labeled Data.
Per standard practice, we reserve 4,000 examples from CIFAR-10, 7,300 examples from SVHN, and 40 data shards of ImageNet for hyperparameter tuning. This leaves about 45,000 labeled examples for CIFAR-10, 65,000 labeled examples for SVHN, and 1.23 million labeled examples for ImageNet. As in Section 4.1, these labeled data serve both as the validation data for the student and as the pre-training data for the teacher.
Unlabeled Data.
For CIFAR-10, our unlabeled data comes from the TinyImages dataset, which has 80 million images (tinyimages). For SVHN, we use the extra images that come with the standard SVHN training set, about 530,000 images. For ImageNet, our unlabeled data comes from the YFCC100M dataset, which has 100 million images (yfcc100m). To collect unlabeled data relevant to the tasks at hand, we use the pre-trained teacher to assign class distributions to images in TinyImages and YFCC100M, and then keep the K images with the highest predicted probability for each class. The values of K are 50,000 for CIFAR-10, 35,000 for SVHN, and 12,800 for ImageNet.
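The per-class top-K filtering step can be sketched as follows (a hypothetical helper; K and array shapes are illustrative):

```python
import numpy as np

def select_top_k_per_class(probs, k):
    """For each class, keep the indices of the k images with the highest
    predicted probability for that class."""
    selected = {}
    for c in range(probs.shape[1]):
        order = np.argsort(-probs[:, c])   # descending by class-c probability
        selected[c] = order[:k].tolist()
    return selected

probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4]])
picks = select_top_k_per_class(probs, k=2)   # {0: [0, 2], 1: [1, 2]}
```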
Baselines.
We compare Reduced MPL to NoisyStudent (noisy_student). NoisyStudent is a self-training approach (Section 2) that applies various regularization techniques to the student model. We choose NoisyStudent because it achieves strong performance on ImageNet and, more importantly, because it can be directly compared to Reduced MPL: the only difference between the two is that Reduced MPL has a teacher that adapts to the student's learning state.
Methods      | CIFAR-10     | SVHN         | ImageNet (top-1 / top-5)
-------------|--------------|--------------|--------------------------
Supervised   |              |              |
NoisyStudent |              | 98.71 ± 0.11 |
Reduced MPL  | 98.56 ± 0.07 | 98.78 ± 0.07 | 86.87 / 98.11

Table 2: Accuracy (%) of Reduced MPL and baselines on full datasets.
Results.
As presented in Table 2, Reduced MPL outperforms NoisyStudent on both CIFAR-10 and ImageNet, and is on par with NoisyStudent on SVHN. In particular, on ImageNet, Reduced MPL with EfficientNet-B7 achieves a top-1 accuracy of 86.87%, which is 1.06% better than the strong NoisyStudent baseline. On CIFAR-10, Reduced MPL improves over NoisyStudent by 0.34% in accuracy, marking a 19% error reduction.
For SVHN, we suspect there are two reasons why the gain of Reduced MPL is not significant. First, NoisyStudent already achieves a very high accuracy. Second, the extra unlabeled images are high quality, which we verified by manual inspection. Meanwhile, for many ImageNet categories, there are not enough relevant images in YFCC100M, so we end up with low-quality or out-of-domain images. On such noisy data, Reduced MPL's adaptive adjustment becomes more crucial to the student's performance, leading to a more significant gain.
5 Analysis
Roadmap.
We seek to understand the reasons for MPL's strong performance. First, in Section 5.1, we use mathematical reasoning to build intuition about what MPL's teacher tries to achieve. However, as we explain, this intuition is hard to observe empirically in large-scale experiments, so we verify it on a synthetic dataset where the predicted behavior can be observed. Next, in Sections 5.2 and 5.3, we examine the empirical behavior of MPL on real datasets to reject two alternative, more trivial explanations of MPL's strong performance.
5.1 Hypothesis: MPL Fits the Validation Gradient
We revisit Equation 2 from Section 3:

L(ψ) = CE(y_val, p_θ')  where  θ' = θ − η ∇_θ CE(q_ψ, p_θ)

Denote g(ψ) = ∇_θ CE(q_ψ, p_θ), so that θ' = θ − η g(ψ). Under regularity conditions, g is a smooth map. This allows us to differentiate L with respect to ψ using the chain rule:

∇_ψ L(ψ) = (∂θ'/∂ψ)ᵀ ∇_θ' CE(y_val, p_θ') = −η J_g(ψ)ᵀ ∇_θ' CE(y_val, p_θ')    (3)

where J_g(ψ) = ∂g/∂ψ is the Jacobian matrix of g. Intuitively, this Jacobian quantifies how much a change in the teacher's parameters affects the student's training gradient. The product in Equation 3 thus quantifies the direction in which the teacher's parameters should change to align the student's training gradient with the student's validation gradient. In other words, in expectation, the teacher encourages the student's training gradient ∇_θ CE(q_ψ, p_θ) to be similar to the student's validation gradient ∇_θ CE(y_val, p_θ).
This is a desirable behavior: neural networks are over-parameterized models prone to overfitting the training set, and to combat this degenerate behavior, we use the validation set for model selection and hyperparameter tuning. MPL's behavior provides an end-to-end way to achieve strong validation performance. Certainly, this behavior introduces a risk of overfitting the validation set; however, as we will see in Section 5.2, this is not the case. We suspect that since the student never directly learns from the validation data, overfitting is avoided.
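The alignment discussed above is measured by the cosine similarity between the training and validation gradients; a minimal sketch:

```python
import numpy as np

def cosine_similarity(g_train, g_val):
    """Alignment between a training gradient and a validation gradient."""
    g_train, g_val = np.ravel(g_train), np.ravel(g_val)
    return g_train @ g_val / (np.linalg.norm(g_train) * np.linalg.norm(g_val))

sim = cosine_similarity([1.0, 0.0], [1.0, 1.0])   # 1/sqrt(2), ~0.7071
```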
In Figure 8, we plot the cosine similarity between these gradients on the synthetic TwoMoons dataset. Clearly, MPL increases this similarity more than supervised learning does. We cannot observe this phenomenon in experiments on large datasets, such as those in Section 4, because training on large datasets requires stochastic gradient updates on mini-batches of data, and stochastic gradients are poor estimates of the full training gradient that MPL tries to align with the validation gradient. In fact, we observe that gradients on training and validation mini-batches are almost uncorrelated, i.e., their cosine similarity is close to 0.

5.2 MPL is Not Label Correction
Since the teacher in MPL provides the target distributions for the student to learn from, and observes the student's performance to improve itself, it is intuitive to think that the teacher tries to guess the correct labels for the student. We empirically show that this is not the case. In Figure 9, we visualize the training accuracy of a purely supervised model, as well as of the teacher and the student models in MPL, on CIFAR-10 (4,000) and ImageNet-10%. These accuracies are obtained by taking the argmax of the models' predictions throughout training. As shown, the training accuracies of both the teacher and the student in MPL stay relatively low, while the training accuracy of the supervised model reaches 100% much earlier. If MPL were simply performing label correction, these accuracies should be high. Instead, we suspect that the teacher in MPL regularizes the student to prevent overfitting, which is the more appropriate behavior on small datasets like CIFAR-10 (4,000) and ImageNet-10%.
5.3 MPL is Not Only a Regularization Strategy
In contrast to Section 5.2, one could think that MPL merely injects noise into the student's learning to avoid overfitting. Here, we also reject this hypothesis. There are two ways for the teacher to inject noise into the student's learning: by flipping the target class, e.g., telling the student that an image of a car is an image of a horse, or by dampening the target distribution. We empirically show that MPL's teacher follows neither pattern. In Figure 10, we visualize a few target distributions that a teacher in Reduced MPL predicts for images from the TinyImages dataset. We observe two trends. First, the highest-confidence label for each image does not change across snapshots taken at each quarter of the student's training process, meaning the teacher does not flip the target labels. Second, the target distributions the teacher predicts become sharper between 50% and 75% of training. Since the student is still learning during this time, if the teacher simply wanted to regularize the student, it should instead dampen the distributions. Thus, we suspect that MPL is more than a regularization method.
6 Related Work
Synthetic Gradients.
By letting the teacher generate the target distribution for the student model to learn, MPL effectively lets the teacher determine the student's gradients. Learning the gradients belongs to a line of work on synthetic gradients (learning_to_learn_sgd). There are two major differences between MPL and synthetic gradients. First, MPL's gradient is restricted to a more specific subspace: the gradient in MPL is computed from a cross-entropy loss, while synthetic gradients are computed from intermediate representations of the student model, which have a much larger range of values. We suspect this restriction lets MPL's teacher provide more accurate gradients to the student model. Second, most work on synthetic gradients learns them by regressing against the correct gradient, while MPL meta-learns the teacher to generate the student's gradients. An exception is “Learning Unsupervised Updates” (learning_unsup_rules), where the synthetic gradient is meta-learned via an explicit outer loop. Unlike MPL, the explicit outer loop of learning_unsup_rules makes their training prohibitively expensive to scale to large datasets and large models.
Meta Learning.
MPL shares the same goal with meta learning, i.e., to establish a positive bias that benefits the learning process of a sub-model (signature_and_siamese; siamese_net; mann_net; maml). In MPL, this “bias” manifests via the target distributions of the training data for the student model. Similar to other meta-learning algorithms, MPL leverages the Jacobian-vector product (vjp) to compute the “gradient of a gradient” for MPL's teacher model (Equation 2, Section 3.1).
Semisupervised Learning (SSL).
Loosely speaking, SSL methods aim to utilize both labeled and unlabeled data to train a model. As shown in our experiments (Section 4), MPL makes use of both. Self-training and label propagation, which we discussed in detail in Section 2, are SSL algorithms that assign class distributions to unlabeled data to extend the training dataset; in this sense, MPL is an SSL algorithm. However, a significant difference between MPL and other SSL methods is that our teacher model receives learning signals from the student's performance, and hence can adapt to the student's learning state throughout the course of the student's training. We presented this motivation in Section 2 and empirically justified its benefit in Section 4.
7 Conclusion
In this paper, we proposed Meta Pseudo Labels (MPL). Key to MPL is the idea that a teacher model can dynamically set the target distributions of the training data to improve the student's learning. Experiments on CIFAR-10, SVHN, and ImageNet show that MPL significantly improves over its corresponding baselines. Currently, MPL is too memory-intensive for us to experiment with large models and large datasets directly. However, we also proposed Reduced MPL, which significantly reduces MPL's memory footprint, allowing us to verify the benefit of MPL's key idea in large-scale experiments. As computational hardware rapidly develops, we believe MPL will achieve even better results.
References
Appendix A Meta Learning Problem
We formally state the meta-learning problem mentioned in Section 3:

min_ψ CE(y_val, p_θ*(ψ)(x_val))    [Outer loop]
s.t.  θ*(ψ) = argmin_θ CE(q_ψ(x), p_θ(x))    [Inner loop]

We note that we do not directly solve this meta-learning problem, as the inner loop is prohibitively expensive to repeat multiple times in order to train ψ with gradient-based updates. Instead, MPL uses a step-wise strategy to alternately update ψ and θ.
Appendix B Generalized Update Rules of the Teacher
We demonstrate how to generalize the update rules of MPL to other training algorithms, such as Momentum (nesterov) or RMSprop (rms_prop). First, we revisit the teacher's MPL objective from Equation 2, which we rewrite below:

min_ψ CE(y_val, p_{θ − η ∇_θ CE(q_ψ(x), p_θ(x))}(x_val))    (4)

The dependency of the objective on ψ is through the student's gradient, namely the inner gradient term in the equation. Let us define:

g(ψ) = ∇_θ CE(q_ψ(x), p_θ(x))    (5)

Then, Equation 4 can be rewritten as:

min_ψ CE(y_val, p_{θ − η g(ψ)}(x_val))    (6)

This view allows us to generalize the computation of ∇_ψ to arbitrary update rules by replacing g(ψ) with a general update direction u(ψ). For example, for the Momentum update, we can simply set:

u(ψ) = μ m + g(ψ)    (7)

where μ is the momentum constant and m is the momentum vector, which does not depend on ψ. Similarly, for RMSprop, we can set:

u(ψ) = μ m + g(ψ) / √(v + ε)    (8)

where μ and ρ are the momentum constant and the RMS decay rate (ρ enters through the update of v), m is the momentum vector, and v is the moving average of squared gradients. Both m and v are treated as constants that do not depend on ψ.
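These update directions can be sketched as plain functions of g. The RMSprop form below is one common variant that treats m and v as externally maintained constants, matching the assumption above; it is not necessarily the paper's exact implementation.

```python
import numpy as np

def u_sgd(g):
    """Plain SGD: the update direction is the gradient itself."""
    return g

def u_momentum(g, m, mu=0.9):
    """Momentum: blend the (psi-independent) momentum vector m with g."""
    return mu * m + g

def u_rmsprop(g, m, v, mu=0.9, eps=1e-8):
    """RMSprop with momentum: scale g by the root of the moving average v of
    squared gradients; m and v are maintained outside and treated as constants."""
    return mu * m + g / np.sqrt(v + eps)

# the student's inner update is then: theta_new = theta - eta * u(psi)
g = np.array([1.0, -2.0])
direction = u_rmsprop(g, m=np.zeros(2), v=np.ones(2))
```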
In practice, to implement MPL, we create a shadow model of the student, whose variables are set to the updated values θ − η u(ψ). We compute the gradients with respect to the shadow variables of this shadow model, and then further backpropagate these gradients to ψ.
Appendix C Experiment Details
All our experiments are run on Tensor Processing Units, using slices of size 4x4, 8x8, or 16x16, depending on the experiment.
C.1 Details for Experiments in Section 4.1
Dataset Splits.
For CIFAR-10 and SVHN, we download the datasets from their official websites, load them into numpy arrays, and then select the first 4,000 and 1,000 examples, respectively. For ImageNet, we use the dataset shards preprocessed by inception, which comprise 1,024 shards; we take the first 102 shards, corresponding to 10% of all labeled data. This procedure leads to a slightly imbalanced class distribution, e.g., there are not exactly 400 images for each class of CIFAR-10 and not exactly 100 images for each class of SVHN. This is not our focus, and we use the same split for all controlled experiments, i.e., our baselines and our method MPL. The image resolutions are 32x32 for CIFAR-10 and SVHN, and 224x224 for ImageNet.
Training Details.
Both the teacher and the student models are trained with Nesterov momentum (nesterov). We use a cosine learning-rate schedule, starting at a particular value and decaying to 0; the starting learning rate is a hyperparameter. We also apply Dropout (dropout) at the predictions of both the teacher and the student. This means that when the teacher sets the target distribution for the student to learn, there is stochastic regularization.

Hyperparameter Tuning.
To select hyperparameters, we reserve 400 labeled examples from the 4,000 labeled examples of CIFAR-10 and about 12,800 labeled examples from the 10% of labeled examples in ImageNet. For SVHN, since 1,000 examples are too few, we do not tune hyperparameters; instead, we reuse the hyperparameters found on CIFAR-10. We tune hyperparameters using a contextual bandit optimizer, implemented by vizier, allotting a fixed budget of trials for CIFAR-10/SVHN and for ImageNet. During hyperparameter tuning, each trial runs for only 100,000 steps. Our tuning procedure is incremental: for example, we first tune hyperparameters for training a supervised model; then, when we tune for MPL, we use the found supervised hyperparameters for the student and only tune the teacher's hyperparameters. The optimal hyperparameters are presented in Table 3.
        | Hyperparameter             | CIFAR-10  | SVHN      | ImageNet
--------|----------------------------|-----------|-----------|----------
Common  | Weight decay               | 0.0005    | 0.0005    | 0.0002
        | Label smoothing            | 0.0       | 0.0       | 0.1
        | Batch normalization decay  | 0.99      | 0.99      | 0.99
        | Number of training steps   | 1,000,000 | 1,000,000 | 500,000
        | Number of warm-up steps    | 2,000     | 2,000     | 1,000
Student | Learning rate              | 0.3       | 0.15      | 0.8
        | Batch size                 | 128       | 128       | 2048
        | Dropout rate               | 0.35      | 0.45      | 0.1
Teacher | Learning rate              | 0.125     | 0.05      | 0.5
        | Batch size                 | 128       | 128       | 2048
        | Dropout rate               | 0.5       | 0.65      | 0.1
UDA     | UDA factor                 | 1.0       | 2.5       | 16.0
        | UDA temperature            | 0.8       | 1.25      | 0.75

Table 3: Optimal hyperparameters for the experiments in Section 4.1.
C.2 Details for Experiments in Section 4.2
Training Details.
Since our student models are EfficientNets, namely B0 for CIFAR-10 and SVHN and B7 for ImageNet, we simply use their corresponding hyperparameters from efficient_net. Note that this means our student models are updated with RMSprop, which necessitates the generalized update rules described in Appendix B.
Our teacher model is a 5-layer multi-layer perceptron with ReLU activations. It takes as input the probability distribution predicted by our pre-trained model and returns a calibrated target distribution for the student to learn. We use a hidden size of 128 for CIFAR-10 and SVHN, and of 512 for ImageNet. The teacher's parameters are updated with Adam (adam), using a learning rate of 0.0001. We do not need to tune this learning rate: we only try log-range values, namely 0.1, 0.01, 0.001, and 0.0001, and use the largest learning rate that does not cause the teacher to produce NaN values, which is 0.0001. We also apply L2 regularization to the teacher.