However, obtaining full labels for a huge dataset is usually very costly. Hence, more attention is now drawn to deep semi-supervised learning (SSL). To utilize unlabeled data, many methods in traditional machine learning have been proposed, and some of them have been successfully adapted to deep learning. In addition, recent techniques such as self-training and Generative Adversarial Networks (GANs) [31, 23, 20] have been utilized for deep SSL with promising results. A primary track of recent deep semi-supervised methods [28, 17, 7, 33] can be summarized as consistency-based methods. In these methods, two roles are commonly created, either explicitly or implicitly: a teacher model and a student model (i.e., a Teacher-Student structure). The teacher guides the student to approximate its performance under perturbations. The perturbations may come from the noise of the input or the dropout layers, etc. A consistency constraint is then imposed on the predictions of the two roles, forcing the unlabeled data to meet the smoothness assumption of semi-supervised learning.
The teacher in the Teacher-Student structure can generally be regarded as an exponential moving average (EMA) of the student model. In the VAT Model and the Π Model, the teacher shares the same weights as the student, which is equivalent to setting the averaging coefficient to zero. The Temporal Model is similar to the Π Model except that it also applies an EMA to accumulate the historical predictions. Mean Teacher applies an EMA to the student weights to obtain an ensemble teacher. In this work, we show that the two roles in the Teacher-Student structure are tightly coupled, and the degree of coupling increases as training goes on. This phenomenon leads to a performance bottleneck, since a coupled EMA teacher is not sufficient to guide the student.
To overcome this problem, knowledge coming from another independent model should help. Motivated by this observation, we replace the EMA teacher with another student model. The two students start from different initial states and are optimized through individual paths during training. Hence, their weights will not be tightly coupled, and each learns its own knowledge. What remains unclear is how to extract and exchange knowledge between the students. Naively adding a consistency constraint may lead to the two models collapsing into each other. Thus, we define the stable sample and propose a stabilization constraint for effective knowledge exchange. Our method improves the performance significantly on several main SSL benchmarks. Fig. 1 illustrates the Teacher-Student structure and our Dual Student structure.
In summary, the main contributions of this work include:
We demonstrate that the coupled EMA teacher causes a performance bottleneck of the existing Teacher-Student methods.
We define the stable samples of a model and propose a novel stabilization constraint between models.
We propose a new SSL structure, Dual Student, and discuss two variants of Dual Student with higher performance.
Extensive experiments are conducted to evaluate the performance of our method on several benchmarks and in different tasks.
2 Related Work
Consistency-based SSL methods are derived from network noise regularization. Goodfellow et al. first showed the advantage of adversarial noise over random noise. Miyato et al. further explored this idea for unlabeled data and generated virtual adversarial samples for the implicit teacher, while Park et al. proposed a virtual adversarial dropout based on it. In addition to noise, the quality of the targets for the consistency constraint is also vital. Bachman et al. and Rasmus et al. showed the effectiveness of regularizing the targets. Laine et al. then proposed the internally consistent Π Model and Temporal Model. Tarvainen et al. took advantage of averaging model weights to obtain an explicit ensemble teacher for generating targets. Some works derived from traditional methods also improve consistency-based SSL. Smooth Neighbor by Luo et al. utilized the connection between data points and built a neighbor graph to cluster data more tightly. Athiwaratkun et al. modified stochastic weight averaging (SWA) to obtain a stronger ensemble teacher faster. Qiao et al. proposed Deep Co-Training by adding a consistency constraint between independent models.
2.2 Teacher-Student Structure
The most common structure of recent SSL methods is the Teacher-Student structure. It applies a consistency constraint between a teacher model and a student model to learn knowledge from unlabeled data. Formally, we assume that a dataset consists of an unlabeled subset and a labeled subset. Let θ′ denote the weights of the teacher and θ denote the weights of the student. The consistency constraint is defined as:

L_con = E_x [ D( f(θ, x, ζ), t̃ ) ],

where f(θ, x, ζ) is the prediction from model θ for input x with noise ζ, and t̃ is the consistency target from the teacher. D measures the distance between two vectors and is usually set to the mean squared error (MSE) or the KL-divergence. Previous works have proposed several ways to generate t̃.
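As a concrete illustration (not the authors' code), the consistency constraint can be sketched as an MSE between softmax predictions under two independent noise draws; the toy linear `predict` model and all names below are stand-ins for a real network forward pass f(θ, x, ζ):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def predict(weights, x, noise_scale=0.1):
    # Toy stand-in for a network forward pass with input noise zeta.
    zeta = rng.normal(0.0, noise_scale, size=x.shape)
    return softmax((x + zeta) @ weights)

def consistency_loss(student_w, teacher_w, x):
    # MSE between the student prediction and the teacher's consistency target;
    # each call draws its own noise, mimicking independent perturbations.
    target = predict(teacher_w, x)   # consistency target from the teacher
    pred = predict(student_w, x)     # student prediction under its own noise
    return float(np.mean((pred - target) ** 2))

x = rng.normal(size=(4, 8))          # a small batch of inputs
w = rng.normal(size=(8, 3))          # toy classifier weights
loss = consistency_loss(w, w.copy(), x)  # identical weights: residual loss comes from noise only
```

With identical teacher and student weights (as in the implicit-teacher case), the loss is small but nonzero, driven purely by the noise.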
Π Model: In the Π Model, the implicit teacher shares parameters with the student. It forwards a sample twice with different random noise ζ and ζ′ in each iteration, and treats the prediction under ζ′ as the consistency target.
Temporal Model: While the Π Model needs to forward a sample twice in each iteration, the Temporal Model reduces this computational overhead by using an EMA to accumulate the predictions over epochs as the target. This approach reduces the prediction variance and stabilizes the training process.
Mean Teacher: The Temporal Model needs to store a record for each sample, and the target is updated only once per epoch while the student is updated multiple times. Hence, Mean Teacher defines an explicit teacher as an EMA of the student and updates its weights in each iteration before generating the target.
VAT Model: Although random noise is effective in previous methods, the VAT Model adopts adversarial noise to generate a better target for the consistency constraint.
2.3 Deep Co-Training
It is known that fusing knowledge from multiple models can improve performance in SSL. However, directly adding the consistency constraint between models results in the models collapsing into each other. Deep Co-Training addresses this issue by utilizing the Co-Training assumption from the traditional Co-Training algorithm. It treats the features from the convolutional layers as a view of the input and uses the adversarial samples from the other collaborators to ensure that view differences exist among the models. Consistent predictions can then be used for training. However, this strategy requires generating adversarial samples for each model throughout the whole process, which is complicated and time-consuming.
Our method also has interactions between models to break the limits of the EMA teacher, but there are two major differences between our method and Deep Co-Training. First, instead of enforcing the consistency constraint and the different-views constraint, we only extract the reliable knowledge of the models and exchange it through a more effective stabilization constraint. Second, our method is more efficient since we do not need adversarial samples.
3 Limits of the EMA Teacher
One fundamental assumption in SSL is the smoothness assumption: if two data points in a high-density region are close, then so should be the corresponding outputs. All existing Teacher-Student methods utilize unlabeled data according to this assumption. In practice, if two inputs are generated from a sample with different small perturbations, they should have consistent predictions by the corresponding teacher and student. Previous methods achieve this through the consistency constraint and have mainly focused on generating more meaningful targets through ensembling or well-designed noise.
However, previous works neglect that the teacher is essentially an EMA of the student. Hence, their weights are tightly coupled. Formally, the teacher weights θ′_t at training step t are an ensemble of the student weights over successive training steps with a smoothing coefficient α:

θ′_t = α θ′_{t−1} + (1 − α) θ_t.

In the Π Model and the VAT Model, as α is set to zero, θ′_t is equal to θ_t. The Temporal Model improves the Π Model by an EMA on historical predictions, but its teacher still shares weights with the student. As for Mean Teacher, the update of the student weights decreases as the model converges, i.e., |θ_t − θ_{t−1}| becomes smaller and smaller as the number of training steps increases. Theoretically, it can be proved that the EMA of a converging sequence converges to the same limit as the sequence, as shown in Appendix A (Supplementary). Thus, the teacher will be very close to the student when the training process converges. In all the above cases, the coupling between the teacher and the student is obvious.
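This convergence behavior can be checked numerically with a minimal sketch: a scalar student "weight" converging to a limit, an EMA teacher with an assumed smoothing coefficient α = 0.99, and the teacher-student distance shrinking over steps (illustrative values only):

```python
# EMA of a converging sequence converges to the same limit, so the
# teacher-student weight distance shrinks as training proceeds.
alpha = 0.99                   # assumed smoothing coefficient
teacher = 0.0                  # EMA teacher "weight"
limit = 1.0                    # the value the student converges to
distances = []
for t in range(1, 5001):
    student = limit - 1.0 / t                          # student converging to `limit`
    teacher = alpha * teacher + (1 - alpha) * student  # EMA update of the teacher
    distances.append(abs(teacher - student))           # coupling: distance -> 0
```

Early in training the teacher lags far behind the student, but the gap decays toward zero, matching the coupling effect described above.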
To further visualize this, we train two structures on the CIFAR-10 SSL benchmark. One contains a student and an EMA teacher, while the other contains two independent models. We then calculate the Euclidean distance of the weights and predictions between the two models in each structure. Fig. 2 shows the results. As expected, the EMA teacher in the first structure is very close to the student, and their distance approaches zero with increasing epochs. In contrast, the two models in the second structure always keep a larger distance from each other. These results confirm our conjecture that the EMA teacher is tightly coupled with the student. In addition, they also demonstrate that the two independent models are loosely coupled.
Due to the coupling effect between the two roles in the existing Teacher-Student methods, the teacher does not have more meaningful knowledge than the student. In addition, if the student has biased predictions for specific samples, the EMA teacher is likely to accumulate the mistakes and enforce the student to follow, making the misclassification irreversible. This is a case of confirmation bias. Most methods apply a ramp-up operation to the consistency constraint to alleviate the bias, but it is inadequate to solve the problem. From this perspective, training independent models is also beneficial. Fig. 3 visualizes this inability of the EMA teacher. Three models are trained on a two-category task simultaneously: a student from Mean Teacher, and two relatively independent but interactive models representing the two students from our Dual Student structure (Section 4). The two Dual Student models share the same initialization, while the Mean Teacher student is initialized differently. The plot shows how the predictions of a sample from class 1 change with epochs for these three models, demonstrating that our method can alleviate the confirmation bias.
4 Dual Student
As analyzed above, the targets from an EMA teacher are not adequate to guide the student when the number of training steps is large. Therefore, our method gains loosely coupled targets by training two independent models simultaneously. However, the outputs of these two models may vary widely, and applying the consistency constraint directly will cause them to collapse into each other by exchanging wrong knowledge. The EMA teacher does not suffer from this problem due to the coupling effect.
We propose an efficient way to overcome this problem, which is to exchange only the reliable knowledge of the models. To put this idea into practice, we need to solve two problems. One is how to define and acquire the reliable knowledge of a model. The other is how to exchange the knowledge mutually. To address them, we define the stable sample in Section 4.1 and then elaborate the derived stabilization constraint for training in Section 4.2.
4.1 Stable Sample
A model can be regarded as a decision function that makes reliable predictions for some samples but not for the others. We define the stable sample and treat it as the reliable knowledge of a model. A stable sample satisfies two conditions. First, according to the smoothness assumption, a small perturbation should not affect the prediction of this sample, i.e., the model should be smooth in the neighborhood of this sample. Second, the prediction of this sample should be far from the decision boundary, which means that this sample has a high probability for the predicted label.
Definition 4.1 (Stable sample).
Given a constant ξ, a dataset 𝒟 that satisfies the smoothness assumption, and a model f, a sample x ∈ 𝒟 is a stable sample with respect to f if:

for any x′ near x, the predicted labels of x and x′ are the same;

x satisfies the inequality max f(x) > ξ.
Def. 4.1 defines the stable sample, and Fig. 5 illustrates its conditions in detail. Notice that the concept of the stable sample is specific to a model. A data point can be stable with respect to one model but not to the others. This fact is key to our stabilization constraint and will be elaborated in Section 4.2. In addition to the criterion of whether a sample is stable or not, we would also like to know the degree of stability of a stable sample. This can be reflected by the prediction consistency in its neighborhood: the more consistent the predictions are, the more stable the sample is.
4.2 Training by the Stabilization Constraint
We briefly introduce the Dual Student structure before explaining the training details. It contains two independent student models, which share the same network architecture but have different initial states and are updated separately (Fig. 4). For our structure to be trainable, we derive a novel stabilization constraint from the stable sample.
In practice, we only utilize two close samples to approximate the conditions of the stable sample, in order to reduce the computational overhead. Formally, we use θᵃ and θᵇ to represent the weights of the two students. We first define a boolean function that outputs 1 when its condition is true and 0 otherwise. Suppose x̃ is a noisy augmentation of a sample x. We then check whether x is a stable sample for student a:
Here, the predicted labels of x and x̃ are given by student a, and ξ is a confidence threshold in (0, 1). If the maximum prediction probability of sample x exceeds ξ, x is considered to be far enough from the classification boundary. We then use the Euclidean distance between the predictions of x and x̃ to measure the prediction consistency, which indicates the stability of x:
A smaller distance means that x is more stable for student a. The distance between the predictions of students a and b can be measured using the mean squared error (MSE) as:
Finally, the stabilization constraint for the student on sample is written as:
We calculate the stabilization constraint for the other student in the same way. As we can see, the stabilization constraint changes dynamically depending on the outputs of the two students. There are three cases: (1) No constraint is applied if x is unstable for both students. (2) If x is stable for only one student, that student guides the other. (3) If x is stable for both students, the stability of each is calculated, and the constraint is applied from the more stable one to the other.
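The three cases can be sketched as follows; this is a simplified NumPy illustration with toy two-class predictions and a hypothetical threshold name `xi`, not the authors' implementation:

```python
import numpy as np

def is_stable(p, p_noisy, xi):
    # Condition 1: same predicted label for the sample and its noisy version.
    # Condition 2: the prediction is confident, i.e. far from the boundary.
    # (Whether the confidence check uses one or both predictions is an
    # assumption of this sketch.)
    same_label = int(np.argmax(p) == np.argmax(p_noisy))
    confident = int(max(p.max(), p_noisy.max()) > xi)
    return same_label * confident  # boolean function as a 0/1 value

def stability(p, p_noisy):
    # Squared Euclidean distance between the two predictions: smaller
    # means the sample is more stable for this student.
    return float(np.sum((p - p_noisy) ** 2))

def stabilization_loss(pa, pa_noisy, pb, pb_noisy, xi=0.8):
    """Stabilization loss applied to student a, given both students' predictions."""
    mse_ab = float(np.mean((pa - pb) ** 2))  # distance between the two students
    ra = is_stable(pa, pa_noisy, xi)
    rb = is_stable(pb, pb_noisy, xi)
    if not ra and not rb:
        return 0.0         # case 1: unstable for both students -> no constraint
    if rb and not ra:
        return mse_ab      # case 2: stable only for b -> b guides a
    if ra and not rb:
        return 0.0         # case 2 (mirror): stable only for a -> a guides b instead
    # Case 3: stable for both -> the more stable student guides the other.
    return mse_ab if stability(pb, pb_noisy) < stability(pa, pa_noisy) else 0.0

pa = np.array([0.95, 0.05]); pa_n = np.array([0.90, 0.10])  # stable for student a
pb = np.array([0.60, 0.40]); pb_n = np.array([0.45, 0.55])  # label flips: unstable for b
loss_a = stabilization_loss(pa, pa_n, pb, pb_n)  # 0.0: a is the stable one, so a guides b
loss_b = stabilization_loss(pb, pb_n, pa, pa_n)  # > 0: b is unstable, so a's prediction guides b
```

Swapping the argument order yields the loss for the other student, matching the bidirectional-overall, unidirectional-per-sample behavior.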
Following previous works, our Dual Student structure also imposes the consistency constraint within each student to meet the smoothness assumption. We also apply the decoupled top layers trick from Mean Teacher, which splits the constraints for classification and smoothness.
To train Dual Student, the final constraint for each student is a combination of three parts: the classification constraint, the consistency constraint within each model, and the stabilization constraint between models, as:
where the two coefficients are hyperparameters to balance the constraints. Algorithm 1 summarizes the optimization process.
4.3 Variants of Dual Student
Here, we briefly discuss two variants of Dual Student, named Multiple Student and Imbalanced Student. Both of them achieve higher performance than the standard Dual Student. They do not increase the inference time, even though more computation is required during training.
Multiple Student: Our Dual Student can easily be extended to Multiple Student. We follow the same strategy as Deep Co-Training and assume that Multiple Student contains an even number of student models. At each iteration, we randomly divide these students into pairs, and each pair is then updated like Dual Student. Since our method does not require the models to have view differences, the data stream can be shared among the students. This is different from Deep Co-Training, which requires an exclusive data stream for each pair. In practice, four students achieve a notable improvement over two students. However, having more than four students does not further improve the performance, as demonstrated in Section 5.2.
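The random pairing step can be sketched as below; `pair_students` is a hypothetical helper that operates on student ids rather than real models:

```python
import random

def pair_students(num_students, rng=None):
    # Randomly divide an even number of students (represented by ids) into
    # pairs; each pair is then updated like a standard Dual Student on the
    # shared data stream. Illustrative only, not the authors' code.
    if rng is None:
        rng = random.Random(0)  # fixed seed for reproducibility of the sketch
    assert num_students % 2 == 0, "pairing needs an even number of students"
    ids = list(range(num_students))
    rng.shuffle(ids)
    return [(ids[i], ids[i + 1]) for i in range(0, num_students, 2)]

pairs = pair_students(4)  # e.g. two pairs, redrawn at each iteration
```

Because the pairing is redrawn every iteration, each student eventually exchanges knowledge with every other student.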
Imbalanced Student: Since a well-designed architecture with more parameters usually has better performance, a pre-trained high-performance teacher can be used to improve a light-weight student in knowledge distillation tasks [9, 10]. Based on the same idea, we extend Dual Student to Imbalanced Student by enhancing the capability of one student. However, we do not treat the sophisticated model as a teacher, since the knowledge is exchanged mutually. We find that the improvement of the weak student is proportional to the capability of the strong student.
5 Experiments
We first evaluate Dual Student on several common SSL benchmarks, including CIFAR, SVHN, and ImageNet. We then evaluate the performance of the two variants of Dual Student. We further analyze various aspects of our method through ablation experiments. Finally, we demonstrate the application of Dual Student in a domain adaptation task.
Unless specified otherwise, the architecture used in our experiments is the same 13-layer convolutional neural network (CNN) as in previous works [17, 21, 33]. Its details are described in Appendix B (Supplementary). As reported before, the implementations of recent SSL methods are not exactly the same, and the training details (e.g., the number of training epochs, the optimizer, and the augmentation) may also differ. For a fair comparison, we implement our method following the previous state-of-the-art, which uses the standard Batch Norm instead of the mean-only Batch Norm. The stochastic gradient descent optimizer is adopted with a learning rate adjustment function of the current training step, the total number of steps, and the initial learning rate. These settings provide better baselines for the Π Model and Mean Teacher. For other methods, we use the results from the original papers. More training details are provided in Appendix C (Supplementary).
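As a hedged illustration of such a schedule, a common choice is a cosine ramp-down from the initial rate toward zero; the paper's exact adjustment function is not reproduced here, so the formula below is an assumption:

```python
import math

def learning_rate(step, total_steps, initial_lr):
    # Hypothetical cosine ramp-down: decays from initial_lr at step 0 to 0
    # at total_steps. The paper's actual schedule may differ.
    return initial_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

lr_start = learning_rate(0, 1000, 0.1)   # full initial rate at the first step
lr_mid = learning_rate(500, 1000, 0.1)   # decayed halfway through training
```

Any monotone ramp-down with the same three inputs (current step, total steps, initial rate) fits the description in the text.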
5.1 SSL Benchmarks
We first evaluate Dual Student on the CIFAR benchmark, including CIFAR-10 and CIFAR-100. CIFAR-10 has 50k training samples and 10k testing samples from 10 categories. Each sample is an RGB image. We randomly extract 1k, 2k, and 4k balanced labels. CIFAR-100 is a more complex dataset including 100 categories. Each category contains only 500 training samples, together with 100 test samples. We randomly extract 10k balanced labels from it. Besides, we also run experiments with full labels on both datasets. We compare our Dual Student (DS) with some recent consistency-based models, including the Π Model, the Temporal Model (Temp), Mean Teacher (MT), Smooth Neighbor (SN), FastSWA based on Mean Teacher (MT+FSWA), and Deep Co-Training (Deep CT). We also replace the stabilization constraint in our structure with the consistency constraint (CS) as a baseline.
[Table: results of Π+FSWA, MT+FSWA, MT+FSWA (1200), and Deep CT (600) under 10k labels and all labels; numeric values not recovered.]
Table 1 shows the results on CIFAR-10. All models are trained for 300 epochs, except for those specified with parentheses. Some results are obtained from other works that published better performance than the original ones. We can see that our Dual Student boosts the performance in all semi-supervised settings. The results reveal that as the number of labeled samples decreases, our method gains more significant improvements. Specifically, Dual Student improves the result with 1k labels using only half the training epochs of FastSWA. Similar results can also be observed in the experiments with 2k and 4k labels. Fig. 6 shows that the accuracy on only the stable samples is higher than that on all samples, which proves that the stable samples represent the relatively more reliable knowledge of a model. This justifies why our DS with the stabilization constraint achieves much better results than the CS baseline. Our result with full labels shows less advantage, since the labels play a much more important role in the fully supervised case. Table 2 lists the results on CIFAR-100. In particular, in the 10k-label experiments, Dual Student sets a new state-of-the-art with fewer training epochs than FastSWA and Deep Co-Training.
[Table 3: model results on SVHN with 250 and 500 labels. Table 4: model results on ImageNet with 10% labels (top-1 and top-5); numeric values not recovered.]
To evaluate the generalization ability of Dual Student, we also conduct experiments on SVHN and ImageNet. Street View House Numbers (SVHN) is a dataset containing 73,257 training samples and 26,032 testing samples. Each sample is an RGB image with a center close-up of a house number. We only experiment with 250 and 500 labels on SVHN. ImageNet contains more than 10 million RGB images belonging to 1k categories. We extract 10% balanced labels and train a 50-layer ResNeXt model. Tables 3 and 4 show that Dual Student improves the results on these datasets of various scales.
[Table 5: results of MS with 4 and 8 models, and IS with 3.53M and 11.6M parameters; numeric values not recovered.]
5.2 Performance of Variants
We evaluate Multiple Student and Imbalanced Student on the CIFAR benchmark. Table 5 compares them with the standard Dual Student, all using the same 13-layer CNN trained for 300 epochs. For Multiple Student (MS), we train both four students and eight students. The performance improvement is limited when more than four students are trained simultaneously. For Imbalanced Student (IS), we replace one student with a ResNet with Shake-Shake regularization. We then conduct the experiments on two different model sizes: a small one with 3.53 million parameters and a large one with 11.65 million parameters. The small ResNet incurs almost no increase in computational cost, as its number of parameters is similar to that of the 13-layer CNN (3.13 million parameters). Imbalanced Student achieves a significant performance improvement by distilling knowledge from a more powerful student. Notably, the large ResNet improves the result from 15.74% to 12.39% on CIFAR-10 with 1k labels.
Our structure can also be easily combined with existing methods to further improve the performance. We replace the consistency constraint inside each model with Mean Teacher. Fig. 7 (left) shows the accuracy curves. The obvious performance improvement shows the ability of Dual Student to break the limits of the EMA teacher. The accuracy of the combination is similar to that of using Dual Student only, which means that our method is insensitive to the type of consistency constraint inside each model.
5.3 Ablation Experiments
We conduct ablation experiments on CIFAR-10 with 1k labels to analyze the impact of the confidence threshold and the various constraints in our structure.
Confidence threshold: The confidence threshold ξ controls the second condition in Def. 4.1 of the stable sample by filtering out samples near the boundary. Its value can be set approximately, since our method is robust to it. Typically, ξ is related to the complexity of the task, e.g., the number of categories to predict or the size of the given dataset. More categories or a smaller dataset would require a smaller ξ. Table 6 compares different values on the CIFAR benchmark. The results show that the threshold is necessary for better performance, while meticulous tuning may only improve the performance slightly.
Effect of the constraints: Dual Student learns from the unlabeled data through both the stabilization constraint between models and the consistency constraint inside each model. We also study their individual impacts. Besides, we compare the results with an experiment where only the consistency constraint is applied between models. Fig. 7 (right) shows that the consistency constraint between models reduces the accuracy in the late stage, while the stabilization constraint helps improve the performance continuously. This demonstrates that our stabilization constraint is better than the naive consistency constraint between models. In addition, the consistency constraint inside each model also plays a role in boosting the performance further.
5.4 Domain Adaptation
Domain adaptation aims to transfer knowledge learned from a labeled dataset to an unlabeled one. French et al. modified Mean Teacher and the Temporal Model to enable domain adaptation and showed the effectiveness of the Teacher-Student structure. In this section, we apply Dual Student to adapting a digit recognition model from USPS to MNIST and show that it can be applied to this kind of task with great advantages over the EMA-teacher-based methods.
Both USPS and MNIST are datasets of greyscale handwritten digits. USPS consists of 7,000 images of 16×16 pixels, and MNIST contains 60,000 images of 28×28 pixels. To match the image resolution, we resize all images from USPS to 28×28 by cubic spline interpolation. Fig. 8 shows the domain difference between the two datasets. In our experiments, we set USPS as the source domain and MNIST as the target domain. We compare our method with Mean Teacher, a source domain (USPS) supervised model, and a target domain (MNIST) supervised model (trained on 7k balanced labels). All experiments use a small architecture simplified from the above 13-layer CNN. More details are available in Appendix D (Supplementary).
Fig. 9 shows the test accuracy versus the number of epochs. We can see that naively using supervision from USPS results in overfitting. Mean Teacher avoids this to some extent and improves the top-1 accuracy from 69.09% to 80.41%, but it still overfits when the number of training epochs is large. Our Dual Student avoids overfitting and boosts the accuracy to 91.50%, which is much closer to the result obtained by supervision from the target domain.
In this paper, we have studied the coupling effect in the existing Teacher-Student methods and shown that it sets a performance bottleneck for the structure. We have proposed a new structure, Dual Student, to break the limits of the EMA teacher, together with a novel stabilization constraint, which provides an effective way to train independent models (either with the same architecture or not). The stabilization constraint is bidirectional overall but unidirectional for each stable sample. The improved performance is notable across datasets and tasks. Besides, we have also discussed two variants of Dual Student with even better results. However, our method still shares some limitations with existing methods, e.g., increased memory usage during training and diminishing improvement as the number of labels increases. In the future, we plan to address these issues and extend our structure to other applications.
Appendix A: Convergence of the EMA
In our paper, we state that the EMA teacher is coupled with the student in the existing Teacher-Student methods. We provide below a formal proposition for this statement and a simple proof.
Given a sequence {θ_t} and let θ′_t = α θ′_{t−1} + (1 − α) θ_t, where α ∈ [0, 1) and θ′_0 is given. If θ_t converges to θ*, then θ′_t converges to θ* as well.
By the definition of convergence, if θ_t converges to θ*, we have: ∀ε > 0, ∃N such that ∀t > N, |θ_t − θ*| < ε. First, when t > N, by the formula of the sum of a finite geometric series, we rewrite θ′_t and θ* as:

θ′_t = α^t θ′_0 + (1 − α) Σ_{i=1}^{t} α^{t−i} θ_i,   θ* = α^t θ* + (1 − α) Σ_{i=1}^{t} α^{t−i} θ*.   (8)

Since N is finite, θ′_0 and θ_1, …, θ_N are bounded. Thus, ∃C > 0 such that:

α^t |θ′_0 − θ*| + (1 − α) Σ_{i=1}^{N} α^{t−i} |θ_i − θ*| ≤ C α^{t−N}.

Since 0 ≤ α < 1, we have α^{t−N} → 0 as t → ∞. Thus, ∃M > N such that ∀t > M, C α^{t−N} < ε. Then, after substituting Eq. 8 into |θ′_t − θ*| and applying the triangle inequality, we have:

|θ′_t − θ*| ≤ α^t |θ′_0 − θ*| + (1 − α) Σ_{i=1}^{N} α^{t−i} |θ_i − θ*| + (1 − α) Σ_{i=N+1}^{t} α^{t−i} |θ_i − θ*| ≤ C α^{t−N} + ε.

Then ∀t > M, we have:

|θ′_t − θ*| < 2ε, so θ′_t converges to θ*.
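The finite geometric-series unrolling used in this proof, θ′_t = α^t θ′_0 + (1 − α) Σ_{i=1}^{t} α^{t−i} θ_i, can be checked numerically against the EMA recursion (arbitrary illustrative values):

```python
alpha = 0.9
theta0_prime = 0.5
thetas = [0.3, 0.7, 0.2, 0.9, 0.4]   # arbitrary student weights theta_1..theta_t

# Recursive EMA definition: theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t.
rec = theta0_prime
for th in thetas:
    rec = alpha * rec + (1 - alpha) * th

# Closed form from unrolling the recursion (finite geometric-series expansion).
t = len(thetas)
closed = alpha ** t * theta0_prime + (1 - alpha) * sum(
    alpha ** (t - i) * thetas[i - 1] for i in range(1, t + 1)
)
```

The two values agree up to floating-point rounding, confirming the algebraic step.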
Appendix B: Model Architectures
The model architecture used in our CIFAR-10, CIFAR-100, and SVHN experiments is the 13-layer convolutional network (13-layer CNN), the same as in previous works [33, 17, 1, 19, 27]. We implement it following FastSWA for comparison. Table 7 describes its architecture in detail. For the ImageNet experiments, we use a 50-layer ResNeXt architecture, which includes 3+4+6+3 residual blocks and uses group convolution with 32 groups.
Appendix C: Semi-supervised Learning Setups
In our work, all experiments use the SGD optimizer with Nesterov momentum. The learning rate is adjusted by a function of the current training step, the total number of steps, and the initial learning rate. We present the settings of the experiments on each dataset as follows.
CIFAR-10: On CIFAR-10, we set the batch size to 100 and half of the samples in each batch are labeled. The initial learning rate is . The weight decay is . For the stabilization constraint, we set its coefficient and ramp it up in the first 5 epochs. We set . The confidence threshold for the stable samples is .
CIFAR-100: On CIFAR-100, each minibatch contains 128 samples, including 31 labeled samples. We set the initial learning rate to and the weight decay to . The confidence threshold is . Other hyperparameters are the same as CIFAR-10.
[Table 7: the 13-layer CNN — random translation and horizontal flip augmentation; convolution blocks with LReLU activations interleaved with max pooling, ending in average pooling; layer dimensions not recovered.]
[Table 8: the small domain-adaptation CNN — Gaussian noise augmentation; convolutions with LReLU, max pooling, and a final average pooling; layer dimensions not recovered.]
SVHN: The batch size on SVHN is 100, and each minibatch contains only 10 labeled samples. The initial learning rate is , and the weight decay is . The stabilization constraint is scaled by (ramp up in 5 epochs). We use the confidence threshold .
ImageNet: We validate our method on ImageNet with the ResNeXt-50 architecture on 8 GPUs, with half of each batch being labeled samples. Each sample is augmented following  and is resized to . We warm up the learning rate from  to  in the first epochs. The model is trained for  epochs with the weight decay set to , the stabilization constraint coefficient set to , and a small confidence threshold of .
Appendix D: Domain Adaptation Setups
We design a small convolutional network for the domain adaptation from USPS (source domain) to MNIST (target domain). The structure is shown in Table 8. We train all experiments for 100 epochs with the SGD optimizer, using Nesterov momentum and weight decay. The learning rate declines by a cosine adjustment. Each batch includes 256 samples, 32 of which are labeled. We randomly extract 7,000 balanced samples from MNIST for the target-supervised experiments; the other experiments use the training set of USPS. We ramp up the coefficient of the stabilization constraint in the first 5 epochs. We discover that input noise is vital for Mean Teacher but not for our method in this experiment.
- (2019) There are many consistent explanations of unlabeled data: why you should average. In Proc. ICLR.
- (2014) Learning with pseudo-ensembles. In Proc. NIPS.
- Combining labeled and unlabeled data with co-training. In Proc. Annual Conference on Computational Learning Theory.
- (2006) Semi-supervised learning. The MIT Press.
- (2018) Tri-net for semi-supervised deep learning. In Proc. IJCAI.
- (2018) Self-ensembling for domain adaptation. In Proc. ICLR.
- (2015) Explaining and harnessing adversarial examples. In Proc. ICLR.
- (2016) Deep residual learning for image recognition. In Proc. CVPR.
- (2014) Distilling the knowledge in a neural network. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
- (2017) Efficient knowledge distillation from an ensemble of teachers. In Annual Conference of the International Speech Communication Association.
- (2017) Squeeze-and-excitation networks. arXiv:1709.01507.
- (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proc. ICML.
- (2017) Batch renormalization: towards reducing minibatch dependence in batch-normalized models. In Proc. NIPS.
- (2018) Averaging weights leads to wider optima and better generalization. In Proc. UAI.
- () CIFAR-10 (Canadian Institute for Advanced Research).
- () CIFAR-100 (Canadian Institute for Advanced Research).
- (2017) Temporal ensembling for semi-supervised learning. In Proc. ICLR.
- (2017) WebVision database: visual learning and understanding from web data.
- (2018) Smooth neighbors on teacher graphs for semi-supervised learning. In Proc. CVPR.
- (2016) Auxiliary deep generative models. In Proc. ICML.
- (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE TPAMI.
- (2011) Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
- (2016) Semi-supervised learning with generative adversarial networks. In Data Efficient Machine Learning Workshop at ICML.
- (2018) Realistic evaluation of semi-supervised learning algorithms. In Proc. NeurIPS.
- (2018) Adversarial dropout for supervised and semi-supervised learning. In Proc. AAAI.
- (1992) Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization.
- (2018) Deep co-training for semi-supervised image recognition. In Proc. ECCV.
- (2015) Semi-supervised learning with ladder networks. In Proc. NIPS.
- (2015) ImageNet large scale visual recognition challenge. IJCV.
- (1991) Creating artificial neural networks that generalize. Neural Networks.
- (2015) Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv:1511.06390.
- (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research.
- (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Proc. NIPS.
- (2017) Aggregated residual transformations for deep neural networks. In Proc. CVPR.
- (2011) When semi-supervised learning meets ensemble learning. In Frontiers of Electrical and Electronic Engineering in China.
- (2006) Semi-supervised learning literature survey. TR 1530, University of Wisconsin, Madison.