Dual Student: Breaking the Limits of the Teacher in Semi-supervised Learning

09/03/2019 ∙ by Zhanghan Ke, et al.

Recently, consistency-based methods have achieved state-of-the-art results in semi-supervised learning (SSL). These methods always involve two roles, an explicit or implicit teacher model and a student model, and penalize predictions under different perturbations by a consistency constraint. However, the weights of these two roles are tightly coupled, since the teacher is essentially an exponential moving average (EMA) of the student. In this work, we show that the coupled EMA teacher causes a performance bottleneck. To address this problem, we introduce Dual Student, which replaces the teacher with another student. We also define a novel concept, the stable sample, based on which a stabilization constraint is designed to make our structure trainable. Further, we discuss two variants of our method, which produce even higher performance. Extensive experiments show that our method improves the classification performance significantly on several main SSL benchmarks. Specifically, it reduces the error rate of the 13-layer CNN from 16.84% to 12.39% on CIFAR-10 with 1k labels, and also lowers the error rate on CIFAR-100 with 10k labels. In addition, our method achieves a clear improvement in domain adaptation.


1 Introduction

Deep supervised learning has gained significant success in computer vision tasks, which leads the community to challenge larger and more complicated datasets like ImageNet [29] and WebVision [18]. However, obtaining full labels for a huge dataset is usually a very costly task. Hence, more attention is now drawn to deep semi-supervised learning (SSL). In order to utilize unlabeled data, many methods in traditional machine learning have been proposed [36], and some of them have been successfully adapted to deep learning. In addition, some recent techniques like self-training [5] and Generative Adversarial Networks (GANs) [31, 23, 20] have been utilized for deep SSL with promising results. A primary track of recent deep semi-supervised methods [28, 17, 7, 33] can be summarized as consistency-based methods. In this type of method, two roles are commonly created, either explicitly or implicitly: a teacher model and a student model (i.e., a Teacher-Student structure). The teacher guides the student to approximate its performance under perturbations. The perturbations could come from the noise of the input or the dropout layers [32], etc. A consistency constraint is then imposed on the predictions of the two roles, forcing the unlabeled data to meet the smoothness assumption of semi-supervised learning.

Figure 1: Teacher-Student versus Dual Student. The teacher (T) in Teacher-Student is an EMA of the student (S), imposing a consistency constraint on the student. Their weights are tightly coupled. In contrast, a bidirectional stabilization constraint is applied between the two students (S and S') in Dual Student. Their weights are loosely coupled.

The teacher in the Teacher-Student structure can be summarized as being generated by an exponential moving average (EMA) of the student model. In the VAT Model [21] and the Π Model [17], the teacher shares the same weights as the student, which is equivalent to setting the averaging coefficient to zero. The Temporal Model [17] is similar to the Π Model except that it also applies an EMA to accumulate the historical predictions. The Mean Teacher [33] applies an EMA to the student weights to obtain an ensemble teacher. In this work, we show that the two roles in the Teacher-Student structure are tightly coupled and that the degree of coupling increases as training goes on. This phenomenon leads to a performance bottleneck, since a coupled EMA teacher is not sufficient for the student.

To overcome this problem, knowledge coming from another independent model should help. Motivated by this observation, we replace the EMA teacher with another student model. The two students start from different initial states and are optimized through individual paths during training. Hence, their weights will not be tightly coupled, and each learns its own knowledge. What remains unclear is how to extract and exchange knowledge between the students. Naively adding a consistency constraint may lead to the two models collapsing into each other. Thus, we define the stable sample and propose a stabilization constraint for effective knowledge exchange. Our method improves the performance significantly on several main SSL benchmarks. Fig. 1 illustrates the Teacher-Student structure and our Dual Student structure.

In summary, the main contributions of this work include:


  • We demonstrate that the coupled EMA teacher causes a performance bottleneck in existing Teacher-Student methods.

  • We define the stable samples of a model and propose a novel stabilization constraint between models.

  • We propose a new SSL structure, Dual Student, and discuss two variants of Dual Student with higher performance.

  • Extensive experiments are conducted to evaluate the performance of our method on several benchmarks and in different tasks.

2 Related Work

2.1 Overview

Consistency-based SSL methods are derived from network noise regularization [30]. Goodfellow et al. [7] first showed the advantage of adversarial noise over random noise. Miyato et al. [21] further explored this idea for unlabeled data and generated virtual adversarial samples for the implicit teacher, while Park et al. [25] proposed a virtual adversarial dropout based on [32]. In addition to noise, the quality of the targets for the consistency constraint is also vital. Bachman et al. [2] and Rasmus et al. [28] showed the effectiveness of regularizing the targets. Laine et al. then proposed the internally consistent Π Model and Temporal Model in [17]. Tarvainen et al. [33] took advantage of averaging model weights [26] to obtain an explicit ensemble teacher for generating targets. Some works derived from traditional methods also improve consistency-based SSL. Smooth Neighbor by Luo et al. [19] utilized the connection between data points and built a neighbor graph to cluster data more tightly. Athiwaratkun et al. [1] modified stochastic weight averaging (SWA) [14] to obtain a stronger ensemble teacher faster. Qiao et al. [27] proposed Deep Co-Training, which adds a consistency constraint between independent models.

2.2 Teacher-Student Structure

The most common structure of recent SSL methods is the Teacher-Student structure. It applies a consistency constraint between a teacher model and a student model to learn knowledge from unlabeled data. Formally, we assume that a dataset $\mathcal{D}$ consists of an unlabeled subset and a labeled subset. Let $\theta'$ denote the weights of the teacher, and $\theta$ denote the weights of the student. The consistency constraint is defined as:

$\mathcal{L}_{con} = \mathbb{E}_{x \in \mathcal{D}}\, d\big(f(\theta, x, \zeta),\ \tau\big)$    (1)

where $f(\theta, x, \zeta)$ is the prediction from model $\theta$ for input $x$ with noise $\zeta$, and $\tau$ is the consistency target from the teacher. $d(\cdot, \cdot)$ measures the distance between two vectors, and is usually set to the mean squared error (MSE) or the KL-divergence. Previous works have proposed several ways to generate $\tau$.

Π Model: In the Π Model, the implicit teacher shares parameters with the student. It forwards a sample twice with different random noise $\zeta$ and $\zeta'$ in each iteration, and treats the prediction under one noise as the target $\tau$ for the other.

Temporal Model: While the Π Model needs to forward a sample twice in each iteration, the Temporal Model reduces this computational overhead by using an EMA to accumulate the predictions over epochs as $\tau$. This approach reduces the prediction variance and stabilizes the training process.

Mean Teacher: The Temporal Model needs to store a record for each sample, and the target is updated only once per epoch while the student is updated multiple times. Hence, Mean Teacher defines an explicit teacher by an EMA of the student weights and updates the teacher in each iteration before generating $\tau$.

VAT Model: Although random noise is effective in previous methods, the VAT Model adopts adversarial noise to generate a better $\tau$ for the consistency constraint.
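For concreteness, the following PyTorch-style sketch illustrates how an explicit EMA teacher of the Mean Teacher type produces the consistency target $\tau$ and how the consistency term of Eq. 1 can be computed. The tiny model, noise level, and smoothing coefficient are illustrative assumptions, not the settings of this paper.

```python
import copy
import torch
import torch.nn.functional as F

def ema_update(teacher, student, alpha=0.99):
    """theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t (cf. Eq. 2)."""
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

def consistency_loss(student, teacher, x, noise_std=0.15):
    """MSE between student and EMA-teacher predictions under different noise (cf. Eq. 1)."""
    x_student = x + noise_std * torch.randn_like(x)
    x_teacher = x + noise_std * torch.randn_like(x)
    p_student = F.softmax(student(x_student), dim=1)
    with torch.no_grad():                      # the target tau is not back-propagated
        tau = F.softmax(teacher(x_teacher), dim=1)
    return F.mse_loss(p_student, tau)

# Usage sketch: the teacher starts as a copy of the student and is updated after each step.
student = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

x = torch.randn(4, 3, 32, 32)                  # dummy unlabeled batch
loss = consistency_loss(student, teacher, x)
loss.backward()
# (optimizer.step() on the student would go here)
ema_update(teacher, student, alpha=0.99)
```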

2.3 Deep Co-Training

It is known that fusing knowledge from multiple models can improve performance in SSL [35]. However, directly adding the consistency constraint between models results in the models collapsing into each other. Deep Co-Training addresses this issue by utilizing the Co-Training assumption from the traditional Co-Training algorithm [3]. It treats the features from the convolutional layers as a view of the input and uses the adversarial samples from other collaborators to ensure that view differences exist among the models. Consistent predictions can then be used for training. However, this strategy requires generating adversarial samples for each model throughout the whole training process, which is complicated and time-consuming.

Our method also has interactions between models to break the limits of the EMA teacher, but there are two major differences between our method and Deep Co-Training. First, instead of enforcing the consistency constraint and the different-views constraint, we only extract the reliable knowledge of each model and exchange it via a more effective stabilization constraint. Second, our method is more efficient, since we do not need adversarial samples.

3 Limits of the EMA Teacher

One fundamental assumption in SSL is the smoothness assumption: if two data points in a high-density region are close, then so should be the corresponding outputs [4]. All existing Teacher-Student methods utilize unlabeled data according to this assumption. In practice, if two inputs are generated from a sample with different small perturbations, they should have consistent predictions by the corresponding teacher and student. Previous methods achieve this by the consistency constraint and have mainly focused on generating more meaningful targets through ensembling or well-designed noise.

However, previous works neglect the fact that the teacher is essentially an EMA of the student. Hence, their weights are tightly coupled. Formally, the teacher weights $\theta'_t$ at training step $t$ are an ensemble of the student weights $\theta_t$ over successive training steps with a smoothing coefficient $\alpha$:

$\theta'_t = \alpha\, \theta'_{t-1} + (1 - \alpha)\, \theta_t$    (2)

In the Π Model and the VAT Model, as $\alpha$ is set to zero, $\theta'_t$ is equal to $\theta_t$. The Temporal Model improves the Π Model by an EMA on historical predictions, but its teacher still shares weights with the student. As for Mean Teacher, the update of the student weights decreases as the model converges, i.e., $\|\theta_t - \theta_{t-1}\|$ becomes smaller and smaller as the number of training steps increases. Theoretically, it can be proved that the EMA of a converging sequence converges to the same limit as the sequence, as shown in Appendix A (Supplementary). Thus, the teacher will be very close to the student when the training process converges. In all the above cases, the coupling between the teacher and the student is obvious.
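The convergence claim can be illustrated numerically. The short sketch below (ours, for illustration only) applies the EMA of Eq. 2 to a single converging scalar "weight": the EMA converges to the same limit, so the teacher-student distance shrinks toward zero, mirroring the behaviour observed in Fig. 2.

```python
import numpy as np

alpha = 0.99                       # EMA smoothing coefficient (Eq. 2)
steps = np.arange(1, 20001)
student = 1.0 + 1.0 / steps        # a converging "student weight" sequence, limit = 1.0

teacher = np.empty_like(student)
teacher[0] = student[0]
for t in range(1, len(student)):
    teacher[t] = alpha * teacher[t - 1] + (1 - alpha) * student[t]

for t in (100, 1000, 10000, 19999):
    print(f"step {t:>5d}: |teacher - student| = {abs(teacher[t] - student[t]):.2e}")
# The printed distance shrinks toward zero as training "converges".
```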

To further visualize this, we train two structures on the CIFAR-10 SSL benchmark. One contains a student and an EMA teacher, while the other contains two independent models. We then calculate the Euclidean distance of the weights and of the predictions between the two models in each structure. Fig. 2 shows the results. As expected, the EMA teacher in the first structure is very close to its student, and their distance approaches zero as the number of epochs increases. In contrast, the two independent models always keep a larger distance from each other. These results confirm our conjecture that the EMA teacher is tightly coupled with the student. In addition, they also demonstrate that two independent models are loosely coupled.

Figure 2: Left: the student and the EMA teacher have similar weights, while the weights of the two independent models keep a certain distance. Right: the predictions of the two independent models keep a larger distance than those of the student and the EMA teacher.
Figure 3: Our method can alleviate the confirmation bias. Two of the models are the independent students from our Dual Student, while the third is the student guided by the Mean Teacher. For a misclassified sample (belonging to class 1), one Dual Student model corrects it quickly with the knowledge from the other student; however, the Mean Teacher student is unable to correct its prediction due to the wrong guidance from the EMA teacher.
Figure 4: Dual Student structure overview. We train two student models separately. Each batch includes labeled and unlabeled data and is forwarded twice. The stabilization constraint based on the stable samples is enforced between the students. Each student also learns the labeled data by the classification constraint and meets the smoothness assumption by the consistency constraint.

Due to the coupling effect between the two roles in the existing Teacher-Student methods, the teacher does not have more meaningful knowledge than the student. In addition, if the student has biased predictions for specific samples, the EMA teacher is likely to accumulate the mistakes and force the student to follow them, making the misclassification irreversible. This is a case of the confirmation bias [33]. Most methods apply a ramp-up operation to the consistency constraint to alleviate the bias, but it is inadequate to solve the problem. From this perspective, training independent models is also beneficial. Fig. 3 visualizes this inability of the EMA teacher. Three models are trained on a two-category task simultaneously: the student from Mean Teacher, and two relatively independent but interactive models representing the two students from our Dual Student structure (Section 4). Two of the models share the same initialization, while the third starts from a different state. The plot shows how the prediction of a sample from class 1 changes over epochs for these three models, demonstrating that our method can alleviate the confirmation bias.

4 Dual Student

As analyzed above, the targets from an EMA teacher are not adequate to guide the student when the number of training steps is large. Therefore, our method obtains loosely coupled targets by training two independent models simultaneously. However, the outputs of these two models may vary widely, and applying the consistency constraint directly will cause them to collapse into each other by exchanging wrong knowledge. The EMA teacher does not suffer from this problem due to the coupling effect.

We propose an efficient way to overcome this problem, which is to exchange only the reliable knowledge of the models. To put this idea into practice, we need to solve two problems: how to define and acquire the reliable knowledge of a model, and how to exchange this knowledge mutually. To address them, we define the stable sample in Section 4.1 and then elaborate the derived stabilization constraint for training in Section 4.2.

4.1 Stable Sample

A model can be regarded as a decision function that makes reliable predictions for some samples but not for the others. We define the stable sample and treat it as the reliable knowledge of a model. A stable sample satisfies two conditions. First, according to the smoothness assumption, a small perturbation should not affect the prediction of this sample, i.e., the model should be smooth in the neighborhood of this sample. Second, the prediction of this sample is far from the decision boundary, which means that the sample has a high probability for its predicted label.

Definition 4.1 (Stable sample).

Given a constant $\xi \in (0, 1)$, a dataset $\mathcal{D}$ that satisfies the smoothness assumption, and a model $f(\theta)$ that outputs a prediction probability vector $f(\theta, x)$ for all $x \in \mathcal{D}$, $x$ is a stable sample with respect to $f(\theta)$ if:

  1. For all $\bar{x}$ near $x$, their predicted labels are the same, i.e., $\arg\max f(\theta, \bar{x}) = \arg\max f(\theta, x)$.

  2. $x$ satisfies the inequality $\max f(\theta, x) > \xi$.

Def. 4.1 defines the stable sample, and Fig. 5 illustrates its conditions in detail. Notice that the concept of a stable sample is specific to a model: a data point can be stable with respect to one model but not to the others. This fact is key to our stabilization constraint and will be elaborated in Section 4.2. In addition to the criterion of whether a sample is stable or not, we would also like to know the degree of stability of a stable sample. This can be reflected by the prediction consistency in its neighborhood: the more consistent the predictions are, the more stable the sample is.

4.2 Training by the Stabilization Constraint

We briefly introduce the Dual Student structure before explaining the training details. It contains two independent student models, which share the same network architecture but start from different initial states and are updated separately (Fig. 4). For our structure to be trainable, we derive a novel stabilization constraint from the stable sample.

In practice, we only utilize two close samples to approximate the conditions of the stable sample, in order to reduce the computational overhead. Formally, we use $\theta^a$ and $\theta^b$ to represent the weights of the two students, and write $f(\theta^i, x)$ for the prediction of student $i \in \{a, b\}$ on input $x$. We first define a boolean function $\{\cdot\}$, which outputs 1 when the condition inside is true and 0 otherwise. Suppose $\bar{x}$ is a noisy augmentation of a sample $x$. We then check whether $x$ is a stable sample for student $i$:

$\mathcal{R}^i_x = \{\mathcal{P}^i_x = \mathcal{P}^i_{\bar{x}}\} \,\&\, \big(\{\mathcal{M}^i_x > \xi\} \,\|\, \{\mathcal{M}^i_{\bar{x}} > \xi\}\big)$    (3)

where $\mathcal{P}^i_x$ and $\mathcal{P}^i_{\bar{x}}$ are the predicted labels of $x$ and $\bar{x}$, respectively, by student $i$, and $\mathcal{M}^i_x = \max f(\theta^i, x)$ is the maximum prediction probability. The hyperparameter $\xi$ is a confidence threshold in $(0, 1)$. If the maximum prediction probability of sample $x$ exceeds $\xi$, $x$ is considered to be far enough from the classification boundary. We then use the Euclidean distance to measure the prediction consistency, which indicates the stability of $x$ for student $i$:

$\mathcal{E}^i_x = \big\| f(\theta^i, x) - f(\theta^i, \bar{x}) \big\|^2$    (4)

A smaller $\mathcal{E}^i_x$ means that $x$ is more stable for student $i$. The distance between the predictions of students $i$ and $j$ ($j \ne i$) can be measured using the mean squared error (MSE) as:

$\mathcal{L}^i_{mse} = \big\| f(\theta^i, x) - f(\theta^j, x) \big\|^2$    (5)

Finally, the stabilization constraint for student $i$ on sample $x$ is written as:

$\mathcal{L}^i_{sta} = \{\mathcal{E}^j_x < \mathcal{E}^i_x\}\, \mathcal{L}^i_{mse}$ if $\mathcal{R}^i_x = \mathcal{R}^j_x = 1$, and $\mathcal{L}^i_{sta} = \{\mathcal{R}^j_x = 1\}\, \mathcal{L}^i_{mse}$ otherwise.    (6)

We calculate the stabilization constraint for student $j$ in the same way. As we can see, the stabilization constraint changes dynamically depending on the outputs of the two students. There are three cases: (1) no constraint is applied if $x$ is unstable for both students; (2) if $x$ is stable only for one student, that student guides the other; (3) if $x$ is stable for both students, their stabilities are compared, and the constraint is applied from the more stable one to the other.
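A minimal PyTorch-style re-implementation of Eqs. 3-6 is sketched below. It is our own reading of the constraint rather than released code; the function names and the threshold value ξ = 0.8 are illustrative assumptions. The target from the other student is detached, so each student is updated separately.

```python
import torch
import torch.nn.functional as F

def stability_signals(p_clean, p_noisy, xi=0.8):
    """Per-sample stable flag R (Eq. 3) and instability score E (Eq. 4) for one student.

    p_clean, p_noisy: softmax predictions of shape (batch, classes) for x and its
    noisy augmentation x_bar. A smaller E means the sample is more stable."""
    same_label = p_clean.argmax(dim=1) == p_noisy.argmax(dim=1)
    confident = (p_clean.max(dim=1).values > xi) | (p_noisy.max(dim=1).values > xi)
    stable = (same_label & confident).float()                    # R in {0, 1}
    instability = ((p_clean - p_noisy) ** 2).sum(dim=1)          # E, squared Euclidean distance
    return stable, instability

def stabilization_loss(p_a, p_a_bar, p_b, p_b_bar, xi=0.8):
    """Stabilization constraint applied to student a, with student b as the target (Eq. 6)."""
    r_a, e_a = stability_signals(p_a, p_a_bar, xi)
    r_b, e_b = stability_signals(p_b, p_b_bar, xi)
    # MSE between the two students' predictions on the clean sample (Eq. 5);
    # the target from student b is detached so gradients only flow into student a.
    mse = ((p_a - p_b.detach()) ** 2).sum(dim=1)
    # Case (1): neither stable -> no constraint.  Case (2): only b stable -> b guides a.
    # Case (3): both stable -> the more stable student (smaller E) guides the other.
    gate = (1 - r_a) * r_b + r_a * r_b * (e_b < e_a).float()
    return (gate * mse).mean()

# Usage sketch with random logits standing in for the two students' outputs.
logits = [torch.randn(8, 10, requires_grad=True) for _ in range(4)]
p_a, p_a_bar, p_b, p_b_bar = [F.softmax(l, dim=1) for l in logits]
loss_a = stabilization_loss(p_a, p_a_bar, p_b, p_b_bar, xi=0.8)
loss_a.backward()
```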

Figure 5: Illustration of the conditions for a stable sample. Consider three pairs of adjacent data points: (1) the first pair does not satisfy the 1st condition, (2) the second pair does not satisfy the 2nd condition, and (3) the third pair satisfies both conditions.

Following previous works, our Dual Student structure also imposes the consistency constraint within each student to meet the smoothness assumption. We also apply the decoupled top layers trick from Mean Teacher, which splits the constraints for classification and smoothness.

To train Dual Student, the final constraint for student $i$ is a combination of three parts: the classification constraint, the consistency constraint within each model, and the stabilization constraint between models:

$\mathcal{L}^i = \mathcal{L}^i_{cls} + \lambda_1 \mathcal{L}^i_{con} + \lambda_2 \mathcal{L}^i_{sta}$    (7)

where $\lambda_1$ and $\lambda_2$ are hyperparameters to balance the constraints. Algorithm 1 summarizes the optimization process.

1: Input: batches containing labeled and unlabeled samples
2: Input: two independent student models $\theta^a$ and $\theta^b$
3: for each batch do
4:     Get $\bar{x}$, $\tilde{x}$ from each $x$ by data augmentation
5:     for each model $\theta^i$ in {$\theta^a$, $\theta^b$} do
6:          Calculate $\mathcal{L}^i_{cls}$ on labeled samples
7:          Calculate $\mathcal{L}^i_{con}$ by Eq. 1 between $\bar{x}$ and $\tilde{x}$
8:     end for
9:     for each unlabeled sample $x$ do
10:          for each model $\theta^i$ in {$\theta^a$, $\theta^b$} do
11:               Determine whether $x$ is stable by Eq. 3
12:          end for
13:          if $x$ is stable for both $\theta^a$ and $\theta^b$ then
14:               Calculate the stability of $x$ by Eq. 4
15:          end if
16:          Calculate $\mathcal{L}_{sta}$ for $\theta^a$ and $\theta^b$ by Eq. 6
17:     end for
18:     Update $\theta^a$ and $\theta^b$ by the loss in Eq. 7
19: end for
Algorithm 1: Training of Dual Student for SSL.
Model 1k labels 2k labels 4k labels all labels
Π [17]
Π + SN [19]
Temp [17]
Temp + SN [19]
MT [33]
MT + FSWA [1]
CS
DS
MT + FSWA (1200) [1]
Deep CT (600) [27] - - -
DS (600)
Table 1: Test error rate on CIFAR-10 averaged over 5 runs. Parentheses show the number of training epochs (default 300).

4.3 Variants of Dual Student

Here, we briefly discuss two variants of Dual Student, named Multiple Student and Imbalanced Student. Both of them achieve higher performance than the standard Dual Student. They do not increase the inference time, even though more computation is required during training.

Multiple Student: Our Dual Student can be easily extended to Multiple Student, following a strategy similar to Deep Co-Training. We assume that Multiple Student contains $n$ student models. At each iteration, we randomly divide these students into pairs, and each pair is then updated like Dual Student (see the sketch below). Since our method does not require the models to have view differences, the data stream can be shared among the students. This is different from Deep Co-Training, which requires an exclusive data stream for each pair. In practice, four students achieve a notable improvement over two students. However, having more than four students does not further improve the performance, as demonstrated in Section 5.2.
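The random pairing step can be sketched as follows (our own illustration; dual_student_update is a hypothetical placeholder for the two-student update described above).

```python
import random

def random_pairs(num_students, rng=random):
    """Shuffle student indices and group them into disjoint pairs (num_students is even)."""
    order = list(range(num_students))
    rng.shuffle(order)
    return [(order[i], order[i + 1]) for i in range(0, num_students, 2)]

# e.g. with four students, one possible outcome: [(2, 0), (3, 1)]
for a, b in random_pairs(4):
    # dual_student_update(students[a], students[b], batch)  # hypothetical update call
    pass
```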

Imbalanced Student: Since a well-designed architecture with more parameters usually has better performance, a pre-trained high-performance teacher can be used to improve a light-weight student in the knowledge distillation task [9, 10]. Based on the same idea, we extend Dual Student to Imbalanced Student by enhancing the capability of one student. However, we do not treat the sophisticated model as a teacher, since knowledge is still exchanged mutually. We find that the improvement of the weak student is proportional to the capability of the strong student.

5 Experiments

We first evaluate Dual Student on several common SSL benchmarks, including CIFAR, SVHN, and ImageNet. We then evaluate the performance of the two variants of Dual Student. We further analyze various aspects of our method through ablation experiments. Finally, we demonstrate the application of Dual Student to a domain adaptation task.

Unless specified otherwise, the architecture used in our experiments is the same 13-layer convolutional neural network (CNN) as in previous works [17, 21, 33]. Its details are described in Appendix B (Supplementary). As reported in [24], the implementations of recent SSL methods are not exactly the same, and the training details (e.g., the number of training epochs, the optimizer, and the augmentation) may also differ. For a fair comparison, we implement our method following the previous state of the art [1], which uses the standard Batch Norm [12] instead of the mean-only Batch Norm [13]. The stochastic gradient descent optimizer is adopted with a learning rate adjustment function of the current training step $t$, the total number of steps $T$, and the initial learning rate $l_0$. These settings provide better baselines for the Π Model and Mean Teacher. For other methods, we use the results from the original papers. More training details are provided in Appendix C (Supplementary).
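As an illustration, the sketch below implements one common cosine-shaped choice for such a schedule, parameterized by the current step $t$, the total number of steps $T$, and the initial learning rate $l_0$; it is an assumption for illustration, not necessarily the exact function used here.

```python
import math

def adjusted_lr(step, total_steps, initial_lr):
    """Hypothetical cosine-shaped schedule: decays from initial_lr at step 0 to 0 at the
    final step. Shown only as one plausible form of the adjustment function."""
    return initial_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

# e.g. initial_lr = 0.1, total_steps = 1000
for step in (0, 250, 500, 750, 1000):
    print(step, round(adjusted_lr(step, 1000, 0.1), 4))
```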

5.1 SSL Benchmarks

We first evaluate Dual Student on the CIFAR benchmark, including CIFAR-10 [15] and CIFAR-100 [16]. CIFAR-10 has 50k training samples and 10k testing samples from 10 categories. Each sample is a 32×32 RGB image. We randomly extract 1k, 2k, and 4k balanced labels. CIFAR-100 [16] is a more complex dataset containing 100 categories. Each category contains only 500 training samples, together with 100 test samples. We randomly extract 10k balanced labels from it. Besides, we also run experiments with full labels on both datasets. We compare our Dual Student (DS) with recent consistency-based models, including the Π Model (Π), Temporal Model (Temp), Mean Teacher (MT), Smooth Neighbor (SN), FastSWA based on Mean Teacher (MT+FSWA), and Deep Co-Training (Deep CT). We also replace the stabilization constraint in our structure with the consistency constraint (CS) as a baseline.

Model 10k labels all labels
Temp [17]
Π [17]
Π + FSWA [1]
MT [33]
MT + FSWA [1]
DS
MT + FSWA (1200) [1]
Deep CT (600) [27] -
DS (480)
Table 2: Test error rate on CIFAR-100 averaged over 5 runs.

Table 1 shows the results on CIFAR-10. All models are trained for 300 epochs, except for those specified with parentheses. Some results are obtained from other works that published better performance than the original ones. We can see that our Dual Student boosts the performance in all semi-supervised settings. The results reveal that as the number of labeled samples decreases, our method gains more significant improvements. Specifically, Dual Student improves the result with 1k labels while using only half of the training epochs of FastSWA. Similar results can also be observed in the experiments with 2k and 4k labels. Fig. 6 shows that the accuracy on only the stable samples is higher than that on all samples, which indicates that the stable samples represent the relatively more reliable knowledge of a model. This justifies why our DS with the stabilization constraint achieves much better results than the CS baseline. Our result with full labels shows less advantage, since the labels play a much more important role in the fully supervised case. Table 2 lists the results on CIFAR-100. In particular, in the 10k-label experiments, Dual Student sets a new state of the art with fewer training epochs than FastSWA and Deep Co-Training.

Figure 6: Test accuracy of each category on the stable samples and on all samples of CIFAR-10. The performance gap indicates that the stable samples represent relatively more reliable knowledge of a model. The average ratio of stable samples on the test set is about 85% w.r.t. the model.
Model 250 labels 500 labels
Supervised [33]
MT  [33]
DS
Table 3: Test error rate on SVHN averaged over 5 runs.
Model 10% labels-top1 10% labels-top5
Supervised
MT [33]
DS
Table 4: Test error rate on ImageNet averaged over 2 runs.

To evaluate the generalization ability of Dual Student, we also conduct experiments on SVHN [22] and ImageNet [29]. Street View House Numbers (SVHN) is a dataset containing 73,257 training samples and 26,032 testing samples. Each sample is a 32×32 RGB image with a center close-up of a house number. We only experiment with 250 and 500 labels on SVHN. ImageNet contains more than 10 million RGB images belonging to 1k categories. We extract 10% balanced labels and train a 50-layer ResNeXt model [34]. Tables 3 and 4 show that Dual Student improves the results on these datasets of various scales.

Model CIFAR-10 (1k labels) CIFAR-100 (10k labels)
DS
MS (4 models)
MS (8 models)
IS (3.53M params)
IS (11.6M params)
Table 5: Test error rate of two variants of Dual Student (all using the 13-layer CNN) on the CIFAR benchmark, averaged over 3 runs. Parentheses after Multiple Student (MS) indicate the number of students; parentheses after Imbalanced Student (IS) indicate the number of parameters of the strong student.
Figure 7: Test accuracy on CIFAR-10 with 1k labels. Left: Combining our method with Mean Teacher can improve its performance. Right: The effectiveness of our stabilization constraint.

5.2 Performance of Variants

We evaluate Multiple Student and Imbalanced Student on the CIFAR benchmark. Table 5 compares them with the standard Dual Student, all using the same 13-layer CNN trained for 300 epochs. For Multiple Student (MS), we train both four students and eight students. The performance improvement is limited when more than four students are trained simultaneously. For Imbalanced Student (IS), we replace one student with a ResNet [8] with Shake-Shake regularization. We then conduct experiments on two model sizes: a small one with 3.53 million parameters and a large one with 11.65 million parameters. The small ResNet has almost no increase in computational cost, as its number of parameters is similar to that of the 13-layer CNN (3.13 million parameters). Imbalanced Student achieves a significant performance improvement by distilling knowledge from the more powerful student. Notably, the large ResNet improves the result from 15.74% to 12.39% on CIFAR-10 with 1k labels.

Our structure can also be easily combined with existing methods to further improve performance. We replace the consistency constraint inside each model with Mean Teacher. Fig. 7 (left) shows the accuracy curves. The obvious performance improvement shows the ability of Dual Student to break the limits of the EMA teacher. The accuracy of the combination is similar to that of using Dual Student only, which means that our method is insensitive to the type of consistency constraint inside each model.

5.3 Ablation Experiments

We conduct ablation experiments on CIFAR-10 with 1k labels to analyze the impact of the confidence threshold and of the various constraints in our structure.

Confidence threshold: The confidence threshold $\xi$ controls the 2nd condition of the stable sample in Def. 4.1 by filtering out samples near the decision boundary. Its value can be set approximately, since our method is robust to it. Typically, $\xi$ is related to the complexity of the task, e.g., the number of categories to predict or the size of the given dataset. More categories or a smaller dataset would require a smaller $\xi$. Table 6 compares different values of $\xi$ on the CIFAR benchmark. The results show that the threshold is necessary for better performance, and meticulous tuning may only improve the performance slightly.

Effect of the constraints: Dual Student learns from the unlabeled data by both the stabilization constraint between models and the consistency constraint inside each model. We also study their individual impacts. Besides, we compare the results with an experiment where only the consistency constraint is applied between models. Fig. 7 (right) shows that the inter-model consistency constraint reduces the accuracy in the late stage, while the stabilization constraint helps improve the performance continuously. This demonstrates that our stabilization constraint is better than a plain consistency constraint between models. In addition, the consistency constraint inside each model also plays a role in boosting the performance further.

Dataset (Labels)
CIFAR-10   (1k)
CIFAR-100 (10k)
Table 6: Mean test error rate on the CIFAR benchmark averaged over 5 runs, with different confidence threshold values $\xi$. Parentheses show the numbers of labeled samples.

5.4 Domain Adaptation

Domain adaptation aims to transfer knowledge learned from a labeled dataset to an unlabeled one. French et al. [6] modified Mean Teacher and the Temporal Model to enable domain adaptation and showed the effectiveness of the Teacher-Student structure. In this section, we apply Dual Student to adapting a digit recognition model from USPS to MNIST and show that it can be applied to this kind of task with clear advantages over the EMA-teacher-based methods.

Both USPS and MNIST are greyscale hand-written digit datasets. USPS consists of 7,000 images of 16×16 pixels, and MNIST contains 60,000 images of 28×28 pixels. To match the image resolution, we resize all images from USPS to 28×28 by cubic spline interpolation. Fig. 8 shows the domain difference between the two datasets. In our experiments, we set USPS as the source domain and MNIST as the target domain. We compare our method with Mean Teacher, a source domain (USPS) supervised model, and a target domain (MNIST) supervised model (trained on 7k balanced labels). All experiments use a small architecture simplified from the above 13-layer CNN. More details are available in Appendix D (Supplementary).
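The resizing step can be done, for example, with SciPy's spline-based zoom (order 3 corresponds to cubic spline interpolation). The snippet below is a minimal sketch of this preprocessing, not the exact pipeline used in the experiments.

```python
import numpy as np
from scipy.ndimage import zoom

def upsample_usps(img_16x16):
    """Resize a 16x16 USPS digit to 28x28 with cubic spline interpolation (order=3)."""
    return zoom(img_16x16.astype(np.float32), 28.0 / 16.0, order=3)

img = np.random.rand(16, 16)       # stand-in for a USPS image
print(upsample_usps(img).shape)    # (28, 28)
```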

Fig. 9 shows the test accuracy versus the number of epochs. We can see that naively using supervision from USPS results in overfitting. Mean Teacher avoids it to some extent and improves the top-1 accuracy from 69.09% to 80.41%, but it overfits when the number of training epochs is large. Our Dual Student avoids overfitting and boosts the accuracy to 91.50%, which is much closer to the result obtained by supervision from the target domain.

Figure 8: Domain difference between USPS and MNIST. The digits in USPS are in a bold font face and span the whole image without a border.
Figure 9: Test curves of domain adaptation from USPS to MNIST versus the number of epochs. Dual Student avoids overfitting and improves the result remarkably.

6 Conclusion

In this paper, we have studied the coupling effect in the existing Teacher-Student methods and shown that it sets a performance bottleneck for the structure. We have proposed a new structure, Dual Student, to break the limits of the EMA teacher, together with a novel stabilization constraint, which provides an effective way to train independent models (whether with the same architecture or not). The stabilization constraint is bidirectional overall but unidirectional for each stable sample. The improved performance is notable across datasets and tasks. Besides, we have also discussed two variants of Dual Student with even better results. However, our method still shares similar limitations with existing methods, e.g., increased memory usage during training and performance degradation as the number of labels increases. In the future, we plan to address these issues and extend our structure to other applications.

Appendix A: Convergence of the EMA

In our paper, we state that the EMA teacher is coupled with the student in the existing Teacher-Student methods. We provide below a formal proposition for this statement and a simple proof.

Proposition 1.

Given a sequence $\{a_t\}$, let $b_t = \alpha\, b_{t-1} + (1 - \alpha)\, a_t$, where $\alpha \in (0, 1)$, $b_0 = a_0$, $t \ge 1$. If $\{a_t\}$ converges to $a$, then $\{b_t\}$ converges to $a$ as well.

Proof.

By the definition of convergence, if $\{a_t\}$ converges to $a$, we have: $\forall \epsilon > 0$, $\exists N$ such that $\forall n > N$, $|a_n - a| < \epsilon$. First, when $n > N$, by unrolling the recursion and using the formula for the sum of a finite geometric series, we rewrite $b_n$ in terms of $b_N$ as:

$b_n = \alpha^{n-N} b_N + (1 - \alpha) \sum_{k=N+1}^{n} \alpha^{n-k} a_k$    (8)

Since $N$ is finite, $b_N$ is bounded. Thus, $\exists M$ such that $|b_N - a| \le M$. Since $\alpha \in (0, 1)$, we have $\alpha^{n-N} \to 0$ as $n \to \infty$. Thus, $\exists N'$ such that $\forall n > N'$, $\alpha^{n-N} M < \epsilon$. Then, after substituting Eq. 8 into $|b_n - a|$ and applying the Triangle Inequality, we have:

$|b_n - a| \le \alpha^{n-N} |b_N - a| + (1 - \alpha) \sum_{k=N+1}^{n} \alpha^{n-k} |a_k - a|$    (9)

Then $\forall n > \max(N, N')$, we have:

$\alpha^{n-N} |b_N - a| \le \alpha^{n-N} M < \epsilon$    (10)
$(1 - \alpha) \sum_{k=N+1}^{n} \alpha^{n-k} |a_k - a| < (1 - \alpha)\, \epsilon \sum_{k=N+1}^{n} \alpha^{n-k}$    (11)
$= \epsilon\, (1 - \alpha^{n-N}) < \epsilon$    (12)

Combining Eqs. 9, 10, 11, and 12, we have $\forall n > \max(N, N')$, $|b_n - a| < 2\epsilon$, i.e., $\{b_t\}$ converges to $a$. ∎

Appendix B: Model Architectures

The model architecture used in our CIFAR-10, CIFAR-100, and SVHN experiments is the 13-layer convolutional network (13-layer CNN), which is the same as in previous works [33, 17, 1, 19, 27]. We implement it following FastSWA [1] for comparison. Table 7 describes its architecture in detail. For ImageNet experiments, we use a 50-layer ResNeXt [34] architecture, which includes 3+4+6+3 residual blocks and uses group convolutions with 32 groups.

Appendix C: Semi-supervised Learning Setups

In our work, all experiments use the SGD optimizer with Nesterov momentum. The learning rate is adjusted by the function mentioned in Section 5, where $t$ is the current training step, $T$ is the total number of steps, and $l_0$ is the initial learning rate. We present the settings of the experiments on each dataset as follows.

CIFAR-10: On CIFAR-10, we set the batch size to 100 and half of the samples in each batch are labeled. The initial learning rate is . The weight decay is . For the stabilization constraint, we set its coefficient and ramp it up in the first 5 epochs. We set . The confidence threshold for the stable samples is .
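The shape of the ramp-up follows a sigmoid-style schedule in much of this line of work; the sketch below shows that common choice purely as an assumption, not necessarily the exact schedule used here.

```python
import math

def rampup_weight(epoch, max_weight, rampup_epochs=5):
    """Assumed sigmoid-shaped ramp-up, exp(-5 * (1 - x)^2): the stabilization coefficient
    grows smoothly from near 0 to max_weight over the first `rampup_epochs` epochs."""
    if epoch >= rampup_epochs:
        return max_weight
    x = epoch / rampup_epochs
    return max_weight * math.exp(-5.0 * (1.0 - x) ** 2)

# e.g. with max_weight = 1.0: epoch 0 -> 0.0067, epoch 2 -> 0.165, epoch 5 -> 1.0
```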

CIFAR-100: On CIFAR-100, each minibatch contains 128 samples, including 31 labeled samples. We set the initial learning rate to and the weight decay to . The confidence threshold is . Other hyperparameters are the same as CIFAR-10.

Layer Details
input RGB image
augmentation random translation, horizontal flip
convolution , , pad = same, LReLU =
convolution , , pad = same, LReLU =
convolution , , pad = same, LReLU =
pooling , type = maxpool
dropout =
convolution , , pad = same, LReLU =
convolution , , pad = same, LReLU =
convolution , , pad = same, LReLU =
pooling , type = maxpool
dropout =
convolution , , pad = valid, LReLU =
convolution , , LReLU =
convolution , , LReLU =
pooling , type = avgpool
dense , softmax
Table 7: The 13-layer CNN for our SSL experiments.
Layer Details
input Gray image
augmentation gaussian noise =
convolution , , pad = same, LReLU =
pooling , type = maxpool
convolution , , pad = same, LReLU =
pooling , type = maxpool
dropout =
convolution , , pad = same, LReLU =
pooling , type = avgpool
dense , softmax
Table 8: The small CNN for domain adaptation.

SVHN: The batch size on SVHN is 100, and each minibatch contains only 10 labeled samples. The initial learning rate is , and the weight decay is . The stabilization constraint is scaled by (ramp up in 5 epochs). We use the confidence threshold .

ImageNet: We validate our method on ImageNet by the ResNeXt-50 architecture on 8 GPUs with batch size and half of the batch are labeled samples. Each sample is augmented following [11] and is resized to . We warm-up the learning rate from to in the first epochs. The model is trained for epochs with the weight decay set to , the stabilization constraint coefficient set to , and a small confidence threshold of .

Appendix D: Domain Adaptation Setups

We design a small convolutional network for the domain adaptation from USPS (source domain) to MNIST (target domain). The structure is shown in Table 8. We train all experiments for 100 epochs by the SGD optimizer with the nesterov momentum set to and the weight decay set to . The learning rate declines from to by a cosine adjustment. Each batch includes 256 samples while 32 of them are labeled. We randomly extract 7000 balanced samples from MNIST for target-supervised experiments, and other experiments are done by using the training set of USPS. The coefficient of the stabilization constraint is . We also ramp it up in the first 5 epochs. The confidence threshold is . We discover that the input noise with is vital for the Mean Teacher but not for our method in this experiment.

References

  • [1] B. Athiwaratkun, M. Finzi, P. Izmailov, and A. G. Wilson (2019) There are many consistent explanations of unlabeled data: why you should average. In Proc ICLR, Cited by: §2.1, Table 1, Table 2, §5, Appendix B: Model Architectures.
  • [2] P. Bachman, O. Alsharif, and D. Precup (2014) Learning with pseudo-ensembles. In Proc. NIPS, Cited by: §2.1.
  • [3] A. Blum and T. Mitchell (1998) Combining labeled and unlabeled data with co-training. In Proc. Annual Conference on Computational Learning Theory, Cited by: §2.3.
  • [4] O. Chapelle, B. Schölkopf, and A. Zien (2006) Semi-supervised learning. The MIT Press. Cited by: §3.
  • [5] D. Chen, W. Wang, W. Gao, and Z. Zhou (2018) Tri-net for semi-supervised deep learning. In Proc. IJCAI, Cited by: §1.
  • [6] G. French, M. Mackiewicz, and M. Fisher (2018) Self-ensembling for domain adaptation. In Proc. ICLR, Cited by: §5.4.
  • [7] I. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In Proc. ICLR, Cited by: §1, §2.1.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. Proc. CVPR. Cited by: §5.2.
  • [9] G. Hinton, O. Vinyals, and J. Dean (2014) Distilling the knowledge in a neural network. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, Cited by: §4.3.
  • [10] G. Hinton, O. Vinyals, and J. Dean (2017) Efficient knowledge distillation from an ensemble of teachers. In Annual Conference of the International Speech Communication Association, Cited by: §4.3.
  • [11] J. Hu, L. Shen, and G. Sun (2017) Squeeze-and-excitation networks. arXiv:1709.01507. Cited by: Appendix C: Semi-supervised Learning Setups.
  • [12] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proc. ICML, Cited by: §5.
  • [13] S. Ioffe (2017) Batch renormalization: towards reducing minibatch dependence in batch-normalized models. In Proc. NIPS, Cited by: §5.
  • [14] P. Izmailov, D. Podoprikhin, T. Garipov, D. P. Vetrov, and A. G. Wilson (2018) Averaging weights leads to wider optima and better generalization. In Proc. UAI, Cited by: §2.1.
  • [15] A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-10 (Canadian Institute for Advanced Research). Cited by: §5.1.
  • [16] A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-100 (Canadian Institute for Advanced Research). Cited by: §5.1.
  • [17] S. Laine and T. Aila (2017) Temporal ensembling for semi-supervised learning. In Proc. ICLR, Cited by: §1, §1, §2.1, Table 1, Table 2, §5, Appendix B: Model Architectures.
  • [18] W. Li, L. Wang, W. Li, E. Agustsson, and L. V. Gool (2017) WebVision database: visual learning and understanding from web data. External Links: Link, 1708.02862 Cited by: §1.
  • [19] Y. Luo, J. Zhu, M. Li, Y. Ren, and B. Zhang (2018) Smooth neighbors on teacher graphs for semi-supervised learning. In Proc. CVPR, Cited by: §2.1, Table 1, Appendix B: Model Architectures.
  • [20] L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther (2016) Auxiliary deep generative models. In Proc. ICML, Cited by: §1.
  • [21] T. Miyato, S. Maeda, S. Ishii, and M. Koyama (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE TPAMI. Cited by: §1, §2.1, §5.
  • [22] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Ng (2011) Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, Cited by: §5.1.
  • [23] A. Odena (2016) Semi-supervised learning with generative adversarial networks. In Data Efficient Machine Learning workshop at ICML, Cited by: §1.
  • [24] A. Oliver, A. Odena, C. Raffel, E. Cubuk, and I. Goodfellow (2018) Realistic evaluation of semi-supervised learning algorithms. In Proc. NeurIPS, Cited by: §5.
  • [25] S. Park, J. Park, S. Shin, and I. Moon (2018) Adversarial dropout for supervised and semi-supervised learning. In Proc. AAAI, Cited by: §2.1.
  • [26] B. T. Polyak and A. Juditsky (1992) Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization. Cited by: §2.1.
  • [27] S. Qiao, W. Shen, Z. Zhang, B. Wang, and A. L. Yuille (2018) Deep co-training for semi-supervised image recognition. In Proc. ECCV, Cited by: §2.1, Table 1, Table 2, Appendix B: Model Architectures.
  • [28] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko (2015) Semi-supervised learning with ladder networks. In Proc. NIPS, Cited by: §1, §2.1.
  • [29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. IJCV. Cited by: §1, §5.1.
  • [30] J. Sietsma and R. Dow (1991) Creating artificial neural networks that generalize. Neural Networks. Cited by: §2.1.
  • [31] J. T. Springenberg (2015) Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv:1511.06390. Cited by: §1.
  • [32] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research. Cited by: §1, §2.1.
  • [33] A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Proc. NIPS, Cited by: §1, §1, §2.1, §3, Table 1, Table 2, Table 3, Table 4, §5, Appendix B: Model Architectures.
  • [34] S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proc. CVPR, Cited by: §5.1, Appendix B: Model Architectures.
  • [35] Z. Zhou (2011) When semi-supervised learning meets ensemble learning. In Frontiers of Electrical and Electronic Engineering in China, Cited by: §2.3.
  • [36] X. Zhu (2006) Semi-supervised learning literature survey. TR 1530, University of Wisconsin, Madison. Cited by: §1.