Recently, deep learning models have gained significant attention due to their excellent representational capacity and ground-breaking results on problems such as object recognition [24, 8], object detection, image segmentation, speech recognition [29], machine translation, autonomous driving, and so on. Training such state-of-the-art models inevitably requires large amounts of labelled data. However, labelling data is expensive, time-consuming, and requires extensive human intervention, so the availability of labelled data is limited. On the other hand, for most problems an abundant quantity of unlabeled data is not difficult to procure. Several advances in alternative machine learning paradigms that allow models to leverage unlabeled data, such as self-supervised learning and semi-supervised learning, have been thoroughly investigated. Several other techniques that fall under the umbrella of weakly supervised learning have also attempted to utilize unlabeled data for training machine learning models. In addition, most modern applications of deep learning are deployed at industrial scale and need to exhibit consistent performance. For example, for an autonomous driving system to be deemed fit for deployment, an exhaustive test of its calibration needs to be performed.
Semi-supervised approaches focus on utilizing vast amounts of unlabeled data to improve models trained on limited labelled data. Basic approaches to semi-supervised learning use a well-known technique called pseudo-labelling, where the trained model infers labels on the unlabeled data, which are then incorporated into the labelled set for training. This expands the labelled set and hence, in most cases, leads to a better supervised model. Other popular approaches for semi-supervised learning include transduction, label propagation, and consistency regularization [25, 15, 22]. All these approaches depend on the supervised model trained on the labelled data, which guides the utilization of unlabeled data. There have been several variants of the pseudo-labelling approach, depending on how the pseudo-labelled data is used. Self-training with student-teacher models uses a trained model (teacher) to assign labels to unlabeled data and then re-trains another model (student) on the labelled and pseudo-labelled data; this is repeated for a few iterations. In this paper we focus on this popular semi-supervised learning approach and design several experiments to investigate the effect of using an ensemble of teacher models on semi-supervised learning and model calibration. Such observations also lead us toward understanding the balance between the incorporation of unlabeled data and model calibration.
Pseudo-labelling has always been one of the most convenient approaches for incorporating unlabeled data into a supervised model, owing to its simplicity of implementation and its effectiveness in yielding encouraging results. Another interesting observation is that pseudo-labelling induces an entropy-minimization effect on the model: since it selects the data on which the model is most confident, such points lie away from the decision boundary, and training on them minimizes prediction entropy. However, several prominent works in this area agree that pseudo-labelling may in some cases incorporate incorrectly labelled data. The vanilla pseudo-labelling technique uses a trained network to label each unlabeled sample according to its highest confidence score, but the model may produce incorrect inferences with high confidence. In addition, poor calibration of networks may also result in such incorrect labelling. To mitigate this effect, we adopt the self-training approach and carefully select a sample of the unlabeled data instead of taking the entire set. This allows us to discard samples that might include label noise or out-of-distribution data, although our study is mostly empirical and we do not provide a formal proof. The salient contributions of our work are as follows:
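The vanilla confidence-based selection described above can be sketched as follows. This is a minimal illustration, not the paper's method; the threshold value and function names are assumptions.

```python
import numpy as np

def pseudo_label(probs, threshold=0.95):
    """Vanilla pseudo-labelling: keep only unlabeled samples whose
    maximum predicted probability exceeds a confidence threshold,
    and assign them the argmax class as a hard pseudo-label."""
    conf = probs.max(axis=1)           # model confidence per sample
    keep = conf >= threshold           # high-confidence mask
    labels = probs.argmax(axis=1)      # hard pseudo-labels
    return np.where(keep)[0], labels[keep]

# Toy example: 3 unlabeled samples, 4 classes.
probs = np.array([[0.97, 0.01, 0.01, 0.01],   # confident -> kept
                  [0.40, 0.30, 0.20, 0.10],   # uncertain -> dropped
                  [0.02, 0.02, 0.96, 0.00]])  # confident -> kept
idx, labels = pseudo_label(probs)
```

Note that this selection is exactly where miscalibration hurts: a confidently wrong prediction passes the threshold and injects label noise.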
We design a framework with an ensemble of teacher models where iterative label improvement is performed. We show the effect of carefully selecting only a sample of unlabeled data and increasing the size of this sample on model accuracy.
We also exhibit a comparative analysis of model calibration on several variants of our approach. We show that the ensemble approach of self-training performs better in terms of model calibration compared to vanilla self-training for semi-supervised learning.
We perform experiments on one of the most popular databases for semi-supervised learning, the STL-10 database, and follow up with an extensive set of analyses.
2 Problem Formulation
We consider a collection of examples $X = \{x_1, \dots, x_N\}$ with $x_i \in \mathcal{X}$. The first $L$ examples are labeled as $\{(x_i, y_i)\}_{i=1}^{L}$, where each $y_i$ belongs to a discrete set over $C$ classes, i.e. $y_i \in \{1, \dots, C\}$. The remaining examples $x_i$ for $i = L+1, \dots, N$ are unlabeled. Denote by $U = \{x_{L+1}, \dots, x_N\}$ the unlabeled set; then $X$ is the disjoint union of the two sets, $X = D \sqcup U$, where $D = \{x_1, \dots, x_L\}$ is the labeled set. In supervised learning, we use the labeled examples with their corresponding labels to train a classifier that learns to predict class labels for previously unseen examples. The guiding principle of semi-supervised learning is to leverage the unlabeled examples as well to train the classifier.
We assume a deep convolutional neural network (DCNN) based classifier $f_\theta$ trained on the labeled set of examples, which maps an example $x$ to a vector of class probabilities $f_\theta(x) \in [0,1]^C$, where $\theta$ are the parameters of the model. The model is trained by minimizing the supervised loss

$$\mathcal{L}_s(\theta) = \frac{1}{L} \sum_{i=1}^{L} \ell\big(f_\theta(x_i), y_i\big). \qquad (1)$$

A typical choice for the loss function $\ell$ in classification is the cross-entropy

$$\ell\big(f_\theta(x), y\big) = -\sum_{c=1}^{C} \mathbb{1}[y = c] \, \log f_\theta(x)_c.$$

The DCNN can be thought of as the composition of two networks: a feature-extraction network $g$, which transforms an input example to a vector of features, and a classification network $h$, which maps the feature vector to the class vector, so that $f_\theta = h \circ g$. Let $z = g(x)$ be the feature vector of $x$. The classification network $h$ is usually a fully-connected layer on top of $g$. The output of the network for $x$ is $f_\theta(x) = h(g(x))$, and the final prediction is the class with the highest probability score, i.e. $\hat{y} = \arg\max_c f_\theta(x)_c$. A trained classifier (at least the feature-generator network) is the starting point of most semi-supervised learning techniques, including the studies performed in this work.
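The decomposition $f_\theta = h \circ g$ can be sketched in PyTorch as follows. The architecture here is purely illustrative (the paper's actual models are not specified at this point); only the split into a feature extractor and a fully-connected head reflects the text.

```python
import torch
import torch.nn as nn

class DCNN(nn.Module):
    """Minimal sketch of f = h ∘ g: g extracts features,
    h is a fully-connected classification head."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.g = nn.Sequential(                  # feature-extraction network
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.h = nn.Linear(16, num_classes)      # classification network

    def forward(self, x):
        return self.h(self.g(x))                 # logits; softmax gives p(y|x)

model = DCNN()
logits = model(torch.randn(2, 3, 96, 96))        # STL-10-sized inputs
pred = logits.softmax(dim=1).argmax(dim=1)       # class with highest score
```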
Semi-supervised Learning (SSL): There are two main schools of SSL approaches for image classification:
Consistency Regularization: An additional loss term called unsupervised-loss is added for either all images or for only unlabeled ones which encourages consistency under various transformations of the data.
$$\mathcal{L}_u(\theta) = \sum_{x} d\big(f_\theta(x), f_\theta(T(x))\big),$$

where $T(x)$ is a transformation of $x$. A common choice for the consistency loss $d$ is the squared Euclidean distance.
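A consistency term of this form can be sketched as below; the squared-Euclidean choice of $d$ follows the text, while the toy model and identity transform are illustrative assumptions.

```python
import torch

def consistency_loss(model, x, transform):
    """Unsupervised consistency term: predictions on x and on a
    transformed view T(x) should agree (squared Euclidean distance)."""
    p = model(x).softmax(dim=1)
    q = model(transform(x)).softmax(dim=1)
    return ((p - q) ** 2).sum(dim=1).mean()

# Sanity check with an identity transform: the loss must be zero.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(4, 3))
x = torch.randn(5, 4)
loss = consistency_loss(model, x, lambda t: t)
```

In practice $T$ would be a stochastic augmentation (flips, crops, noise), so the loss is nonzero and pushes the model toward transformation-invariant predictions.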
Pseudo-labeling: The unlabeled examples are assigned pseudo-labels, thereby expanding the labeled set to all of $X$. A model is then trained on this labeled set using the supervised loss for the true-labeled examples plus a similar loss for the pseudo-labeled examples.
The current work fits in the realm of the latter school, where we study the effect of iteratively adding pseudo-labeled examples for self-training.
Self-Training using Student-Teacher Models: This class of methods for SSL iteratively uses a trained (teacher) model to pseudo-label a set of unlabeled examples, and then re-trains the model (now the student) on the labelled plus pseudo-labelled examples. Usually the same model assumes the dual role of the student (as the learner) and the teacher (it generates labels, which it then uses as a student for learning). A model $f_{\theta_0}$ is trained on the labelled data $D$ (using the supervised loss, equation 1) and is then employed for inference on the unlabeled set $U$. The prediction vectors are converted to one-hot vectors,

$$\tilde{y}_i = \arg\max_c f_\theta(x_i)_c \quad \text{for } x_i \in U. \qquad (2)$$

These examples along with their corresponding (pseudo-)labels are added to the original labelled set. This extended labelled set is used to train another (student) model $f_{\theta_1}$. This procedure is repeated: the current student model is used as a teacher in the next phase to obtain pseudo-labels for training another (student) model on the set $D \cup U$. Conventional self-training methods use the entire unlabeled set $U$ in every iteration. However, the most general form of self-training can use different subsets of unlabeled data ($U_1$, $U_2$, and so on) in every iteration. The selection of $U_t$ from $U$ can come from any utility function whose objective is to use the most appropriate unlabeled samples in each iteration. Some methods even use weights for each (labelled/unlabeled) data sample, updated in every iteration, similar to the process followed in transductive semi-supervised learning methods, which is borrowed from the traditional concept of boosting in statistics.
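The vanilla single-teacher loop described above can be sketched generically. The `NearestCentroid` classifier is a toy stand-in for the DCNN (any model exposing `fit`/`predict_proba` works); the function and class names are assumptions for illustration.

```python
import numpy as np

class NearestCentroid:
    """Toy stand-in classifier for demonstration only."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def predict_proba(self, X):
        d = ((X[:, None, :] - self.centroids_[None]) ** 2).sum(axis=-1)
        p = np.exp(-d)
        return p / p.sum(axis=1, keepdims=True)

def self_train(train_fn, X_l, y_l, X_u, rounds=3):
    """Student-teacher self-training: the teacher labels U,
    and a fresh student is re-trained on D plus pseudo-labels."""
    model = train_fn(X_l, y_l)                            # initial teacher
    for _ in range(rounds):
        pseudo = model.predict_proba(X_u).argmax(axis=1)  # one-hot pseudo-labels (eq. 2)
        X = np.vstack([X_l, X_u])                         # labelled + pseudo-labelled
        y = np.concatenate([y_l, pseudo])
        model = train_fn(X, y)                            # re-train as student
    return model

X_l = np.array([[0.0, 0.0], [4.0, 4.0]]); y_l = np.array([0, 1])
X_u = np.array([[0.5, 0.5], [3.5, 3.5]])
model = self_train(lambda X, y: NearestCentroid().fit(X, y), X_l, y_l, X_u)
```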
Calibration and Robustness: It is desirable that a classifier provide a calibrated uncertainty measure in addition to accurate predictions. The concept of model calibration stems from the question of how probabilistically correct a classifier is on unseen test data. A classifier is well-calibrated if the probability associated with the predicted class label matches the probability of that prediction being correct. Recently, several studies [18, 7, 17, 14] have illustrated its importance and demonstrated that deep neural network based classifiers are not well-calibrated. Such models often give overconfident predictions, evident from the observation that their average correctness (accuracy, $\mathrm{Acc}$) on unseen test data is far lower than the average maximum prediction probability for the most probable class. We study the calibration of a classifier in terms of its calibration error.
For a classifier $f_\theta$, the average maximum prediction probability is defined as

$$\bar{p}_{\max} = \frac{1}{n} \sum_{i=1}^{n} \max_c f_\theta(x_i)_c,$$

where $n$ is the total number of test samples. The calibration error of the classifier is defined as

$$E_{cal} = \bar{p}_{\max} - \mathrm{Acc}.$$

The lower the value of $E_{cal}$, the better calibrated the model is.
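The calibration error defined above (average confidence minus accuracy, so positive values indicate overconfidence) is straightforward to compute; this sketch uses toy data to illustrate.

```python
import numpy as np

def calibration_error(probs, labels):
    """E_cal = average maximum prediction probability (confidence)
    minus test accuracy; positive values mean overconfidence."""
    confidence = probs.max(axis=1).mean()
    accuracy = (probs.argmax(axis=1) == labels).mean()
    return confidence - accuracy

# An overconfident toy model: always 90% confident, only 50% correct.
probs = np.array([[0.9, 0.1], [0.9, 0.1]])
labels = np.array([0, 1])
err = calibration_error(probs, labels)   # 0.9 - 0.5 = 0.4
```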
An additional goal of the current study is to investigate the effect of pseudo-labeling on model robustness, which will be measured via the calibration error. Contemporary applications such as self-driving cars, video surveillance systems, etc., are high-regret settings and require well-calibrated models.
3 Proposed Approach

In the following, we describe our recipe for iterative pseudo-labeling based semi-supervised learning. In a nutshell, we iteratively pseudo-label the unlabeled examples and use a set of high-confidence pseudo-labeled examples for re-training the model. There are three main ingredients: sub-sampling, training, and pseudo-labeling. These are discussed below. A graphical overview of the approach is shown in Figure 1.
Sub-sampling: Let $D$ be a set of $L$ training examples where each $x_i$ has a corresponding label $y_i$. We generate $k$ random samples of the training set, $D_1, \dots, D_k$, each of size $m$. An example can be present in more than one of the samples, i.e. we do not require the samples to be disjoint.
Model Training: We train $k$ separate models, one on each of the samples of the training data. The models are also chosen to have different architectures, with their separate parameters $\theta_1, \dots, \theta_k$. The unlabeled examples are then fed to each of the trained models to infer their corresponding probability vectors: we obtain $p_i^{(j)} = f_{\theta_j}(x_i)$ for each $x_i \in U$ and $j = 1, \dots, k$.
Pseudo-labeling: To assign pseudo-labels to the unlabeled examples, we take the ensemble of the predictions of the individual models, $\bar{p}_i = \frac{1}{k} \sum_{j=1}^{k} p_i^{(j)}$. The unlabeled examples are then sorted by the entropy of their ensemble prediction vectors $\bar{p}_i$, and the first $n$ examples with the lowest entropy are selected. We assign pseudo-labels to these examples as $\tilde{y}_i = \arg\max_c \bar{p}_{i,c}$.
The rationale for entropy sorting is that a lower-entropy prediction vector implies that the model is more confident on those examples. Some SSL approaches (e.g. label propagation) use entropy as a measure of uncertainty to assign weights to the pseudo-labels. In that sense, selecting the top $n$ examples translates to assigning a weight of 1 to the first $n$ examples and a weight of 0 to all the others. These examples with their corresponding pseudo-labels are added to the training set.
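The ensemble averaging and entropy-based selection just described can be sketched as follows; the function name and toy prediction vectors are illustrative assumptions.

```python
import numpy as np

def select_confident(prob_list, n):
    """Average the ensemble members' prediction vectors, then keep the
    n unlabeled samples whose averaged prediction has lowest entropy."""
    p = np.mean(prob_list, axis=0)                  # ensemble average
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1)  # Shannon entropy
    idx = np.argsort(entropy)[:n]                   # n most confident samples
    return idx, p[idx].argmax(axis=1)               # indices and pseudo-labels

# Two ensemble members, three unlabeled samples, two classes.
m1 = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
m2 = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])
idx, labels = select_confident([m1, m2], n=2)
```

The middle sample (near-uniform prediction, highest entropy) is the one dropped, matching the weight-0 interpretation above.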
The procedure we follow is outlined in the following steps:
Let $D^{(0)} = D$ be the initial training data.
For $t = 0$ to $T-1$, perform the following steps:
Create $k$ random samples of $D^{(t)}$ as described above.
Train a separate model on each of the $k$ samples.
For each $x_i \in U$, get its prediction vector $p_i^{(j)} = f_{\theta_j}(x_i)$ from each of the $k$ models.
Compute the ensemble probability vector $\bar{p}_i = \frac{1}{k} \sum_{j=1}^{k} p_i^{(j)}$.
Sort the unlabeled examples by entropy, and choose the first $n$ examples $U_t \subseteq U$.
Assign one-hot-encoded pseudo-labels $\tilde{y}_i = \arg\max_c \bar{p}_{i,c}$ for all $x_i \in U_t$.
Create the training data for the next iteration as $D^{(t+1)} = D \cup \{(x_i, \tilde{y}_i) : x_i \in U_t\}$.
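The full loop above can be sketched end-to-end. The `NearestCentroid` classifier is a toy stand-in for the DCNN ensemble, and the per-class (stratified) subsampling is a simplification so every toy model sees all classes; the paper's subsampling is purely random.

```python
import numpy as np

class NearestCentroid:
    """Toy classifier standing in for the DCNNs; illustrative only."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def predict_proba(self, X):
        d = ((X[:, None, :] - self.centroids_[None]) ** 2).sum(axis=-1)
        p = np.exp(-d)
        return p / p.sum(axis=1, keepdims=True)

def iterative_ensemble_ssl(X_l, y_l, X_u, k=3, m=None, n_add=2, rounds=2, seed=0):
    rng = np.random.default_rng(seed)
    m = m or len(X_l)
    X, y = X_l, y_l                                    # D^(0) = D
    for _ in range(rounds):
        models = []
        for _ in range(k):                             # k random subsamples
            # stratified draw (with replacement) so every toy model sees
            # all classes -- a simplification for this sketch
            idx = np.concatenate([
                rng.choice(np.where(y == c)[0],
                           size=max(1, m // len(np.unique(y))), replace=True)
                for c in np.unique(y)])
            models.append(NearestCentroid().fit(X[idx], y[idx]))
        p = np.mean([mo.predict_proba(X_u) for mo in models], axis=0)  # ensemble
        ent = -(p * np.log(p + 1e-12)).sum(axis=1)     # Shannon entropy
        top = np.argsort(ent)[:n_add]                  # n lowest-entropy samples
        X = np.vstack([X_l, X_u[top]])                 # D^(t+1) = D ∪ selected
        y = np.concatenate([y_l, p[top].argmax(axis=1)])
    return X, y

X_l = np.array([[0.0, 0.0], [4.0, 4.0], [0.1, 0.1], [3.9, 3.9]])
y_l = np.array([0, 1, 0, 1])
X_u = np.array([[0.3, 0.3], [3.6, 3.6], [2.0, 2.0]])   # last point is ambiguous
X_t, y_t = iterative_ensemble_ssl(X_l, y_l, X_u)
```

The ambiguous point midway between the clusters gets a near-uniform ensemble prediction and is never selected, which is the mechanism by which entropy sorting filters out hard or out-of-distribution samples.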
Table 1: Test accuracy of the ensemble variants across self-training iterations.

Method                           Iter 1   Iter 2   Iter 3   Iter 4
Ensemble (Without Subsampling)   0.7045   0.7651   0.7691   0.7743
Ensemble (With Subsampling)      0.7045   0.7672   0.7855   0.7888
3.1 Relation to Transductive SSL
A simple technique for pseudo-labeling based SSL is to first train a network on the labeled examples and then assign pseudo-labels to the unlabeled examples according to equation 2 for all $x_i \in U$. A limitation of this approach is that it ignores the variation in the degree of uncertainty of the pseudo-labels: all pseudo-labelled examples are treated the same and are all used in model training. Some methods instead use uncertainty weights in the loss function during training on the pseudo-labelled examples. In the current work, we use only a set of high-confidence pseudo-labelled examples for training the model, and we follow an iterative procedure of pseudo-labeling and model training. Our approach, however, uses only a subset of pseudo-labelled examples at each step of the iteration. Moreover, the model at each step of the iteration is an ensemble of models, each trained on a separate sample of the data.
3.2 Relation to Teacher-Student Models
Our approach fits the teacher-student paradigm, since we follow the same philosophy of iterating between generating pseudo-labels (teacher) and re-training the model (student). The main difference is that we use an ensemble of models to pseudo-label the unlabeled examples, and at each iteration the pseudo-labelled examples are used to train different models. Following the analogy, as a teacher the ensemble aggregates predictions from multiple models and disseminates the learnings to different students.
There are some variants of the teacher-student framework, e.g. noisy student and mean teacher, that have used ensembling for semi-supervised learning. The utility of self-training has been exhibited in these studies; however, they differ significantly from the present study. These approaches explore neither iterative ensembling nor the use of only a subset of the unlabeled data in each iteration. The current study also uses a greedy incremental selection of unlabeled data based on the entropy of their prediction vectors, which distinguishes it from other related work in this domain.
4 Experiments

This section outlines the experiments, followed by analysis and ablation studies for additional insights and discussion.
4.1 Database

We perform experiments on the STL-10 database, which contains 113,000 images of 10 object categories at a resolution of 96x96 pixels. The training set contains 5,000 images and the testing set contains 8,000 images: each class has 500 images for training and 800 for testing. The remaining images are unlabeled and include images from a different distribution (classes other than those in the training set).
4.2 Experimental Protocol
Our study is focused on the impact of ensembling teacher models on semi-supervised learning and model calibration. To present this investigation in a structured fashion, we conduct three primary experiments, as follows:
Ensemble with Subsampling: To create the ensembles, we randomly select 4,000 of the 5,000 labelled samples in the database. We repeat this process to prepare 3 such sets of 4,000 samples each; a data sample may therefore appear in more than one set. We train 3 separate models, with different architectures and separate sets of parameters, on these 3 sets, so each model is trained on 4,000 examples. Once these models are trained, we use an ensemble of them to pseudo-label the unlabeled data. In the next iteration, 3 models are again trained on the labelled plus pseudo-labelled examples. This is essentially an iterative boosting-like method in which each iteration resembles a bagging-like approach.
Ensemble without Subsampling: This experiment is similar to the above, the only difference being that we do not subsample a smaller set for training different models. The 3 separate models are trained on all of the labelled and pseudo-labelled examples in each iteration. For example, in the first iteration we use all of the 5000 samples to train the models, and then the ensemble is used to pseudo-label the unlabeled examples. This experiment helps us understand the effect of sub-sampling while training an ensemble of teacher models.
Non-Ensemble Approach: In this experiment we use a conventional self-training paradigm in which only one teacher model is trained, and it alone pseudo-labels the unlabeled data in every iteration. This experiment allows us to observe the benefits of our ensembling approach.
Table 2: Calibration error ($E_{cal}$) of the ensemble variants across self-training iterations.

Method                           Iter 1   Iter 2   Iter 3   Iter 4
Ensemble (Without Subsampling)    0.018    0.025    0.042    0.076
Ensemble (With Subsampling)      -0.013    0.013    0.032    0.051
The above three investigations differ only in the ensembling process; the subsequent steps are the same across experiments. The steps are as follows.
Averaging Predictions: The constituent models are used for inference on the unlabeled data. The prediction vectors from the ensemble models are averaged, giving one prediction vector for each unlabeled sample.
Entropy Sorting: In this step our objective is to use only those unlabeled data samples on which the ensemble is most confident. Such predictions are represented by low entropy on the prediction vector. So the unlabeled samples corresponding to the lowest entropy prediction vectors are chosen for training the individual models for the next iteration.
Pseudo-labelling: The prediction vector obtained in the previous step is converted into a one-hot label for the unlabeled sample. This (pseudo-)label would be used in the next iteration to train the next set of models. It should also be noted that in later iterations, previously pseudo-labelled data (from the unlabeled set) is again put through this step with the latest model available in that iteration.
Combining labeled and unlabeled data: The unlabeled samples yielding the lowest entropy participate in the next iteration of training the ensemble models, along with the labeled data. This combination goes through the sub-sampling process (in the first experiment only), and then the steps above are repeated.
4.3 Implementation Details
We use three DCNN models with different architectures. These networks are used in subsequent iterations of our approach, and their structure and parameters are kept consistent throughout. On the STL-10 database, we randomly subsample 4,000 labelled samples with replacement and prepare 3 such sets; on each set we train a separate model. The ensembling process takes the simple average of the 3 prediction vectors for each unlabeled sample. For the entropy-sorting step we use Shannon entropy. The models are optimized by backpropagation with the Adam optimizer and a batch size of 16. The augmentations used in training on the labelled and pseudo-labelled data are Random Horizontal Flip with probability 0.25, Random Vertical Flip with probability 0.25, Random Rotation in (-20 degrees, 20 degrees), Random Horizontal Shift in (-0.25, 0.25), Random Vertical Shift in (-0.25, 0.25), and Random Shear in (-0.25, 0.25). We use the PyTorch library for all three models. Our code was run on a system with an Intel Xeon W-1290E CPU and a Quadro RTX 6000 GPU, with 64 GB of RAM.
The results for the three experiments are outlined in Table 1 and Table 2. The primary observation is that the ensemble-with-subsampling framework gives the best results in terms of both accuracy (Table 1) and calibration error (Table 2). The experiment with ensembling but without subsampling suffers from calibration issues, although its loss in accuracy at the final iteration is marginal. This leads to the primary takeaway of this study: for semi-supervised learning, an ensemble of teacher models gives good results while keeping the calibration error in check. The three models were chosen to be very different from each other in architecture, although they are not state-of-the-art models; they were kept relatively simple to exhibit the efficacy of the ensemble. We present the salient analyses of our study in the next subsection.
5 Analysis of Results
In this section we delve into the analysis of our investigation from the set of experiments performed.
5.1 Effect of Ensembling
The most promising aspect of our study is the effect of ensembling the teacher models, which has a clear impact not only on the accuracy in each iteration but also on the calibration error. When we compare this to the experiment where only individual teacher models are trained, we observe that they lose out on both metrics. This leads us to conclude that an ensemble of teacher models provides improved performance compared to conventional self-training with individual teacher models.
5.2 Effect of Sample Size
Our experiments provide useful insights into the size of the sample drawn from the unlabeled pool. Previous studies on self-training with student-teacher models have utilized the entire pool of unlabeled data. In our case, however, a real-world scenario is depicted, since the database we use contains label noise in the form of out-of-distribution data. Moreover, using a sample of the unlabeled data also allows us to study the impact of sample size on accuracy and model calibration. For these reasons, our approach follows the iterative label-improvement strategy, and we observe that as the sample size of unlabeled data increases, accuracy on the unseen test set (which is kept constant across all experiments and iterations) improves. However, beyond a certain level the accuracy improvement is marginal.
5.3 Model Calibration
The first observation of this study of classifier calibration is that incorporating pseudo-labels results in less-calibrated models: the calibration error increases in each iteration as we add more pseudo-labelled data. Thus pseudo-label based paradigms such as self-training, SSL, active learning, etc. reduce the robustness of the models. The advantage of the ensemble-based approach is evident, as it provides enhanced calibration, exhibited by the lower calibration error of the ensemble compared to the individual models (Table 2). We also note that sub-sampling before ensembling helps further reduce the calibration error. This yields a model with not only better accuracy but also greater probabilistic correctness.
6 Conclusion and Future Work
We present an extensive study of self-training using the student-teacher paradigm, in which iterative label improvement is performed by repeated pseudo-labelling of unlabeled data. We also show that entropy-based selection and using a sample of the unlabeled data provide a good trade-off between accuracy and training time, while keeping model calibration in check. We would like to study the effects of temperature scaling, soft labels, sharpening, Mixup operations, etc. on the ensembling-based paradigm presented in this study. We plan to extend this investigation into a proposed approach for semi-supervised learning that can outperform present state-of-the-art approaches while maintaining good model calibration and robustness.
-  (2019) MixMatch: a holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems, Vol. 32. Cited by: §1.
-  (2009) Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews]. IEEE Transactions on Neural Networks 20 (3), pp. 542–542. Cited by: §4.3.
-  (2011) An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 215–223. Cited by: §1, §4.1.
-  (2015) Unsupervised visual representation learning by context prediction. In IEEE International Conference on Computer Vision, pp. 1422–1430. Cited by: §1.
-  (2009) Active learning: an introduction. ASQ higher education brief 2 (4), pp. 1–5. Cited by: §4.3.
-  (2005) Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, Vol. 17. Cited by: §1.
-  (2017) On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330. Cited by: §1, §1, §2.
-  (2020) GhostNet: more features from cheap operations. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1580–1589. Cited by: §1.
-  (2017) Mask R-CNN. In IEEE International Conference on Computer Vision, pp. 2961–2969. Cited by: §1.
-  (2016) Deep residual learning for image recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §4.3.
-  (2018) VideoMatch: matching based video object segmentation. In European Conference on Computer Vision, pp. 54–70. Cited by: §1.
-  (2019) Label propagation for deep semi-supervised learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5070–5079. Cited by: §1.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.3.
-  (2019) Beyond temperature scaling: obtaining well-calibrated multiclass probabilities with dirichlet calibration. arXiv preprint arXiv:1910.12656. Cited by: §2.
-  (2016) Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242. Cited by: §1.
-  (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In ICML Workshop on Challenges in Representation Learning, Vol. 3. Cited by: §1, §1, §3.1.
-  (2019) Unsupervised temperature scaling: an unsupervised post-processing calibration method of deep networks. arXiv preprint arXiv:1905.00174. Cited by: §2.
-  (2005) Predicting good probabilities with supervised learning. In International Conference on Machine Learning, pp. 625–632. Cited by: §2.
-  (2019) PyTorch: an imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703. Cited by: §4.3.
-  (2004) Does active learning work? a review of the research. Journal of Engineering Education 93 (3), pp. 223–231. Cited by: §4.3.
-  (2020) Multi-task self-supervised learning for robust speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6989–6993. Cited by: §1.
-  (2016) Regularization with stochastic transformations and perturbations for deep semi-supervised learning. arXiv preprint arXiv:1606.04586. Cited by: §1.
-  (2018) Transductive semi-supervised deep learning using min-max features. In European Conference on Computer Vision, pp. 299–315. Cited by: §1, §2, §3.1.
-  (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114. Cited by: §1, §4.3.
-  (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. arXiv preprint arXiv:1703.01780. Cited by: §1, §3.2.
-  (2019) Pseudo-LiDAR from visual depth estimation: bridging the gap in 3D object detection for autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8445–8453. Cited by: §1.
-  (2020) Self-training with noisy student improves ImageNet classification. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10687–10698. Cited by: §1, §2, §3.2.
-  (2016) Wide residual networks. British Machine Vision Conference. Cited by: §4.3.
-  (2020) Semantics-aware BERT for language understanding. In AAAI Conference on Artificial Intelligence, Vol. 34, pp. 9628–9635. Cited by: §1.
-  (2018) A brief introduction to weakly supervised learning. National science review 5 (1), pp. 44–53. Cited by: §1.
-  (2020) Incorporating BERT into neural machine translation. arXiv preprint arXiv:2002.06823. Cited by: §1.
-  (2009) Introduction to semi-supervised learning. Synthesis lectures on artificial intelligence and machine learning 3 (1), pp. 1–130. Cited by: §1, §4.3.