Official implementation of "Pseudo-Labeling and Confirmation Bias in Deep Semi-Supervised Learning"
Semi-supervised learning, i.e. jointly learning from labeled and unlabeled samples, is an active research topic due to its key role in relaxing human annotation constraints. In the context of image classification, recent advances to learn from unlabeled samples are mainly focused on consistency regularization methods that encourage invariant predictions for different perturbations of unlabeled samples. We, conversely, propose to learn from unlabeled data by generating soft pseudo-labels using the network predictions. We show that naive pseudo-labeling overfits to incorrect pseudo-labels due to the so-called confirmation bias and demonstrate that label noise and mixup augmentation are effective regularization techniques for reducing it. The proposed approach achieves state-of-the-art results in CIFAR-10/100 and Mini-ImageNet despite being much simpler than other state-of-the-art methods. These results demonstrate that pseudo-labeling can outperform consistency regularization methods, while the opposite was supposed in previous work. Source code is available at <https://git.io/fjQsC5>.
Convolutional neural networks (CNNs) have become the dominant approach for many computer vision tasks [8, 17, 2, 18, 38, 11, 37]. The main requirement to best exploit them is the availability of vast amounts of labeled data. Obtaining such volumes of data, however, is not trivial, and the research community is exploring alternatives to alleviate this [15, 35, 23, 19].
Knowledge transfer via deep domain adaptation is a popular alternative that seeks to learn more transferable representations from source to target domains by embedding domain adaptation in the learning pipeline. Other approaches focus exclusively on learning useful representations from scratch in a target domain when annotation constraints are relaxed [23, 9, 5]. Semi-supervised learning focuses on scenarios with sparsely labeled data and extensive amounts of unlabeled data; learning with label noise seeks robust learning when labels are obtained automatically and may not represent the image content; and self-supervised learning uses data supervision to learn from unlabeled data in a supervised manner. This paper focuses on semi-supervised learning for image classification, a recently very active research area.
Semi-supervised learning is a transversal task for different domains including images, audio, time series, and text. Recent approaches in image classification are primarily focused on exploiting the consistency in the predictions for the same sample under different perturbations (consistency regularization) [27, 16], while other approaches directly generate labels for the unlabeled data to guide the learning process (pseudo-labeling) [14, 10]. Consistency regularization and pseudo-labeling approaches apply different strategies such as a warm-up phase where training is performed primarily using labeled data [27, 13, 31, 25, 16, 10], uncertainty weighting [28, 16], adversarial attacks [22, 25, 10], or graph-consistency [20, 10]. These strategies deal with confirmation bias, also known as the noise accumulation problem. This bias stems from using incorrect predictions on unlabeled data for training in subsequent epochs, thereby increasing confidence in incorrect predictions and producing a model that will tend to resist new changes.
This paper explores pseudo-labeling for semi-supervised deep learning from the network predictions and shows that simple modifications to prevent confirmation bias lead to state-of-the-art performance for semi-supervised learning in CIFAR and Mini-ImageNet. We adopt an approach similar to one previously proposed for relabeling in the context of label noise and apply it exclusively on unlabeled samples. Experiments show that this naive pseudo-labeling is limited by confirmation bias as prediction errors are fit by the network (see Figure 1(a)). To deal with this issue, we propose to use random label noise (i.e. randomly modifying the pseudo-labels for unlabeled samples) as an effective regularization to alleviate confirmation bias. We find that combining mixup augmentation and random label noise further prevents this bias (see Subsection 4.2) and achieves state-of-the-art results (see Subsection 4.3). The proposed method does not require multiple networks to achieve state-of-the-art results [31, 25, 16], nor does it require over a thousand epochs of training to achieve peak performance in every dataset, nor does it need many (ten) forward passes for each sample during training. Compared to other pseudo-labeling approaches, the proposed approach is simpler, in that it does not require graph construction and diffusion or combination with consistency regularization methods, but still achieves state-of-the-art results. Additionally, we are the first to show that pseudo-labeling can be a valuable alternative for semi-supervised learning, as opposed to previous results in the state of the art.
Semi-supervised learning for image classification is an active research topic; this section focuses on reviewing work closely related to ours, discussing methods that use deep learning with mini-batch optimization over large image collections. Previous work on semi-supervised deep learning differs in whether it uses consistency regularization or pseudo-labeling to learn from the unlabeled set, while all methods share the use of a cross-entropy loss (or similar) on labeled data.
Methods based on consistency regularization impose a simple assumption on the unlabeled data in the training objective: the same sample under different perturbations must produce the same output. This idea has been applied using randomized data augmentation, dropout, and random max-pooling while forcing softmax predictions to be similar. A similar idea is applied by the so-called Π-model, which also extends the perturbation to different epochs during training, i.e. the current prediction for a sample has to be similar to an ensemble of predictions of the same sample in the past. Here the different perturbations come from networks at different states, dropout, and data augmentation. The temporal ensembling method has also been interpreted as a teacher-student problem where the network is both a teacher that produces targets for the unlabeled data as a temporal ensembling, and a student that learns the generated targets by imposing the consistency regularization. This interpretation naturally redefines the problem to deal with confirmation bias by separating the teacher and the student. The teacher is defined as a different network with similar architecture whose parameters are updated as an exponential moving average of the student network weights during training. This method has been extended by applying an uncertainty weight over the unlabeled samples to incrementally learn from the unlabeled samples with low uncertainty, with uncertainty defined as the variance or entropy of the predictions for each sample under random perturbations. Additionally, Virtual Adversarial Training (VAT) has been used to carefully introduce perturbations to data samples as adversarial noise and later impose consistency regularization on the predictions. More recently, Luo et al. propose to use a contrastive loss on the predictions as a regularization that forces predictions to be similar (different) when they are from the same (different) class. This method extends the consistency regularization previously considered only between the same data samples [13, 31, 22] to between different samples. Their method can be naturally combined with existing consistency regularization methods to boost their performance. Similarly, Interpolation Consistency Training (ICT) is a method inspired by mixup that encourages predictions at interpolated unlabeled samples to be consistent with the interpolated predictions of individual samples. Additionally, the authors adopt mean teachers to estimate the targets used in the consistency regularization of unlabeled samples.
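The teacher update described above (an exponential moving average of the student weights) can be illustrated with a minimal sketch. The function name `ema_update`, the flat list representation of the weights, and the decay values are assumptions made for illustration only; they are not taken from any of the cited implementations.

```python
def ema_update(teacher_weights, student_weights, decay=0.99):
    """Blend the teacher toward the student: t <- decay * t + (1 - decay) * s."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_weights, student_weights)]


# Toy example with two scalar "weights" (hypothetical values).
teacher = [0.0, 0.0]
student = [1.0, -2.0]
teacher = ema_update(teacher, student, decay=0.9)
```

With a decay close to 1 the teacher changes slowly, which is what makes its targets more stable than the raw student predictions.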
Co-training combines several ideas from the previous works, using two (or more) networks trained simultaneously to agree in their predictions (consistency regularization) and disagree in their errors. Here the errors are defined as making different predictions when exposed to adversarial attacks, thus forcing different networks to learn complementary representations for the same samples. Recently, Chen et al. measure the consistency between the current prediction and an additional prediction of the same sample given by an external memory module that keeps track of previous representations of a sample. They additionally introduce an uncertainty weighting of the consistency term to reduce the contribution of uncertain sample predictions given by the memory module. Consistency regularization methods such as the Π-model, mean teachers, and VAT have all been shown to benefit from the recent stochastic weight averaging (SWA) method [24, 1]. SWA averages network parameters at different training epochs to move the SGD solution on borders of flat loss regions to their center and improve generalization.
Pseudo-labeling methods seek the generation of labels or pseudo-labels for unlabeled samples to guide the learning process. An early attempt at pseudo-labeling uses the network predictions as labels. However, it constrains the pseudo-labeling to a fine-tuning stage, i.e. there is a pre-training or warm-up, as with the consistency regularization approaches. A recent pseudo-labeling approach uses the network class prediction as hard labels for the unlabeled samples. It also introduces an uncertainty weight for each sample loss, which is higher for samples that have distant k-nearest neighbors in terms of feature representation distance. It further includes a loss term to encourage intra-class compactness and inter-class separation, and a consistency term between samples with different perturbations. This method is combined with mean teachers to achieve state-of-the-art performance. Finally, a recently published work implements pseudo-labeling through graph-based label propagation. The method alternates between two steps: training from labeled and pseudo-labeled data, and using the representations of the trained network to build a nearest neighbor graph where label propagation is applied to refine hard pseudo-labels for unlabeled images. It further adds an uncertainty score for every sample (based on softmax prediction entropy) and class (based on class population) to deal, respectively, with the network predictions not being equally confident over all unlabeled samples and with the class-imbalance problem.
It is important to highlight a widely used practice [27, 13, 31, 25, 16, 10]: a warm-up where labeled samples have a higher (or full) weight at the beginning of training to palliate the incorrect guidance of unlabeled samples early in training. Prior work has also revealed some limitations of current practices in semi-supervised learning, such as low-quality fully-supervised frameworks, the absence of comparison with transfer learning baselines, and excessive hyperparameter tuning on large validation sets (not available in real semi-supervised learning situations).
We formulate semi-supervised image classification as the task to learn a model $h_\theta(x)$ from a set of $N$ training examples $\mathcal{D}$. These samples are split into the unlabeled set $\mathcal{D}_u = \{x_i\}_{i=1}^{N_u}$ and the labeled set $\mathcal{D}_l = \{(x_i, y_i)\}_{i=1}^{N_l}$, with $y_i \in \{0, 1\}^C$ being the one-hot encoding ground-truth label for $C$ classes corresponding to $x_i$ and $N = N_l + N_u$. In our case, $h_\theta$ is a CNN and $\theta$ represents the model parameters (weights and biases). As we seek to perform pseudo-labeling for the $N_u$ unlabeled samples, we assume that a pseudo-label $\tilde{y}$ is available for these samples. We can then reformulate the problem as training using $\tilde{\mathcal{D}} = \{(x_i, \tilde{y}_i)\}_{i=1}^{N}$, with $\tilde{y} = y$ for the $N_l$ labeled samples.
The CNN parameters $\theta$ can be fit by optimizing categorical cross-entropy:

$$\ell^*(\theta) = -\sum_{i=1}^{N} \tilde{y}_i^T \log\left(h_\theta(x_i)\right), \tag{1}$$

where $\tilde{y}_i$ is the label or pseudo-label of sample $x_i$, $h_\theta(x_i)$ are the softmax probabilities produced by the model, and $\log(\cdot)$ is applied element-wise. A key decision is how to generate the pseudo-labels $\tilde{y}$ for the $N_u$ unlabeled samples. Previous approaches have used hard pseudo-labels (i.e. one-hot vectors) directly using the network output class [14, 28] or the class estimated after applying label propagation on a nearest neighbor graph. We adopt the former approach, but use soft pseudo-labels, as we have seen this outperforms hard labels, confirming previous observations in the context of relabeling when learning with label noise. In particular, we store the softmax predictions $h_\theta(x_i)$ of the network in every mini-batch of an epoch and use them to modify the soft pseudo-label $\tilde{y}$ for the $N_u$ unlabeled samples at the end of the epoch. We proceed as described from the second to the last training epoch, while in the first epoch we use the softmax predictions for the unlabeled samples from a model trained in a 10-epoch warm-up phase using labeled data.
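The end-of-epoch pseudo-label update described above can be sketched as follows. This is a simplified illustration, not the authors' code: the function name `update_pseudo_labels`, the list-based label storage, and the dictionary of stored predictions are assumptions.

```python
def update_pseudo_labels(targets, epoch_predictions, unlabeled_indices):
    """Replace the targets of unlabeled samples with the stored softmax outputs.

    targets: list of per-sample class distributions (one-hot for labeled samples).
    epoch_predictions: dict mapping sample index -> softmax output stored
        while that sample was processed during the epoch.
    unlabeled_indices: indices whose targets are soft pseudo-labels.
    """
    new_targets = [list(t) for t in targets]
    for i in unlabeled_indices:
        new_targets[i] = list(epoch_predictions[i])
    return new_targets


# Sample 0 is labeled (one-hot target); sample 1 is unlabeled.
targets = [[1.0, 0.0, 0.0], [1 / 3, 1 / 3, 1 / 3]]
stored = {0: [0.2, 0.7, 0.1], 1: [0.1, 0.6, 0.3]}
targets = update_pseudo_labels(targets, stored, unlabeled_indices=[1])
```

Labeled samples keep their ground-truth target even when the network disagrees with it, so only the unlabeled targets follow the network predictions.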
Moreover, we use two previously proposed regularizations to improve convergence. The first regularization deals with the difficulty of converging at early training stages when the network’s predictions are mostly incorrect and the CNN tends to predict the same class to minimize the loss. Assignment of all samples to a single class is discouraged by adding the following regularization term:

$$R_A = \sum_{c=1}^{C} p_c \log\left(\frac{p_c}{\bar{h}_c}\right), \tag{2}$$

where $p_c$ is the prior probability distribution for class $c$ and $\bar{h}_c$ denotes the mean softmax probability of the model for class $c$ across all samples in the dataset. As in prior work, we assume a uniform distribution $p_c = 1/C$ for the prior probabilities ($R_A$ stands for all classes regularization) and approximate $\bar{h}_c$ using mini-batches. The second regularization is needed to concentrate the probability distribution of each soft pseudo-label on a single class, thus avoiding the local optima in which the network might get stuck due to a weak guidance:

$$R_H = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} h_\theta^c(x_i) \log\left(h_\theta^c(x_i)\right), \tag{3}$$

where $h_\theta^c(x_i)$ denotes the $c$ class value of the softmax output $h_\theta(x_i)$, again using mini-batches (i.e. $N$ is replaced by the mini-batch size) to approximate this term. This second regularization is the average per-sample entropy ($R_H$ stands for entropy regularization), a well-known regularization technique in semi-supervised learning.

Finally, the total semi-supervised loss is:

$$\ell = \ell^* + \lambda_A R_A + \lambda_H R_H, \tag{4}$$

where $\lambda_A$ and $\lambda_H$ control the contribution of each regularization term (we set them to $\lambda_A = 0.8$ and $\lambda_H = 0.4$).
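To make the three loss terms concrete, the sketch below computes them on one mini-batch of softmax outputs in plain Python. It is a minimal illustration under stated assumptions: the function names, the small constant `EPS` added inside the logarithms, and averaging the cross-entropy over the mini-batch (rather than summing) are choices of this sketch, not details confirmed by the text.

```python
import math

EPS = 1e-12  # assumption: numerical guard so that log(0) is never evaluated


def cross_entropy(targets, probs):
    """Cross-entropy between (pseudo-)labels and softmax outputs, batch mean."""
    total = 0.0
    for target, prob in zip(targets, probs):
        total -= sum(t * math.log(p + EPS) for t, p in zip(target, prob))
    return total / len(probs)


def regularizer_a(probs):
    """KL divergence between a uniform prior and the mean softmax output."""
    n, c = len(probs), len(probs[0])
    prior = 1.0 / c
    mean = [sum(p[k] for p in probs) / n for k in range(c)]
    return sum(prior * math.log(prior / (m + EPS)) for m in mean)


def regularizer_h(probs):
    """Average per-sample entropy of the softmax outputs."""
    return -sum(p * math.log(p + EPS) for prob in probs for p in prob) / len(probs)


def total_loss(targets, probs, lambda_a=0.8, lambda_h=0.4):
    """Cross-entropy plus the two weighted regularization terms."""
    return (cross_entropy(targets, probs)
            + lambda_a * regularizer_a(probs)
            + lambda_h * regularizer_h(probs))
```

In this sketch `regularizer_a` is close to zero when the batch-mean prediction is uniform and grows when all samples collapse onto one class, while `regularizer_h` is zero for one-hot outputs and largest for uniform ones.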
Network predictions are, of course, sometimes incorrect. This situation is reinforced when incorrect predictions are used as labels for unlabeled samples, as is the case in pseudo-labeling. Overfitting to incorrect pseudo-labels predicted by the network is known as confirmation bias. It is natural to think that reducing the confidence of the network by artificially changing the labels might alleviate this problem and improve generalization, as was already shown in the context of supervised learning. We therefore propose to introduce label noise by corrupting the label of a percentage of random unlabeled samples. We experimented with different random labels, including one-hot encodings that resulted in overfitting to the labeled data and softer alternatives that introduced additional hyperparameters to define the label distribution. Consequently, we decided to use a uniform distribution over all classes as noisy soft pseudo-label, which eliminates the need to select additional hyperparameters and has been shown to be effective in other contexts. Subsection 4.2 shows the effect of this label noise on reducing confirmation bias.
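The label noise described above can be sketched as follows: a fraction of the soft pseudo-labels is replaced by a uniform distribution over all classes. The function name `add_uniform_label_noise` and the way the noisy subset is drawn (a fresh random subset per call) are assumptions of this sketch; the text does not specify how often the subset is re-sampled.

```python
import random


def add_uniform_label_noise(pseudo_labels, noise_fraction, rng):
    """Replace a random fraction of soft pseudo-labels with a uniform distribution."""
    n = len(pseudo_labels)
    num_classes = len(pseudo_labels[0])
    num_noisy = int(round(noise_fraction * n))
    noisy_indices = set(rng.sample(range(n), num_noisy))
    uniform = [1.0 / num_classes] * num_classes
    noisy_labels = [list(uniform) if i in noisy_indices else list(label)
                    for i, label in enumerate(pseudo_labels)]
    return noisy_labels, noisy_indices


# Ten unlabeled samples, five classes, 20% label noise.
rng = random.Random(0)
labels = [[1.0, 0.0, 0.0, 0.0, 0.0] for _ in range(10)]
noisy_labels, noisy_indices = add_uniform_label_noise(labels, 0.2, rng)
```

Only the pseudo-labels of unlabeled samples would be passed to this function; ground-truth labels are left untouched.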
Recently, mixup data augmentation introduced a strong regularization technique that combines data augmentation with label corruption, which makes it potentially useful here. Mixup trains on convex combinations of sample pairs ($x_p$ and $x_q$) and corresponding labels ($y_p$ and $y_q$):

$$x = \delta x_p + (1 - \delta) x_q, \qquad y = \delta y_p + (1 - \delta) y_q,$$

where $\delta \in [0, 1]$ is randomly sampled from a beta distribution $Be(\alpha, \beta)$, with $\alpha = \beta$ (e.g. $\alpha = 1$ uniformly selects $\delta$). This combination regularizes the network to favor simple linear behavior between training samples, reducing oscillations in regions far from them. As shown in prior work, overconfidence in deep neural networks is a consequence of training on hard labels, and it is the label smoothing effect from randomly combining $y_p$ and $y_q$ during mixup training that reduces the confidence of predictions and significantly contributes to model calibration. Therefore, when moving to the semi-supervised context via pseudo-labeling, using soft labels and mixup reduces overfitting to model predictions, which is especially important for unlabeled samples whose predictions are used as soft labels. We experimentally show in Subsection 4.2 that mixup and label noise reduce confirmation bias and turn pseudo-labeling into a suitable alternative to consistency regularization methods for semi-supervised learning.
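A minimal sketch of mixing one pair of samples is shown below. The function name `mixup_pair` and the use of flat Python lists for images and labels are assumptions for illustration; in practice the pairs are typically formed inside a mini-batch.

```python
import random


def mixup_pair(x_p, y_p, x_q, y_q, alpha, rng):
    """Return a convex combination of two samples and of their (soft) labels."""
    delta = rng.betavariate(alpha, alpha)  # alpha = 1 gives a uniform delta
    x = [delta * a + (1.0 - delta) * b for a, b in zip(x_p, x_q)]
    y = [delta * a + (1.0 - delta) * b for a, b in zip(y_p, y_q)]
    return x, y, delta


# Two toy "images" with two pixels each and two-class labels.
rng = random.Random(0)
x, y, delta = mixup_pair([0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                         alpha=1.0, rng=rng)
```

Because the mixed label is a convex combination of two distributions, it remains a valid distribution, which is why mixup combines naturally with soft pseudo-labels.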
|Noise level (%)||0||5||10||15||20||40||60|
We use three image classification datasets, CIFAR-10, CIFAR-100, and Mini-ImageNet, to validate our approach. Part of the training images are labeled and the remaining are unlabeled. We report best error measures in an independent test set for CIFAR-10/100, while for Mini-ImageNet we report error in an independent test set (using the model from the last epoch).
CIFAR-10 and CIFAR-100 contain 10 and 100 classes, respectively, both with 50K color images for training and 10K for testing with resolution 32×32. We perform experiments with 50, 100, 200, and 400 labeled images per class in CIFAR-10, i.e. 0.5K, 1K, 2K, and 4K labeled images; and 40 and 100 labeled images per class in CIFAR-100, i.e. 4K and 10K labeled images. For each experiment, we randomly select 10 (3) different splits for CIFAR-10 (100) and report mean and standard deviation. We use the well-known “13-layer network” architecture for CIFAR-10/100. However, we omit dropout as it gives inferior results (see Subsection 4.2).
We emulate the semi-supervised learning setup for Mini-ImageNet (a subset of the well-known ImageNet dataset) used in prior work. Train and test sets of 100 classes and 600 color images per class with resolution 84 × 84 are selected from ImageNet. 500 (100) images per class are kept for train (test) splits. The train and test sets therefore contain 50K and 10K images. As with CIFAR-100, we experiment with 40 and 100 labeled images per class, i.e. 4K and 10K labeled images. We randomly select 3 different splits for each experiment and report mean and standard deviation using the standard ResNet-18 architecture.
We use the typical configuration for CIFAR-10 and CIFAR-100 and the same for Mini-ImageNet. Images are normalized using the dataset mean and standard deviation, and subsequent data augmentation by random horizontal flips and random 2 (6) pixel translations is applied in CIFAR (Mini-ImageNet). We train using SGD with a momentum of 0.9, weight decay, and a batch size of 100. Training always starts with a high learning rate (0.1 in CIFAR and 0.2 in Mini-ImageNet), dividing it by ten twice during training. We always train the model for 400 epochs (reducing the learning rate in epochs 250 and 350) and use a 10-epoch warm-up with labeled data. Unlike prior work [13, 31], we do not normalize the input images with ZCA, nor add Gaussian noise to the input images, as such operations gave inferior performance in our experiments (a similar observation has been reported in prior work). For the regularization weights of Eq. 4 ($\lambda_A$ and $\lambda_H$) we do not attempt careful tuning and just set them to 0.8 and 0.4. Finally, for stochastic weight averaging we store models every 5 epochs for the last 50 epochs (i.e. we average 10 models in the epochs with lowest learning rate).
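The learning rate schedule and the epochs at which models are stored for weight averaging can be written down as a short sketch. The function names, the 1-based epoch numbering, and the choice to apply each reduction starting at the milestone epoch itself are assumptions of this sketch.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(250, 350), factor=0.1):
    """Step schedule: divide the learning rate by ten at each milestone epoch."""
    lr = base_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr *= factor
    return lr


def swa_epochs(total_epochs=400, last=50, every=5):
    """Epochs (1-based) at which a model snapshot is stored for averaging."""
    start = total_epochs - last
    return [e for e in range(start + 1, total_epochs + 1)
            if (e - start) % every == 0]


snapshot_epochs = swa_epochs()  # epochs 355, 360, ..., 400
```

For Mini-ImageNet the same schedule would be called with `base_lr=0.2`.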
This section shows that label noise and mixup are effective techniques to improve performance of pseudo-labeling and subsequently demonstrates that they are effective regularizers to alleviate confirmation bias during training.
Naive pseudo-labeling leads to overfitting the network predictions and to high training accuracy in CIFAR-10 and CIFAR-100. Table 1 reports the effect of label noise and mixup in terms of test error for 1K labels in CIFAR-10 and 4K labels in CIFAR-100. Naive pseudo-labeling leads to an error of 29.95/48.97 for CIFAR-10/100 when training with cross-entropy (C) loss. This error can be reduced with label noise to 21.50 and 46.25, respectively. The same effect is also observed when using mixup (M), which reduces the error to 15.71 and 41.60. The combination of mixup and label noise further reduces the error to 12.73 and 39.60, thus leading to a remarkable overall error reduction of 17.22 and 9.37 points compared to naive pseudo-labeling.
We also experimented with dropout regularization added to mixup (M) with 0% noise due to its well-known utility in the supervised context. We add two dropout layers, as in the “13-layer network”, and test three dropout rates. Error in CIFAR-10 with each dropout value is 18.86, 20.37, and 22.97 (15.71 without dropout and 12.95 with 20% label noise), while in CIFAR-100 it is 41.19, 42.75, and 46.06 (41.60 without dropout and 39.63 with 20% label noise). These results suggest that label noise is a more effective regularizer than dropout in reducing confirmation bias. Our intuition here is that predictions need to be as good as possible to fully leverage unlabeled data with a pseudo-labeling approach. However, dropout destabilizes predictions in the forward pass, reducing pseudo-labeling performance. Unlike dropout, label noise is only applied to labels, meaning it directly attacks the confirmation bias problem at its source: the potentially incorrect labels.
Confirmation bias leads the network to dramatically increase the certainty of incorrect predictions during training. To demonstrate this behavior, we compute the average cross-entropy of the softmax output with a uniform distribution for all incorrectly predicted samples:

$$r_t = -\frac{1}{|X_t|} \sum_{x_i \in X_t} u^T \log\left(h_\theta(x_i)\right),$$

where $u$ is a uniform distribution across the classes and $X_t$ are the incorrect predictions in epoch $t$. A higher $r_t$ value denotes a higher certainty of predictions, which encourages confirmation bias. Figure 1 presents the $r_t$ value obtained for CIFAR-10 (a) and CIFAR-100 (b), showing that label noise and mixup are effective regularizers for reducing prediction certainty for incorrect predictions during training, i.e. confirmation bias is reduced. Note that cross-entropy (C) training with 20% noise has lower confirmation bias than mixup (M) with 0% noise, while the error for M is much lower. This result suggests that not every alternative for reducing confirmation bias leads to successful pseudo-labeling. For example, removing the entropy regularization term of Eq. 3 reduces the certainty of predictions, but at the cost of weak convergence (the error rate increases from 15.71 to 29.42 in CIFAR-10 with M: 0% label noise). Therefore, sharp soft pseudo-labels given by entropy regularization, label noise regularization, and mixup augmentation are all key to achieving successful pseudo-labeling for semi-supervised learning.
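The certainty measure described above can be sketched in a few lines: for every sample whose predicted class is wrong, take the cross-entropy between a uniform distribution and the softmax output, then average. The function name `certainty_on_errors`, the `EPS` guard, and returning 0.0 when there are no errors are assumptions of this sketch.

```python
import math

EPS = 1e-12  # assumption: numerical guard so that log(0) is never evaluated


def certainty_on_errors(probs, true_classes):
    """Average cross-entropy between a uniform distribution and the softmax
    output, computed only over incorrectly predicted samples."""
    wrong = [p for p, y in zip(probs, true_classes)
             if max(range(len(p)), key=p.__getitem__) != y]
    if not wrong:
        return 0.0
    num_classes = len(wrong[0])
    total = -sum(math.log(v + EPS) / num_classes for p in wrong for v in p)
    return total / len(wrong)


# Both predictions are wrong (true class is 1, predicted class is 0).
mildly_wrong = certainty_on_errors([[0.6, 0.4]], [1])
confidently_wrong = certainty_on_errors([[0.99, 0.01]], [1])
```

The measure is smallest (the logarithm of the number of classes) for a uniform output and grows as an incorrect prediction becomes more peaked.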
Furthermore, we explore two alternatives to exploit different snapshots of the network during training: stochastic weight averaging (SWA) and snapshot ensembles. Regarding SWA, we seek the estimation of an improved average model (see details in hyperparameters of Subsection 4.1) as recently proposed by Athiwaratkun et al. for consistency regularization methods. Despite not clearly improving best performance for our pseudo-labeling approach (13.04 vs 12.95 for 1K labels in CIFAR-10 and 39.51 vs 39.63 for 4K labels in CIFAR-100), it always provides a model very close to it, thus being effective for model selection in scenarios with scarce annotations and small validation sets, e.g. the semi-supervised one. Regarding the model snapshots during training, results were close to the best model from Table 1, obtaining 13.11 (best model: 12.95) in CIFAR-10 and 40.07 (best model: 39.63) in CIFAR-100 when ensembling 20 models from the last 40 epochs (even epochs). Ensembles of models with similar performance lead to a boost in performance as uncertainty in predictions helps to decorrelate errors, but overall weaker models decrease ensemble accuracy.
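The difference between the two alternatives can be sketched as follows: weight averaging combines parameters into a single model, whereas a snapshot ensemble combines the predictions of several models. Both function names are hypothetical, and averaging softmax outputs before taking the argmax is an assumption; the text does not state how the ensemble predictions are combined.

```python
def average_weights(snapshots):
    """Weight averaging: element-wise mean of the stored parameter vectors."""
    n = len(snapshots)
    return [sum(values) / n for values in zip(*snapshots)]


def ensemble_predict(snapshot_probs):
    """Snapshot ensemble: average the softmax outputs of all models for each
    sample, then predict the class with the highest mean probability."""
    n_models = len(snapshot_probs)
    predictions = []
    for per_sample in zip(*snapshot_probs):  # one softmax output per model
        mean = [sum(values) / n_models for values in zip(*per_sample)]
        predictions.append(max(range(len(mean)), key=mean.__getitem__))
    return predictions


# Two snapshots, two samples, two classes (toy values).
model_a = [[0.6, 0.4], [0.2, 0.8]]
model_b = [[0.3, 0.7], [0.1, 0.9]]
predicted = ensemble_predict([model_a, model_b])
averaged = average_weights([[1.0, 2.0], [3.0, 4.0]])
```

Weight averaging yields one network at test time, while the ensemble requires a forward pass through every stored snapshot.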
Table 2: Test error (%) in CIFAR-10, mean ± standard deviation.
|Labeled images||500||1000||2000||4000|
|Fully supervised (C)*||49.83 ± 1.52||39.31 ± 0.75||28.72 ± 0.69||20.11 ± 0.56|
|Fully supervised (M)*||40.37 ± 0.87||30.18 ± 0.41||22.55 ± 0.35||16.20 ± 0.2|
|Consistency regularization methods|
|Π-model||-||-||-||12.36 ± 0.31|
|TE||-||-||-||12.16 ± 0.24|
|MT||27.45 ± 2.64||19.04 ± 0.51||14.35 ± 0.31||11.41 ± 0.25|
|VAT + EntMin||-||-||-||10.55 ± 0.05|
|Π-model + SNTG||-||21.23 ± 1.27||14.65 ± 0.31||11.00 ± 0.13|
|TE + VAT + SNTG||-||-||-||9.89 ± 0.34|
|MA-DNN||-||-||-||11.91 ± 0.22|
|Deep-Cotraining (2 views)||-||-||-||9.03 ± 0.18|
|MT + TSSDL||-||18.41 ± 0.92||13.54 ± 0.32||9.30 ± 0.55|
|MT + Label propagation||24.02 ± 2.44||16.93 ± 0.70||13.22 ± 0.29||10.61 ± 0.28|
|MT + CCL||-||16.99 ± 0.71||12.57 ± 0.47||10.63 ± 0.22|
|MT + fast-SWA||-||15.58 ± 0.12||11.02 ± 0.23||9.05 ± 0.21|
|ICT||-||15.48 ± 0.78||9.26 ± 0.09||7.29 ± 0.02|
|Pseudo-labeling methods|
|TSSDL||-||21.13 ± 1.17||14.65 ± 0.33||10.90 ± 0.23|
|Label propagation||32.40 ± 1.80||22.02 ± 0.88||15.66 ± 0.35||12.69 ± 0.29|
|Ours (M+20% noise)*||14.07 ± 0.49||12.63 ± 0.54||9.21 ± 0.58||7.09 ± 0.14|
We compare our pseudo-labeling approach using mixup and label noise against related work that makes use of the “13-layer network” in CIFAR-10 and CIFAR-100. Tables 2 and 3 report results in CIFAR-10 for 0.5K, 1K, 2K, and 4K labels, and in CIFAR-100 for 4K and 10K labels. The tables divide methods into those based on consistency regularization and those based on pseudo-labeling. Note that we include pseudo-labeling approaches combined with consistency regularization ones (e.g. mean teachers (MT)) in the consistency regularization set. With 20% label noise, the proposed approach outperforms consistency regularization methods, as well as other purely pseudo-labeling approaches and their combination with consistency regularization methods, in CIFAR-10 and CIFAR-100. In particular, the results obtained in CIFAR-10 are on par with ICT for 2000 and 4000 labels and better for 1000 labels. Note that MT + fast-SWA requires training for 1200 epochs to obtain its best result, whereas we train for just 400. When trained for 480 epochs in CIFAR-10, it obtains an error of 15.58 ± 0.12 for 1K labels, a result far from our 12.63 ± 0.54. We have also explored the scenario of 0.5K labels in CIFAR-10 (100:1), obtaining 14.07 ± 0.49. However, to successfully train in this scenario with our method it is necessary to ensure a minimum number (8) of labeled samples per mini-batch (a typical practice in extreme cases), as not doing so led to 26.1% error. These results demonstrate the generalization of the proposed approach as compared to other methods that fail when decreasing the number of labels. Furthermore, Table 3 demonstrates that the proposed approach successfully scales to higher resolution images, obtaining a margin of over 10 points on the best related work in Mini-ImageNet.
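One way to guarantee a minimum number of labeled samples in every mini-batch is sketched below. This is a hypothetical sampler, not the authors' implementation: the function name `sample_batch`, the index-based interface, and drawing the remainder of the batch from all other samples are assumptions.

```python
import random


def sample_batch(labeled_idx, unlabeled_idx, batch_size=100, min_labeled=8, rng=None):
    """Draw a mini-batch of indices containing at least `min_labeled` labeled samples."""
    rng = rng or random.Random()
    forced = rng.sample(labeled_idx, min_labeled)  # guaranteed labeled samples
    forced_set = set(forced)
    # The rest of the batch comes from every sample not already chosen.
    pool = [i for i in labeled_idx if i not in forced_set] + list(unlabeled_idx)
    rest = rng.sample(pool, batch_size - min_labeled)
    batch = forced + rest
    rng.shuffle(batch)
    return batch


# CIFAR-10 with 0.5K labels: 500 labeled and 49,500 unlabeled training images.
labeled = list(range(500))
unlabeled = list(range(500, 50000))
batch = sample_batch(labeled, unlabeled, rng=random.Random(0))
```

With only 1% of the data labeled, purely random batches of 100 would contain about one labeled sample on average, which is the situation this kind of sampler avoids.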
Table 3: Test error (%) in CIFAR-100 and Mini-ImageNet, mean ± standard deviation.
|Dataset||CIFAR-100||CIFAR-100||Mini-ImageNet||Mini-ImageNet|
|Labeled images||4000||10000||4000||10000|
|Fully supervised (C)*||59.81 ± 1.06||43.35 ± 0.29||77.42 ± 0.71||64.47 ± 0.33|
|Fully supervised (M)*||54.49 ± 0.53||41.14 ± 0.26||73.44 ± 0.45||60.28 ± 0.31|
|Consistency regularization methods|
|Π-model||-||39.19 ± 0.36||-||-|
|TE||-||38.65 ± 0.51||-||-|
|MT||45.36 ± 0.49||36.08 ± 0.51||72.51 ± 0.22||57.55 ± 1.11|
|Π-model + SNTG||-||37.97 ± 0.29||-||-|
|MA-DNN||-||34.51 ± 0.61||-||-|
|Deep-Cotraining (2 views)||-||38.77 ± 0.28||-||-|
|MT + CCL||-||34.81 ± 0.52||-||-|
|MT + Label propagation||43.73 ± 0.20||35.92 ± 0.47||72.78 ± 0.15||57.35 ± 1.66|
|MT + fast-SWA||-||34.10 ± 0.31||-||-|
|Pseudo-labeling methods|
|Label propagation||46.20 ± 0.76||38.43 ± 1.88||70.29 ± 0.81||57.58 ± 1.47|
|Ours (M+20% noise)*||39.67 ± 0.13||31.00 ± 0.25||59.05 ± 0.32||44.06 ± 0.17|
It is worth noting several hyperparameters that require further study to fully explore the capabilities of this approach: the regularization weights $\lambda_A$ and $\lambda_H$ from Eq. 4, the beta distribution parameter $\alpha$ for mixup, and the level and type of label noise. The relationship between number of labels, dataset complexity, and the level of label noise is also worth exploring. However, we believe it is already interesting that a relatively straightforward modification of pseudo-labeling, designed to tackle confirmation bias, is a competitive approach to semi-supervised learning without requiring consistency regularization, and that future work should take this into account.
This paper presented a semi-supervised learning approach for image classification based on pseudo-labeling. We proposed to directly use the network predictions as soft pseudo-labels for unlabeled data and apply label noise and mixup as regularizers to prevent confirmation bias. This simple approach outperforms related work in the CIFAR-10/100 and Mini-ImageNet datasets, demonstrating that pseudo-labeling is a suitable alternative to the dominant approach in recent literature: consistency regularization. The proposed approach is, to the best of our knowledge, both simpler and more accurate than all other recent approaches. Future research will explore careful hyperparameter selection and larger-scale datasets.