
Pseudo-Labeling and Confirmation Bias in Deep Semi-Supervised Learning

by Eric Arazo et al.

Semi-supervised learning, i.e. jointly learning from labeled and unlabeled samples, is an active research topic due to its key role in relaxing human annotation constraints. In the context of image classification, recent advances in learning from unlabeled samples are mainly focused on consistency regularization methods that encourage invariant predictions for different perturbations of unlabeled samples. We, conversely, propose to learn from unlabeled data by generating soft pseudo-labels from the network predictions. We show that naive pseudo-labeling overfits to incorrect pseudo-labels due to the so-called confirmation bias, and demonstrate that label noise and mixup augmentation are effective regularization techniques for reducing it. The proposed approach achieves state-of-the-art results in CIFAR-10/100 and Mini-ImageNet despite being much simpler than other state-of-the-art methods. These results demonstrate that pseudo-labeling can outperform consistency regularization methods, whereas the opposite was assumed in previous work. Source code is available at <>.






1 Introduction

Convolutional neural networks (CNNs) have become the dominant approach for many computer vision tasks [8, 17, 2, 18, 38, 11, 37]. The main requirement to best exploit them is the availability of vast amounts of labeled data [4]. Obtaining such volumes of data, however, is not trivial, and the research community is exploring alternatives to alleviate this [15, 35, 23, 19].

Knowledge transfer via deep domain adaptation [35] is a popular alternative that seeks to learn more transferable representations from source to target domains by embedding domain adaptation in the learning pipeline. Other approaches focus exclusively on learning useful representations from scratch in a target domain when annotation constraints are relaxed [23, 9, 5]. Semi-supervised learning [23] focuses on scenarios with sparsely labeled data and extensive amounts of unlabeled data; learning with label noise [9] seeks robust learning when labels are obtained automatically and may not represent the image content; and self-supervised learning [5] exploits supervisory signals derived from the data itself to learn from unlabeled data in a supervised manner. This paper focuses on semi-supervised learning for image classification, a recently very active research area [16].

Semi-supervised learning is a transversal task for different domains including images [23], audio [40], time series [6], and text [21]. Recent approaches in image classification are primarily focused on exploiting the consistency of predictions for the same sample under different perturbations (consistency regularization) [27, 16], while other approaches directly generate labels for the unlabeled data to guide the learning process (pseudo-labeling) [14, 10]. Consistency regularization and pseudo-labeling approaches apply different strategies such as a warm-up phase where training is performed primarily on labeled data [27, 13, 31, 25, 16, 10], uncertainty weighting [28, 16], adversarial attacks [22, 25, 10], or graph-consistency [20, 10]. These strategies deal with confirmation bias [16], also known as the noise accumulation problem [40]. This bias stems from using incorrect predictions on unlabeled data for training in subsequent epochs, thereby increasing confidence in incorrect predictions and producing a model that tends to resist new changes.

This paper explores pseudo-labeling for semi-supervised deep learning from the network predictions and shows that simple modifications to prevent confirmation bias lead to state-of-the-art performance for semi-supervised learning in CIFAR and Mini-ImageNet [34]. We adopt an approach similar to the one proposed in [30] for relabeling in the context of label noise and apply it exclusively to unlabeled samples. Experiments show that this naive pseudo-labeling is limited by confirmation bias, as prediction errors are fit by the network (see Figure 1(a)). To deal with this issue, we propose to use random label noise (i.e. randomly modifying the pseudo-labels of unlabeled samples) as an effective regularization to alleviate confirmation bias. We find that combining mixup augmentation and random label noise further prevents this bias (see Subsection 4.2) and achieves state-of-the-art results (see Subsection 4.3). The proposed method does not require multiple networks to achieve state-of-the-art results [31, 25, 16], does not require over a thousand epochs of training to achieve peak performance in every dataset [1], and does not need multiple (ten) forward passes per sample during training [16]. Compared to other pseudo-labeling approaches, the proposed approach is simpler, in that it does not require graph construction and diffusion [10] or combination with consistency regularization methods, but still achieves state-of-the-art results [28]. Additionally, we are the first to show that pseudo-labeling can be a valuable alternative for semi-supervised learning, as opposed to previous results in the state-of-the-art [23].

2 Related work

Semi-supervised learning for image classification is an active research topic [23]; this section focuses on reviewing work closely related to ours, discussing methods that use deep learning with mini-batch optimization over large image collections. Previous work on semi-supervised deep learning differs in whether it uses consistency regularization or pseudo-labeling to learn from the unlabeled set [10], while all methods share the use of a cross-entropy loss (or similar) on labeled data.

Consistency regularization

Methods based on this idea impose a simple assumption on the unlabeled data in the training objective: the same sample under different perturbations must produce the same output. This idea was used in [27], where randomized data augmentation, dropout, and random max-pooling are applied while forcing softmax predictions to be similar. A similar idea is applied by the so-called Π-model [13], which also extends the perturbation to different epochs during training, i.e. the current prediction for a sample has to be similar to an ensemble of predictions of the same sample in the past. Here the different perturbations come from networks at different states, dropout, and data augmentation. In [31], this temporal ensembling method is interpreted as a teacher-student problem where the network is both a teacher that produces targets for the unlabeled data via temporal ensembling, and a student that learns the generated targets by imposing the consistency regularization. [31] naturally re-defines the problem to deal with confirmation bias by separating the teacher and the student: the teacher is defined as a different network with a similar architecture whose parameters are updated as an exponential moving average of the student network weights during training. This method is extended in [16], where an uncertainty weight is applied over the unlabeled samples to incrementally learn from those with low uncertainty, with uncertainty defined as the variance or entropy of the predictions for each sample under random perturbations. Additionally, [22] uses Virtual Adversarial Training (VAT) to carefully introduce perturbations to data samples as adversarial noise and later impose consistency regularization on the predictions. More recently, Luo et al. [20] propose to use a contrastive loss on the predictions as a regularization that forces predictions to be similar (different) when they are from the same (different) class. This method extends the consistency regularization previously considered only between perturbations of the same data sample [13, 31, 22] to in-between different samples, and can be naturally combined with [31] or [22] to boost their performance. Similarly, [33] proposes Interpolation Consistency Training (ICT), a method inspired by mixup [39] that encourages predictions at interpolated unlabeled samples to be consistent with the interpolated predictions of the individual samples. Additionally, the authors adopt the mean teacher [31] to estimate the targets used in the consistency regularization of unlabeled samples.

Co-training [25] combines several ideas from the previous works, using two (or more) networks trained simultaneously to agree in their predictions (consistency regularization) and disagree in their errors. Here the errors are defined as making different predictions when exposed to adversarial attacks, thus forcing different networks to learn complementary representations for the same samples. Recently, Chen et al. [3] measure the consistency between the current prediction and an additional prediction of the same sample given by an external memory module that keeps track of previous representations of a sample. They additionally introduce an uncertainty weighting of the consistency term to reduce the contribution of uncertain sample predictions given by the memory module. Consistency regularization methods such as the Π-model [13], mean teachers [31], and VAT [22] have all been shown to benefit from the recent stochastic weight averaging (SWA) method [24, 1]. SWA averages network parameters at different training epochs to move the SGD solution on the borders of flat loss regions to their center and improve generalization.


Pseudo-labeling

These methods seek the generation of labels or pseudo-labels for unlabeled samples to guide the learning process. An early attempt at pseudo-labeling proposed in [14] uses the network predictions as labels. However, they constrain the pseudo-labeling to a fine-tuning stage, i.e. there is a pre-training or warm-up, as with the consistency regularization approaches. A recent pseudo-labeling approach proposed in [28] uses the network class prediction as hard labels for the unlabeled samples. They also introduce an uncertainty weight for each sample loss, which is higher for samples that have distant k-nearest neighbors in terms of feature representation distance. They further include a loss term to encourage intra-class compactness and inter-class separation, and a consistency term between samples under different perturbations. They combine their method with mean teachers [31] to achieve state-of-the-art performance. Finally, a recently published work [10] implements pseudo-labeling through graph-based label propagation. The method alternates between two steps: training from labeled and pseudo-labeled data, and using the representations of the trained network to build a nearest-neighbor graph on which label propagation is applied to refine hard pseudo-labels for unlabeled images. They further add an uncertainty score for every sample (based on softmax prediction entropy) and every class (based on class population) to deal, respectively, with network predictions not being equally confident over all unlabeled samples and with the class-imbalance problem.

It is important to highlight a widely used practice [27, 13, 31, 25, 16, 10]: a warm-up where labeled samples have a higher (or full) weight at the beginning of training to palliate the incorrect guidance of unlabeled samples early in training. The authors in [23] also reveal some limitations of current practices in semi-supervised learning, such as weak fully-supervised baselines, absence of comparison with transfer learning baselines, and excessive hyperparameter tuning on large validation sets (not available in real semi-supervised scenarios).

3 Pseudo-labeling

We formulate semi-supervised image classification as the task of learning a model h_θ(·) from a set of N training examples D. These samples are split into the unlabeled set D_u = {x_i}_{i=1}^{N_u} and the labeled set D_l = {(x_i, y_i)}_{i=1}^{N_l}, with y_i ∈ {0,1}^C being the one-hot encoding ground-truth label for C classes corresponding to x_i, and N = N_u + N_l. In our case, h_θ(·) is a CNN and θ represents the model parameters (weights and biases). As we seek to perform pseudo-labeling for the unlabeled samples, we assume that a pseudo-label ỹ_i is available for these samples. We can then reformulate the problem as training using D̃ = {(x_i, ỹ_i)}_{i=1}^{N}, ỹ_i being y_i for the labeled samples.

The CNN parameters θ can be fit by optimizing the categorical cross-entropy loss:

ℓ* = − Σ_{i=1}^{N} ỹ_i^T log(h_θ(x_i)),   (1)

where h_θ(x) are the softmax probabilities produced by the model and log(·) is applied element-wise. A key decision is how to generate the pseudo-labels ỹ for the N_u unlabeled samples. Previous approaches have used hard pseudo-labels (i.e. one-hot vectors), directly taking the network output class [14, 28] or the class estimated after applying label propagation on a nearest-neighbor graph [10]. We adopt the former approach, but use soft pseudo-labels, as we have seen this outperforms hard labels, confirming the observations in [30] in the context of relabeling when learning with label noise. In particular, we store the softmax predictions of the network in every mini-batch of an epoch and use them to modify the soft pseudo-labels of the unlabeled samples at the end of the epoch. We proceed as described from the second to the last training epoch, while in the first epoch we use the softmax predictions for the unlabeled samples from a model trained in a 10-epoch warm-up phase using labeled data.

Moreover, we use the two regularizations applied in [30] to improve convergence. The first regularization deals with the difficulty of converging at early training stages, when the network's predictions are mostly incorrect and the CNN tends to predict the same class for every sample to minimize the loss. Assignment of all samples to a single class is discouraged by adding the following regularization term:

R_A = Σ_{c=1}^{C} p_c log(p_c / h̄_c),   (2)

where p_c is the prior probability distribution for class c and h̄_c denotes the mean softmax probability of the model for class c across all samples in the dataset. As in [30], we assume a uniform distribution p_c = 1/C for the prior probabilities (A stands for all-classes regularization) and approximate h̄_c using mini-batches. The second regularization is needed to concentrate the probability distribution of each soft pseudo-label on a single class, thus avoiding the local optima in which the network might get stuck due to weak guidance:

R_H = − (1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} h_θ^c(x_i) log(h_θ^c(x_i)),   (3)

where h_θ^c(x_i) denotes the c-th class value of the softmax output, again using mini-batches (i.e. N is replaced by the mini-batch size) to approximate this term. This second regularization is the average per-sample entropy (H stands for entropy regularization), a well-known regularization technique in semi-supervised learning [7].

Finally, the total semi-supervised loss is:

ℓ = ℓ* + λ_A R_A + λ_H R_H,   (4)

where λ_A and λ_H control the contribution of each regularization term (we set them as in [30]: λ_A = 0.8 and λ_H = 0.4).
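As a concrete reference, the total loss of Eq. 4 can be sketched in NumPy as below. This is a minimal illustration under our own naming; in particular, averaging the cross-entropy over the mini-batch (rather than summing over the dataset) is an assumption of the sketch:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def semi_supervised_loss(logits, soft_labels, lam_a=0.8, lam_h=0.4, eps=1e-8):
    """Total loss: cross-entropy plus the two regularizers of Eqs. 2-3.

    logits:      (N, C) network outputs for a mini-batch
    soft_labels: (N, C) one-hot ground truth or soft pseudo-labels
    """
    probs = softmax(logits)
    # Cross-entropy against the (pseudo-)labels.
    ce = -np.mean(np.sum(soft_labels * np.log(probs + eps), axis=1))
    # R_A: KL between the uniform prior p_c = 1/C and the mean softmax
    # over the mini-batch; discourages collapsing onto a single class.
    c = probs.shape[1]
    mean_probs = probs.mean(axis=0)
    r_a = np.sum((1.0 / c) * np.log((1.0 / c) / (mean_probs + eps)))
    # R_H: average per-sample entropy; sharpens the soft pseudo-labels.
    r_h = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    return ce + lam_a * r_a + lam_h * r_h
```

Both regularizers are computed on the mini-batch as an approximation of the dataset-level quantities, as described above.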

3.1 Confirmation bias

Network predictions are, of course, sometimes incorrect. This situation is reinforced when incorrect predictions are used as labels for unlabeled samples, as is the case in pseudo-labeling. Overfitting to incorrect pseudo-labels predicted by the network is known as confirmation bias. It is natural to think that reducing the confidence of the network by artificially changing the labels might alleviate this problem and improve generalization, as was already shown in the context of supervised learning [36]. We therefore propose to introduce label noise by corrupting the labels of a percentage of random unlabeled samples. We experimented with different random labels, including one-hot encodings, which resulted in overfitting to the labeled data, and softer alternatives, which introduced additional hyperparameters to define the label distribution. Consequently, we decided to use a uniform distribution over all classes as the noisy soft pseudo-label, which eliminates the need to select additional hyperparameters and has been shown to be effective in other contexts [41]. Subsection 4.2 shows the effect of this label noise on reducing confirmation bias.
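A minimal sketch of this label-noise regularization, assuming the soft pseudo-labels are stored as a NumPy array (the function name and RNG handling are our own illustration):

```python
import numpy as np

def corrupt_pseudo_labels(pseudo_labels, noise_ratio=0.2, rng=None):
    """Replace a random fraction of soft pseudo-labels with a uniform
    distribution over the C classes.

    pseudo_labels: (N_u, C) current soft pseudo-labels of unlabeled data
    noise_ratio:   fraction of samples to corrupt (e.g. 0.2 for 20%)
    """
    rng = rng or np.random.default_rng()
    noisy = pseudo_labels.copy()
    n, c = noisy.shape
    k = int(round(noise_ratio * n))
    idx = rng.choice(n, size=k, replace=False)
    noisy[idx] = 1.0 / c  # uniform soft label: no class is preferred
    return noisy
```

The uniform replacement keeps each corrupted label a valid probability distribution while removing any class preference, which is why no extra hyperparameter is needed to shape the noise.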

Recently, mixup data augmentation [39] introduced a strong regularization technique that combines data augmentation with label corruption, which makes it potentially useful here. Mixup trains on convex combinations of sample pairs (x_p and x_q) and their corresponding labels (y_p and y_q):

x = δ x_p + (1 − δ) x_q,
y = δ y_p + (1 − δ) y_q,

where δ is randomly sampled from a beta distribution Be(α, α) (e.g. α = 1 uniformly selects δ in [0, 1]). This combination regularizes the network to favor simple linear behavior in-between training samples, reducing oscillations in regions far from them. As shown in [32], overconfidence in deep neural networks is a consequence of training on hard labels, and it is the label smoothing effect from randomly combining y_p and y_q during mixup training that reduces the confidence of predictions and significantly contributes to model calibration. Therefore, when moving to the semi-supervised context via pseudo-labeling, using soft labels and mixup reduces overfitting to model predictions, which is especially important for unlabeled samples whose predictions are used as soft labels. We experimentally show in Subsection 4.2 that mixup and label noise reduce confirmation bias and turn pseudo-labeling into a suitable alternative to consistency regularization methods for semi-supervised learning.

4 Experimental work

Dataset          CIFAR-10
Noise level (%)  0      5      10     15     20     40     60
C                29.95  35.62  34.46  31.87  29.87  21.50  44.87
M                15.71  14.89  14.26  12.73  12.95  13.74  45.04
Dataset          CIFAR-100
Noise level (%)  0      5      10     15     20     40     60
C                48.97  48.56  48.33  47.70  47.60  46.25  48.63
M                41.60  40.80  40.33  39.60  39.63  39.73  45.58
Table 1: CIFAR-10 and CIFAR-100 results for 1K and 4K labels, respectively, under different levels of label noise. Key: C: cross-entropy. M: mixup. Bold denotes best error for each experiment.

4.1 Datasets and training

We use three image classification datasets, CIFAR-10 [12], CIFAR-100 [12], and Mini-ImageNet [34], to validate our approach. Part of the training images are labeled and the rest are unlabeled. We report the best error on an independent test set for CIFAR-10/100, while for Mini-ImageNet we report the last-epoch error on an independent test set.

CIFAR-10 and CIFAR-100

These datasets contain 10 and 100 classes, respectively, both with 50K color images for training and 10K for testing at resolution 32×32. We perform experiments with 50, 100, 200, and 400 labeled images per class in CIFAR-10, i.e. 0.5K, 1K, 2K, and 4K labeled images; and 40 and 100 labeled images per class in CIFAR-100, i.e. 4K and 10K labeled images. For each experiment, we randomly select 10 (3) different splits for CIFAR-10 (100) and report mean and standard deviation. We use the well-known “13-layer network” architecture as in [1] for CIFAR-10/100. However, we omit dropout [29] as it gives inferior results (see Subsection 4.2).


Mini-ImageNet

We emulate the semi-supervised learning setup for Mini-ImageNet [34] (a subset of the well-known ImageNet [4] dataset) used in [10]. Train and test sets of 100 classes and 600 color images per class at resolution 84×84 are selected from ImageNet, as in [26]. 500 (100) images per class are kept for the train (test) split; the train and test sets therefore contain 50K and 10K images. As with CIFAR-100, we experiment with 40 and 100 labeled images per class, i.e. 4K and 10K labeled images. We randomly select 3 different splits for each experiment and report mean and standard deviation using the standard ResNet-18 architecture [8], as done in [10].


We use the typical configuration for CIFAR-10 and CIFAR-100 [13] and the same for Mini-ImageNet. Images are normalized using the dataset mean and standard deviation, with subsequent data augmentation [13] by random horizontal flips and random 2 (6) pixel translations in CIFAR (Mini-ImageNet). We train using SGD with a momentum of 0.9, a weight decay of 10⁻⁴, and a batch size of 100. Training always starts with a high learning rate (0.1 in CIFAR and 0.2 in Mini-ImageNet), divided by ten twice during training. We always train for 400 epochs (reducing the learning rate in epochs 250 and 350) and use a 10-epoch warm-up with labeled data. Unlike prior work [13, 31], we do not normalize the input images with ZCA, nor add Gaussian noise to the input images, as such operations gave inferior performance in our experiments (similarly reported by [10]). For the regularization weights λ_A and λ_H from Eq. 4 we do not attempt careful tuning and simply set them to 0.8 and 0.4, as done in [30]. Finally, for stochastic weight averaging [24] we store models every 5 epochs for the last 50 epochs (i.e. we average 10 models in the epochs with the lowest learning rate).
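The SWA step described above amounts to a plain average of the stored parameter snapshots. A schematic sketch, where the dict-of-arrays representation of a model's parameters is our own abstraction:

```python
import numpy as np

def swa_average(snapshots):
    """Stochastic weight averaging over stored parameter snapshots.

    snapshots: list of dicts mapping parameter name -> ndarray, e.g. one
    snapshot saved every 5 epochs over the last 50 (10 snapshots here).
    """
    avg = {}
    for name in snapshots[0]:
        # Element-wise mean of each parameter tensor across snapshots.
        avg[name] = np.mean([s[name] for s in snapshots], axis=0)
    return avg
```

Note that, in practice, SWA for networks with batch normalization also requires recomputing the batch-norm statistics for the averaged weights, which this sketch omits.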

4.2 Label noise and mixup effect on confirmation bias

Figure 1: Example of certainty of incorrect sample predictions during training when using 1K labeled images in CIFAR-10 (a) and 4K in CIFAR-100 (b). Cross-entropy (C) with 0% label noise shows high certainty, while mixup (M) and label noise reduce the certainty of predictions as shown by configurations M and C with 20% label noise. These regularizations therefore reduce confirmation bias, yielding a suitable semi-supervised learning approach.

This section shows that label noise and mixup are effective techniques to improve performance of pseudo-labeling and subsequently demonstrates that they are effective regularizers to alleviate confirmation bias during training.

Naive pseudo-labeling leads to overfitting the network predictions, as evidenced by high training accuracy in CIFAR-10 and CIFAR-100. Table 1 reports the effect of label noise and mixup in terms of test error for 1K labels in CIFAR-10 and 4K labels in CIFAR-100. Naive pseudo-labeling yields an error of 29.95/48.97 for CIFAR-10/100 when training with the cross-entropy (C) loss. This error can be reduced with label noise to 21.50 and 46.25, respectively. The same effect is observed when using mixup (M), which reduces the error to 15.71 and 41.60. The combination of mixup and label noise further reduces the error to 12.73 and 39.60, a remarkable overall error reduction of 17.22 and 9.37 points compared to naive pseudo-labeling.

We also experimented with dropout regularization [29] added to mixup (M) with 0% noise, due to its well-known utility in the supervised context. We add two dropout layers, as in the “13-layer network” [31], and test with three increasing dropout probabilities. The error in CIFAR-10 with each dropout value is 18.86, 20.37, and 22.97 (15.71 without dropout and 12.95 with 20% label noise), while in CIFAR-100 it is 41.19, 42.75, and 46.06 (41.60 without dropout and 39.63 with 20% label noise). These results suggest that label noise is a more effective regularizer than dropout for reducing confirmation bias. Our intuition here is that predictions need to be as good as possible to fully leverage unlabeled data with a pseudo-labeling approach; dropout destabilizes predictions in the forward pass, reducing pseudo-labeling performance. Unlike dropout, label noise is only applied to the labels, meaning it directly attacks the confirmation bias problem at its source: the potentially incorrect labels.

Confirmation bias leads the network to dramatically increase the certainty of incorrect predictions during training. To demonstrate this behavior we compute the average cross-entropy ℓ_u of the softmax output h_θ(x) with a uniform distribution u for all incorrectly predicted samples:

ℓ_u = − (1/|E_t|) Σ_{i∈E_t} u^T log(h_θ(x_i)),

where u is a uniform distribution across the C classes (u_c = 1/C) and E_t is the set of incorrect predictions in epoch t. A higher value denotes a higher certainty of predictions, which encourages confirmation bias. Figure 1 presents the value obtained for CIFAR-10 (a) and CIFAR-100 (b), showing that label noise and mixup are effective regularizers for reducing the certainty of incorrect predictions during training, i.e. confirmation bias is reduced. Note that cross-entropy (C) training with 20% noise has lower confirmation bias than mixup (M) with 0% noise, while the error for M is much lower. This result suggests that not every alternative for reducing confirmation bias leads to a successful pseudo-labeling. For example, removing the entropy regularization term of Eq. 3 reduces the certainty of predictions, but at the cost of weak convergence (the error rate increases from 15.71 to 29.42 in CIFAR-10 with M and 0% label noise). Therefore, sharp soft pseudo-labels given by entropy regularization, label noise regularization, and mixup augmentation are all key to achieving successful pseudo-labeling for semi-supervised learning.
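The certainty measure described above (the average cross-entropy of the softmax output against a uniform distribution, restricted to incorrectly predicted samples) can be computed as in the following sketch; function and variable names are our own:

```python
import numpy as np

def certainty_of_incorrect(probs, preds, targets, eps=1e-8):
    """Average cross-entropy between the softmax outputs and a uniform
    distribution, over the incorrectly predicted samples only.
    Higher values mean sharper (more certain) wrong predictions."""
    wrong = preds != targets
    if not wrong.any():
        return 0.0
    c = probs.shape[1]
    u = np.full(c, 1.0 / c)  # uniform reference distribution, u_c = 1/C
    return float(np.mean(-np.sum(u * np.log(probs[wrong] + eps), axis=1)))
```

For perfectly uniform (maximally uncertain) wrong predictions the measure approaches log C, its minimum, and it grows as the wrong predictions become sharper.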

Furthermore, we explore two alternatives to exploit different snapshots of the network during training: stochastic weight averaging (SWA) [24] and snapshot ensembles. Regarding SWA, we estimate an improved average model (see the hyperparameter details in Subsection 4.1), as recently proposed by Athiwaratkun et al. [1] for consistency regularization methods. Despite not clearly improving the best performance of our pseudo-labeling approach (13.04 vs 12.95 for 1K labels in CIFAR-10 and 39.51 vs 39.63 for 4K labels in CIFAR-100), it always provides a model very close to it, and is therefore effective for model selection in scarce-annotation scenarios with small validation sets, such as the semi-supervised one. Regarding snapshot ensembles, results were close to the best model from Table 1, obtaining 13.11 (best model: 12.95) in CIFAR-10 and 40.07 (best model: 39.63) in CIFAR-100 when ensembling 20 models from the last 40 epochs (even epochs). Ensembles of models with similar performance lead to a boost in performance, as uncertainty in predictions helps to decorrelate errors, but overall weaker models decrease ensemble accuracy.
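Snapshot ensembling, as used above, simply averages the per-snapshot predictions of each sample; a sketch under our own naming:

```python
import numpy as np

def snapshot_ensemble_predict(snapshot_probs):
    """Average the softmax outputs of several model snapshots.

    snapshot_probs: (S, N, C) array-like with the predictions of
    S snapshots for N samples over C classes.
    """
    # The mean of valid probability distributions is itself a valid
    # distribution, so no renormalization is needed.
    return np.asarray(snapshot_probs).mean(axis=0)
```

Unlike SWA, which averages the weights into a single model, this averages the outputs and therefore requires one forward pass per snapshot at inference time.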

Labeled images 500 1000 2000 4000
Fully supervised (C)* 49.83 ± 1.52 39.31 ± 0.75 28.72 ± 0.69 20.11 ± 0.56
Fully supervised (M)* 40.37 ± 0.87 30.18 ± 0.41 22.55 ± 0.35 16.20 ± 0.20
Consistency regularization methods
Π-model [13] - - - 12.36 ± 0.31
TE [13] - - - 12.16 ± 0.24
MT [31] 27.45 ± 2.64 19.04 ± 0.51 14.35 ± 0.31 11.41 ± 0.25
VAT + EntMin [22] - - - 10.55 ± 0.05
Π-model [13] + SNTG [20] - 21.23 ± 1.27 14.65 ± 0.31 11.00 ± 0.13
TE [13] + VAT [22] + SNTG [20] - - - 9.89 ± 0.34
MA-DNN [3] - - - 11.91 ± 0.22
Deep-Cotraining (2 views) [25] - - - 9.03 ± 0.18
MT [31] + TSSDL [28] - 18.41 ± 0.92 13.54 ± 0.32 9.30 ± 0.55
MT [31] + Label propagation [10] 24.02 ± 2.44 16.93 ± 0.70 13.22 ± 0.29 10.61 ± 0.28
MT [31] + CCL [16] - 16.99 ± 0.71 12.57 ± 0.47 10.63 ± 0.22
MT [31] + fast-SWA [1] - 15.58 ± 0.12 11.02 ± 0.23 9.05 ± 0.21
ICT [33] - 15.48 ± 0.78 9.26 ± 0.09 7.29 ± 0.02
Pseudo-labeling methods
TSSDL [28] - 21.13 ± 1.17 14.65 ± 0.33 10.90 ± 0.23
Label propagation [10] 32.40 ± 1.80 22.02 ± 0.88 15.66 ± 0.35 12.69 ± 0.29
Ours (M + 20% noise)* 14.07 ± 0.49 12.63 ± 0.54 9.21 ± 0.58 7.09 ± 0.14
Table 2: Test error in CIFAR-10 for the proposed approach using mixup (M) and 20% label noise regularization. (*) denotes that we have run the algorithm. Bold indicates lowest error. We report average and standard deviation of 10 runs with different labeled/unlabeled splits.

4.3 Comparison with the state-of-the-art

We compare our pseudo-labeling approach using mixup and label noise against related work that uses the “13-layer network” [31] in CIFAR-10 and CIFAR-100. Tables 2 and 3 report results in CIFAR-10 for 0.5, 1, 2, and 4K labels, and in CIFAR-100 for 4 and 10K labels. The tables divide methods into those based on consistency regularization and those based on pseudo-labeling; note that we include pseudo-labeling approaches combined with consistency regularization ones (e.g. mean teachers (MT)) in the consistency regularization set. With 20% label noise, the proposed approach clearly outperforms consistency regularization methods, as well as other purely pseudo-labeling approaches and their combinations with consistency regularization methods, in CIFAR-10 and CIFAR-100. In particular, the results obtained in CIFAR-10 are on par with ICT [33] for 2000 and 4000 labels and slightly better for 1000 labels. Note that the best result for [1] requires training for 1200 epochs, while we train for just 400; when [1] trains for 480 epochs in CIFAR-10 they obtain an error of 15.58 ± 0.12 for 1K labels, a result far from our 12.63 ± 0.54. We have also explored the scenario of 0.5K labels in CIFAR-10 (a 100:1 unlabeled-to-labeled ratio), obtaining 14.07 ± 0.49. However, to successfully train in this scenario with our method it is necessary to ensure a minimum number (8) of labeled samples per mini-batch (a typical practice in extreme cases); not doing so led us to 26.1% error. These results demonstrate the generalization of the proposed approach compared to other methods that fail when decreasing the number of labels. Furthermore, Table 3 demonstrates that the proposed approach successfully scales to higher-resolution images, obtaining an over 10 point margin over the best related work in Mini-ImageNet.

Dataset CIFAR-100 Mini-ImageNet
Labeled images 4000 10000 4000 10000
Fully supervised (C)* 59.81 ± 1.06 43.35 ± 0.29 77.42 ± 0.71 64.47 ± 0.33
Fully supervised (M)* 54.49 ± 0.53 41.14 ± 0.26 73.44 ± 0.45 60.28 ± 0.31
Consistency regularization methods
Π-model [13] - 39.19 ± 0.36 - -
TE [13] - 38.65 ± 0.51 - -
MT [31] 45.36 ± 0.49 36.08 ± 0.51 72.51 ± 0.22 57.55 ± 1.11
Π-model [13] + SNTG [20] - 37.97 ± 0.29 - -
MA-DNN [3] - 34.51 ± 0.61 - -
Deep-Cotraining (2 views) [25] - 38.77 ± 0.28 - -
MT + CCL [16] - 34.81 ± 0.52 - -
MT + Label propagation [10] 43.73 ± 0.20 35.92 ± 0.47 72.78 ± 0.15 57.35 ± 1.66
MT + fast-SWA [1] - 34.10 ± 0.31 - -
Pseudo-labeling methods
Label propagation [10] 46.20 ± 0.76 38.43 ± 1.88 70.29 ± 0.81 57.58 ± 1.47
Ours (M + 20% noise)* 39.67 ± 0.13 31.00 ± 0.25 59.05 ± 0.32 44.06 ± 0.17
Table 3: Test error in CIFAR-100 and in Mini-ImageNet for the proposed approach using mixup (M) and 20% label noise regularization. (*) denotes that we have run the algorithm. Bold indicates lowest error. We report average and standard deviation of 3 runs with different labeled/unlabeled splits.

It is worth noting several hyperparameters that require further study to fully explore the capabilities of this approach: the regularization weights λ_A and λ_H from Eq. 4, α for mixup, and the level and type of label noise. The relationship between the number of labels, dataset complexity, and the level of label noise is also worth exploring. However, we believe it is already interesting that a relatively straightforward modification of pseudo-labeling, designed to tackle confirmation bias, is a competitive approach to semi-supervised learning without requiring consistency regularization, and that future work should take this into account.

5 Conclusions

This paper presented a semi-supervised learning approach for image classification based on pseudo-labeling. We proposed to directly use the network predictions as soft pseudo-labels for unlabeled data, and to apply label noise and mixup as regularizers to prevent confirmation bias. This simple approach outperforms related work on the CIFAR-10/100 and Mini-ImageNet datasets, demonstrating that pseudo-labeling is a suitable alternative to the dominant approach in recent literature: consistency regularization. The proposed approach is, to the best of our knowledge, both simpler and more accurate than other recent approaches. Future research will explore careful hyperparameter selection and larger-scale datasets.


  • [1] B. Athiwaratkun, M. Finzi, P. Izmailov, and A.G. Wilson. There Are Many Consistent Explanations of Unlabeled Data: Why You Should Average. In International Conference on Learning Representations (ICLR), 2019.
  • [2] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In European Conference on Computer Vision (ECCV), 2018.
  • [3] Y. Chen, X. Zhu, and S. Gong. Semi-Supervised Deep Learning with Memory. In European Conference on Computer Vision (ECCV), 2018.
  • [4] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
  • [5] S. Gidaris, P. Singh, and N. Komodakis. Unsupervised Representation Learning by Predicting Image Rotations. In International Conference on Learning Representations (ICLR), 2018.
  • [6] M. González, C. Bergmeir, I. Triguero, Y. Rodríguez, and J.M. Benítez. Self-labeling techniques for semi-supervised time series classification: an empirical study. Knowledge and Information Systems, 55(2):493–528, 2018.
  • [7] Y. Grandvalet and Y. Bengio. Semi-supervised Learning by Entropy Minimization. In International Conference on Neural Information Processing Systems (NIPS), 2004.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [9] D. Hendrycks, M. Mazeika, D. Wilson, and K. Gimpel. Using Trusted Data to Train Deep Networks on Labels Corrupted by Severe Noise. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • [10] A. Iscen, G. Tolias, Y. Avrithis, and O. Chum. Label Propagation for Deep Semi-supervised Learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [11] C. Kim, F. Li, and J.M. Rehg. Multi-object Tracking with Neural Gating Using Bilinear LSTM. In European Conference on Computer Vision (ECCV), 2018.
  • [12] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • [13] S. Laine and T. Aila. Temporal Ensembling for Semi-Supervised Learning. In International Conference on Learning Representations (ICLR), 2017.
  • [14] D.-H. Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In International Conference on Machine Learning Workshops (ICMLW), 2013.
  • [15] W. Li, L. Wang, W. Li, E. Agustsson, and L. Van Gool. WebVision Database: Visual Learning and Understanding from Web Data. arXiv: 1708.02862, 2017.
  • [16] Y. Li, L. Liu, and R.T. Tan. Certainty-Driven Consistency Loss for Semi-supervised Learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [17] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal Loss for Dense Object Detection. In IEEE International Conference on Computer Vision (ICCV), 2017.
  • [18] W. Liu, R. Lin, Z. Liu, L. Liu, Z. Yu, B. Dai, and L. Song. Learning towards Minimum Hyperspherical Energy. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • [19] X. Liu, J. Van De Weijer, and A. D. Bagdanov. Exploiting Unlabeled Data in CNNs by Self-supervised Learning to Rank. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  • [20] Y. Luo, J. Zhu, M. Li, Y. Ren, and B. Zhang. Smooth Neighbors on Teacher Graphs for Semi-Supervised Learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [21] T. Miyato, A.M. Dai, and I. Goodfellow. Adversarial Training Methods for Semi-Supervised Text Classification. arXiv: 1605.07725, 2016.
  • [22] T. Miyato, S. Maeda, S. Ishii, and M. Koyama. Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • [23] A. Oliver, A. Odena, C.A. Raffel, E.D. Cubuk, and I. Goodfellow. Realistic Evaluation of Deep Semi-Supervised Learning Algorithms. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • [24] P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A.G. Wilson. Averaging Weights Leads to Wider Optima and Better Generalization. In Uncertainty in Artificial Intelligence (UAI), 2018.
  • [25] S. Qiao, W. Shen, Z. Zhang, B. Wang, and A. Yuille. Deep Co-Training for Semi-Supervised Image Recognition. In European Conference on Computer Vision (ECCV), 2018.
  • [26] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017.
  • [27] M. Sajjadi, M. Javanmardi, and T. Tasdizen. Regularization With Stochastic Transformations and Perturbations for Deep Semi-Supervised Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
  • [28] W. Shi, Y. Gong, C. Ding, Z. Ma, X. Tao, and N. Zheng. Transductive Semi-Supervised Deep Learning using Min-Max Features. In European Conference on Computer Vision (ECCV), 2018.
  • [29] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
  • [30] D. Tanaka, D. Ikami, T. Yamasaki, and K. Aizawa. Joint Optimization Framework for Learning with Noisy Labels. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [31] A. Tarvainen and H. Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
  • [32] S. Thulasidasan, G. Chennupati, J. Bilmes, T. Bhattacharya, and S. Michalak. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. arXiv: 1905.11001, 2019.
  • [33] V. Verma, A. Lamb, J. Kannala, Y. Bengio, and D. Lopez-Paz. Interpolation consistency training for semi-supervised learning. In International Joint Conference on Artificial Intelligence (IJCAI), 2019.
  • [34] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching Networks for One Shot Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
  • [35] M. Wang and W. Deng. Deep visual domain adaptation: A survey. Neurocomputing, 312:135–153, 2018.
  • [36] L. Xie, J. Wang, Z. Wei, M. Wang, and Q. Tian. DisturbLabel: Regularizing CNN on the Loss Layer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [37] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification. In European Conference on Computer Vision (ECCV), September 2018.
  • [38] H. Yao, S. Zhang, R. Hong, Y. Zhang, C. Xu, and Q. Tian. Deep Representation Learning With Part Loss for Person Re-Identification. IEEE Transactions on Image Processing, 28(6):2860–2871, 2019.
  • [39] H. Zhang, M. Cisse, Y.N. Dauphin, and D. Lopez-Paz. mixup: Beyond Empirical Risk Minimization. In International Conference on Learning Representations (ICLR), 2018.
  • [40] Z. Zhang, F. Ringeval, B. Dong, E. Coutinho, E. Marchi, and B. Schüller. Enhanced semi-supervised learning for multimodal emotion recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
  • [41] Z. Zheng, L. Zheng, and Y. Yang. Unlabeled Samples Generated by GAN Improve the Person Re-Identification Baseline in Vitro. In IEEE International Conference on Computer Vision (ICCV), 2017.