
Active Self-Semi-Supervised Learning for Few Labeled Samples Fast Training

Faster training and fewer annotations are two key issues in applying deep models to practical domains. Semi-supervised learning has achieved great success in training with few annotations, but low-quality labeled samples produced by random sampling make it difficult to reduce the number of annotations further. In this paper we propose an active self-semi-supervised training framework that bootstraps semi-supervised models with good prior pseudo-labels, where the priors are obtained by label propagation over self-supervised features. Because the accuracy of the priors is affected not only by the quality of the features but also by the selection of the labeled samples, we develop active learning and label propagation strategies to obtain better prior pseudo-labels. Consequently, our framework greatly improves the performance of models trained with few annotations and greatly reduces training time. Experiments on three semi-supervised learning benchmarks demonstrate its effectiveness: our method matches the accuracy of standard semi-supervised approaches in about 1/3 of the training time, and even outperforms them when fewer annotations are available (84.10% on CIFAR-10 with 10 labels).





1 Introduction

Part of the success of deep learning models stems from large amounts of labeled data[13]. However, the high cost of acquiring such data hinders the widespread application of deep learning models, especially in fields that require expert annotations, like medical images[37] or marine biology images[1, 8]. Semi-supervised learning is one of the most powerful approaches for drastically reducing the labeled data required[33, 42]. The basic idea is to select confident model predictions as pseudo-labels to update the model's parameters. Recent works[33, 42] have shown that semi-supervised learning can achieve accuracy similar to supervised learning with far fewer annotations.

Existing semi-supervised learning with randomly sampled labeled sets suffers from two problems: difficulty in further reducing the number of labels and an extremely high computational burden. As shown in Fig. 1(a), even the state-of-the-art method Flexmatch[42] experiences a sharp drop in accuracy when the number of annotations decreases. Besides, semi-supervised learning adds considerable computational burden relative to other label-efficient methods, as shown in Fig. 1(b). For example, even on the small CIFAR-10 dataset, standard semi-supervised training takes over one week on a single GPU[22], roughly 100 times the cost of supervised training. In this paper, we therefore aim to improve model performance with fewer labels and to speed up training.

(a) Accuracy of the SOTA semi-supervised algorithm Flexmatch trained with different numbers of annotations
(b) Training time of different label-efficient algorithms, assuming active learning consists of 4 annotation-training loops
Figure 1: Weaknesses of existing semi-supervised learning algorithms: a steep performance drop when reducing annotations and a high computational burden. Experiments were performed on CIFAR-10 with the WRN-28-2 architecture; supervised learning settings follow [39] and semi-supervised learning settings follow [42]. Because selecting samples in active learning takes much less time than training a model, we estimate the total time by multiplying the single-round model training time by the number of active learning rounds (assumed to be 4).

These two phenomena may be caused by poor pseudo-labels in the early stage of semi-supervised training. The result of semi-supervised learning is closely tied to the quality of the pseudo-labels. Current semi-supervised models are trained from scratch, so pseudo-labels are poor early in training. When there are enough labeled samples, the supervised loss can gradually guide the model to improve its pseudo-labels, but with few labeled samples this constraint becomes weak, making it hard for the semi-supervised model to correct its pseudo-labels. Furthermore, improving pseudo-labels from scratch requires a large number of training iterations.

To address these issues, we propose Active Self-Semi-Supervised Learning (AS3L), which guides semi-supervised learning with prior pseudo-labels. AS3L performs label propagation on self-supervised features to generate good pseudo-labels that bootstrap semi-supervised learning. Since the quality of the pseudo-labels depends not only on the quality of the features but also on the choice of labeled samples, especially with few annotations, we develop an active learning strategy that selects labeled samples so as to produce accurate pseudo-labels. Furthermore, to make better use of these pseudo-labels, we design a training mechanism that integrates the priors with model predictions in the early training stage and later updates or removes the priors.

The contributions of our work are summarized as follows:

1. We propose the AS3L framework that extends the success of semi-supervised learning to cases with fewer labels. Our proposed method consistently outperforms state-of-the-art algorithms with fewer annotations.

2. We provide a new idea for accelerating semi-supervised training: bootstrapping with good prior pseudo-labels. Specifically, we note that AS3L consistently achieves about a 3x speedup, with improved accuracy in most cases.

3. We develop an active learning strategy that is tightly coupled to the AS3L framework, which can greatly improve the quality of the obtained prior pseudo-labels when there are few labeled samples, thereby helping AS3L work well with fewer labels and train quickly.

2 Related Work

2.1 Active Learning

2.1.1 Active Learning Strategy for Supervised Learning

The goal of active learning is to find and annotate the most valuable samples so that the model performs better with limited annotations. Most active sampling strategies work in an annotation-training loop: select and annotate samples, then retrain the model with the labeled samples, until the annotation budget is exhausted. Max entropy[24], max-margin[4] and BALD[14] are representative uncertainty-based methods; they find samples about which the model is not confident. K-medoids[31], Coreset[31] and Wass[25] are typical diversity-based strategies; they sample a subset that covers the feature space as well as possible. Hybrid strategies select samples that account for both diversity and uncertainty: suggestive annotation[37] constructs a candidate set with high uncertainty and then selects diverse samples from it, and BADGE[2] samples in a gradient embedding space. All of these strategies operate in the context of supervised learning, training the model with only labeled samples, so their performance is weaker than strategies working in the context of semi-supervised learning, which trains with all samples. In this paper, we propose an active strategy in the semi-supervised setting to obtain better performance.
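As a concrete illustration of the uncertainty-based family mentioned above, a max-entropy selection step can be sketched as follows. This is a generic sketch, not code from any of the cited papers; `probs` is assumed to hold the model's softmax predictions for the unlabeled pool.

```python
import numpy as np

def max_entropy_select(probs: np.ndarray, budget: int) -> np.ndarray:
    """Pick the `budget` unlabeled samples whose predictive
    distributions have the highest entropy (most uncertain).
    probs: (N, C) array of softmax outputs."""
    eps = 1e-12  # avoid log(0)
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    # indices of the most uncertain samples, most uncertain first
    return np.argsort(entropy)[-budget:][::-1]
```

In a multi-round strategy this selection would run once per annotation-training loop, which is exactly the repeated-retraining cost the single-shot strategies below try to avoid.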

2.1.2 Active strategy for Semi-Supervised Learning

Active learning strategies based on adversarial learning also use unlabeled data for training, but their ability to exploit unlabeled data is weaker than that of existing semi-supervised models, so their performance still falls far short of semi-supervised learning[32, 34, 41]. Some researchers have tried to combine existing semi-supervised learning directly with active strategies developed for supervised learning, but the results were unexpectedly poor, even worse than a random selection baseline[27]. To bridge this gap, new active strategies have been designed for the semi-supervised setting and are more tightly coupled with semi-supervised learning. Consistency-based methods[15] argue that we should choose samples that are hard for semi-supervised models, i.e., samples with inconsistent predictions under data augmentation. The authors of [18] proposed combining adversarial training with graph-based label propagation to select high-uncertainty samples close to cluster boundaries. However, these are multi-shot strategies, which implies an unbearably high computational burden in semi-supervised learning scenarios.

2.1.3 Single-shot Active strategy

A single-shot active learning strategy requests all labels in a single round, which can deliver the benefits of active learning with little extra computational burden. Until now, little attention has been paid to this setting. Pseudo-annotator[38] trains multiple models on randomly guessed labels to build a single-shot active learning strategy. Reducing the number of annotation interactions by increasing model training cost is not feasible in the deep learning era, because training deep models is much more expensive than training traditional machine learning models such as SVMs. A diversity-based sampling strategy that selects samples in the self-supervised feature space is proposed in [35]. Although that work is similar to ours, its strategy follows traditional diversity criteria, and the active learning and semi-supervised learning parts are independent of each other. In contrast, our method builds a tightly coupled active self-semi-supervised learning framework and constructs an active learning strategy from the perspective of improving the accuracy of prior pseudo-labels, reducing both labeling and training costs.

2.2 Semi-Supervised Learning

2.2.1 Semi-Supervised Learning from Scratch

Semi-supervised training typically exploits unlabeled samples with consistency regularization[7, 6, 36, 33] and pseudo-labeling techniques[23, 30], which force the model to predict consistently across differently augmented versions of a sample. Conventional semi-supervised learning ignores the influence of the labeled samples themselves: the labeled set is built by random sampling or stratified random sampling[35], which makes these methods perform poorly when only a limited number of annotations is available. Furthermore, semi-supervised learning from scratch is plagued by high training cost. Although the recent Flexmatch[42] accelerates convergence through curriculum pseudo-labeling, it still requires many training iterations. In this paper, we address both problems by utilizing information from self-supervised learning.

2.2.2 Semi-Supervised Learning based on Self-Supervised Models

S4L[40] trains a model with a self-supervised loss (rotation prediction[16]) and a semi-supervised loss, then re-trains the model on its own predictions. In SimCLRv2[10], the model is first trained in a purely self-supervised manner, then fine-tuned on the labeled subset, and finally self-distilled using all labeled and unlabeled samples. The performance of such a model therefore depends strongly on the classifier trained during the supervised phase, and it is reasonable to expect poor performance when annotations are limited. Recently, [3] proposed using label propagation instead of complex semi-supervised techniques. However, self-supervised features are not perfect for a specific task, so simple label propagation on these features does not give ideal results. In this paper, we construct a framework that uses pseudo-labels generated from self-supervised features to guide semi-supervised learning.

2.2.3 Train Semi-Supervised Model Faster

Apart from stopping training early, the core idea for accelerating semi-supervised training is choosing an appropriate unlabeled set[26, 21, 22]. RETRIEVE[22] is representative: it formulates a bi-level optimization problem that selects the subset of unlabeled samples minimizing the loss on labeled samples. But these methods usually sacrifice model accuracy, especially when the number of labeled samples is small. In this paper, we directly reduce the number of training iterations by guiding training with prior pseudo-labels, accelerating semi-supervised training with almost no loss of, and often a gain in, model accuracy.

2.3 Self-Supervised Learning

Self-supervised learning has received extensive attention for its ability to provide high-quality feature representations without any labels. After early exploration of simple pretext tasks like rotation prediction[16] and jigsaw puzzles[29], contrastive training became mainstream. The contrastive losses in SimCLR[9] and Moco[19, 11] are built from positive pairs (augmented versions of the same input) and negative pairs (different images), and training forces the model to produce more similar representations for positive pairs than for negative pairs. Their main disadvantage is an extremely high computational burden. BYOL[17] and Simsiam[12] provide a more user-friendly alternative that trains with positive pairs only and requires far fewer computational resources. In this paper, we build our framework on self-supervised features.
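For reference, the positive-pair-only objective used by Simsiam is a symmetrized negative cosine similarity between a predictor output p and a (stop-gradient) projection z. The NumPy sketch below only evaluates the loss value; the stop-gradient, which matters during backpropagation, is omitted here.

```python
import numpy as np

def simsiam_loss(p1, z2, p2, z1):
    """Symmetrized negative cosine similarity between predictor
    outputs p and projections z of the two augmented views.
    All inputs are (N, D) arrays; returns a scalar loss value."""
    def neg_cos(p, z):
        p = p / np.linalg.norm(p, axis=1, keepdims=True)
        z = z / np.linalg.norm(z, axis=1, keepdims=True)
        return -np.mean(np.sum(p * z, axis=1))
    return 0.5 * neg_cos(p1, z2) + 0.5 * neg_cos(p2, z1)
```

When both views produce identical representations the loss reaches its minimum of -1, which is why an asymmetric predictor plus stop-gradient is needed in practice to avoid trivial collapse.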

3 Active Self-Semi-Supervised Learning

Suppose we collect an unlabeled dataset of images and have the budget to ask an oracle to annotate a small subset of them. We want to train the model with the resulting labeled set and the remaining unlabeled set at an acceptable training cost. In this section, we first give an overview of our active self-semi-supervised learning (AS3L) framework and then describe each module of AS3L: the proposed single-shot active learning strategy, the pseudo-label generation method, and a semi-supervised training method guided by prior pseudo-labels.

3.1 AS3L Framework

Figure 2: The framework of our Active Self-Semi-Supervised Learning (AS3L). AS3L consists of four components: (1) obtaining self-supervised features following [12]; (2) selecting labeled samples based on these features (Sec. 3.2); (3) cluster-based label propagation to obtain prior pseudo-labels (Sec. 3.3); (4) semi-supervised training guided by the priors (Sec. 3.4)

Existing semi-supervised methods improve model performance by building a virtuous cycle of updating the model and improving pseudo-labels: unlabeled samples with confident predictions are used to update the model, and the updated model then makes more accurate predictions on unlabeled samples. Generally, the model needs many training iterations before its predictions are accurate enough to produce good pseudo-labels. The intuition is that, given accurate priors at the starting point, the model does not need as many iterations. Moreover, with too few labels, the model may not make enough correct guesses to enter the virtuous cycle at all, leading to poor results. Our results (Sec. 4.2) suggest that providing accurate prior pseudo-labels is an effective way to bootstrap semi-supervised models into this virtuous cycle, even with fewer labels, by giving the model more constraints that are likely correct.

In our framework, self-supervised learning and label propagation produce prior pseudo-labels before semi-supervised training starts. To improve the accuracy of these priors, we develop an active strategy for selecting the labeled samples. This strategy is designed as a single-shot strategy to avoid adding too much extra running time.

We incorporate the priors into the existing semi-supervised training framework using the following rationale: ideally, while model predictions are inaccurate (the early stage of semi-supervised training), the priors should guide training; once model predictions become more accurate than the priors, using model predictions as pseudo-labels lets semi-supervised learning correct the pseudo-labels itself. Since the exact switching point is hard to find, we combine a rough switching point with a posterior pseudo-label. The rest of the approach is consistent with existing semi-supervised learning, using a supervised loss and consistency regularization to update the model.

To sum up, as shown in Fig. 2, we first obtain a feature representation from self-supervised training. Our active learning strategy then selects samples and queries their labels, and the annotations are propagated in the feature space to obtain prior pseudo-labels for the unlabeled samples. Finally, we train a semi-supervised model with a combination of these priors and the model's own predictions.

3.2 Single-shot Active Labeled Sample Selection

(a) Semi-Supervised feature
(b) Self-Supervised feature
Figure 3: t-SNE visualization of semi-supervised and self-supervised features, where the semi-supervised features are trained with 40 labels on CIFAR-10 following our method. The self-supervised features are more loosely clustered, and the boundaries between clusters are less well defined

The goal of our active learning strategy is to obtain accurate prior pseudo-labels. Thus, in addition to the traditional expectation that selected samples should cover the entire feature space well[31], we also want each selected sample to share its label with the dominant label among its surrounding samples. We rely on the observation that, in the self-supervised feature space, samples in the same cluster tend to have similar labels, and samples near the center of a cluster are more likely to share the label of the majority of that cluster. We therefore design our active strategy to find samples close to cluster centers.

Although self-supervised training provides very good features, with excellent performance under linear evaluation[9, 12], we find experimentally that clustering directly on them is not a good choice. One possible reason is that self-supervised features are distributed differently from features trained with labels: as shown in Fig. 3, self-supervised features are generally more scattered because they are trained on finer-grained surrogate tasks. The distance between features of the same class is therefore large and the distance between features of different classes small, which hurts the clustering algorithm and, in turn, the accuracy of our pseudo-labels. To obtain a more suitable feature space, we fine-tune these features based on the clusters before selecting samples to label, as explained next.

3.2.1 Fine-tuning features

To pull samples in the same cluster closer together, we use a mean squared error (MSE) loss that forces samples in the same cluster toward their cluster center. To improve robustness, we run K-means $T$ times on the self-supervised features, with the final loss defined by Eq. 1, where $f'_i$ is the fine-tuned feature of sample $i$, $c_{t,i}$ is the center of the cluster containing sample $i$ in the $t$-th clustering, and $N$ is the number of samples. During fine-tuning, since the loss is defined over $T$ clusterings with randomness, samples that stay in the same cluster across runs move closer together, while samples attracted by different cluster centers do not approach any single center, which improves the clustering results.

$$\mathcal{L}_{ft} = \frac{1}{TN}\sum_{t=1}^{T}\sum_{i=1}^{N}\left\lVert f'_i - c_{t,i}\right\rVert^2 \qquad (1)$$

Additionally, to limit the computational cost while roughly preserving the self-supervised feature structure, we add a single linear layer on top of the self-supervised encoder. During training, the encoder weights are frozen and only the weights of the new linear layer are adjusted. Finally, we select labeled samples and generate prior pseudo-labels from the output features of this linear layer.

3.2.2 Select Labeled Set Based on Multiple Clusters

Similarly, we perform K-means clustering multiple times on the fine-tuned features to improve robustness. We believe that samples grouped together across multiple random clusterings are more likely to belong to the same semantic category. We therefore find the samples that are assigned to the same cluster in every run, and select the medoids of these consensus groups for annotation. A detailed implementation is given in Algorithm 1 in the appendix.
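A minimal sketch of this selection idea, assuming plain K-means and using identical assignment tuples across runs as a simple proxy for "always clustered together" (the paper's exact grouping rule is in its Algorithm 1, which we do not reproduce here):

```python
import numpy as np

def kmeans_assign(X, k, iters=20, seed=0):
    """Plain K-means; returns each sample's cluster index."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(0)
    return assign

def select_labeled(X, budget, runs=6):
    """Group samples whose cluster assignment agrees across all runs,
    then annotate the medoid of each of the `budget` largest groups."""
    assigns = np.stack([kmeans_assign(X, budget, seed=t) for t in range(runs)])
    groups = {}
    for i in range(len(X)):
        groups.setdefault(tuple(assigns[:, i]), []).append(i)
    biggest = sorted(groups.values(), key=len, reverse=True)[:budget]
    picks = []
    for members in biggest:
        sub = X[members]
        # medoid: member minimizing total squared distance to its group
        dists = ((sub[:, None] - sub[None]) ** 2).sum(-1).sum(1)
        picks.append(members[int(np.argmin(dists))])
    return picks
```

Picking medoids of the largest consensus groups targets exactly the samples described in the text: close to a cluster center and stably clustered, hence likely to share the dominant label of their neighborhood.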

3.3 Prior Pseudo-label Generation

We generate prior pseudo-labels by propagating labels through clusters. We use constrained seed K-means[5] instead of plain K-means to benefit from the labeled-sample constraints for better clustering. The labels of the labeled samples are propagated to all unlabeled samples in the same cluster and then normalized to obtain the prior pseudo-labels. We note that as the number of clusters increases, the probability that samples in the same cluster share a label increases, so increasing the number of clusters improves the accuracy of the prior pseudo-labels. However, it also shrinks each cluster and increases the number of clusters that contain no labeled sample (especially once the number of clusters exceeds the number of labeled samples), leaving more unlabeled samples with no propagated label. As a trade-off, we use a different number of clusters in each of the multiple clustering runs, so that most unlabeled samples are covered by some labeled sample, while samples far from any labeled sample receive lower confidence.
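The propagation-and-normalize step can be sketched as below. Note the simplifications: the paper uses constrained seed K-means[5], while plain K-means stands in here, and samples covered by no labeled sample fall back to a uniform distribution (one reasonable choice; the paper does not pin this down in the extracted text).

```python
import numpy as np

def kmeans_assign(X, k, iters=20, seed=0):
    """Plain K-means; returns each sample's cluster index."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(0)
    return assign

def prior_pseudo_labels(X, labeled_idx, labels, num_classes, k_list=(10, 20, 30)):
    """Accumulate, over clusterings with different cluster counts, the
    labels of the labeled samples sharing a cluster with each sample,
    then normalize rows to soft prior pseudo-labels."""
    votes = np.zeros((len(X), num_classes))
    for seed, k in enumerate(k_list):
        assign = kmeans_assign(X, k, seed=seed)
        for c in np.unique(assign):
            members = np.where(assign == c)[0]
            for i in labeled_idx:
                if assign[i] == c:
                    votes[members, labels[i]] += 1
    row = votes.sum(1, keepdims=True)
    # normalize; rows with no propagated label fall back to uniform
    return np.divide(votes, row,
                     out=np.full_like(votes, 1.0 / num_classes),
                     where=row > 0)
```

Using several values of k realizes the trade-off described above: fine clusterings sharpen the priors near labeled samples, while coarse clusterings keep distant samples covered, at lower confidence.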

3.4 Semi-Supervised Training Guided by Prior

After active learning and label propagation, we have a labeled set and an unlabeled set with prior pseudo-labels $y^p$. We formulate the semi-supervised loss in Eq. 2, where $\mathcal{L}_s$ is the cross-entropy loss on labeled samples, $\mathcal{L}_u$ is the consistency loss on unlabeled samples, and $\lambda$ is a coefficient trading off the two losses.

$$\mathcal{L} = \mathcal{L}_s + \lambda\,\mathcal{L}_u \qquad (2)$$

$\mathcal{L}_u$ is defined in Eq. 4, where $\mu$ is the ratio of unlabeled to labeled samples in each training batch, $B$ is the batch size, $\tau_b$ is the adaptive threshold used in Flexmatch[42], $\mathcal{A}$ is a random strong data augmentation, $p_b$ is the model's prediction for a weakly augmented sample, $p_m(y\,|\,\mathcal{A}(u_b))$ is its prediction for the strongly augmented sample, $q_b$ is the final pseudo-label, $\hat{q}_b$ is the 'hard' one-hot form of $q_b$, and $\mathrm{H}$ is the cross-entropy loss.

$$\mathcal{L}_u = \frac{1}{\mu B}\sum_{b=1}^{\mu B}\mathbb{1}\!\left(\max(q_b) > \tau_b\right)\,\mathrm{H}\!\left(\hat{q}_b,\; p_m(y\,|\,\mathcal{A}(u_b))\right) \qquad (4)$$

As described in Sec. 3.1, semi-supervised models are good at improving pseudo-labels once they enter the virtuous cycle. To let the priors guide training while avoiding overfitting to noisy priors, we use the normalized sum of model predictions and priors as the final pseudo-labels during early training, as in Eq. 5, where $T_0$ is a pre-defined iteration number after which we assume the model has enough correct pseudo-labels.

$$q_b = \begin{cases}\mathrm{Normalize}\!\left(p_b + y^p_b\right), & t < T_0\\ p_b, & t \ge T_0\end{cases} \qquad (5)$$

To maximize the effect of semi-supervised training, we tried two options after iteration $T_0$: remove the priors, so the model becomes a standard semi-supervised framework, or update the priors by re-clustering on the semi-supervised features. We find experimentally that, except with very few annotations, removing the priors after iteration $T_0$ is a good choice.
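A small sketch of how the prior-mixed pseudo-label of Eq. 5 feeds the masked consistency loss of Eq. 4. The names are ours, not the paper's: `pred_weak`/`pred_strong` are assumed to be softmax outputs for the weak and strong augmentations, and `thresholds` holds per-class adaptive thresholds in the spirit of Flexmatch.

```python
import numpy as np

def final_pseudo_label(pred, prior, step, switch_step):
    """Eq. 5: before the switch point, mix the weak-augmentation
    prediction with the prior via a normalized sum; afterwards use
    the prediction alone."""
    q = pred + prior if step < switch_step else pred
    return q / q.sum(-1, keepdims=True)

def consistency_loss(pred_weak, pred_strong, prior, step, switch_step, thresholds):
    """Eq. 4 in miniature: masked cross entropy where only pseudo-labels
    whose max confidence clears the class-adaptive threshold contribute."""
    q = final_pseudo_label(pred_weak, prior, step, switch_step)
    hard = np.argmax(q, axis=1)                 # 'hard' one-hot label index
    mask = q.max(1) > thresholds[hard]          # adaptive confidence mask
    ce = -np.log(pred_strong[np.arange(len(q)), hard] + 1e-12)
    return (mask * ce).sum() / max(len(q), 1)
```

Mixing by a normalized sum means the prior can flip an early low-confidence prediction but is outvoted once the model becomes confident, which is exactly the soft hand-over the rough switching point relies on.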

4 Experiments

4.1 Setup

4.1.1 Datasets

Our method is evaluated on CIFAR-10, CIFAR-100 and STL-10, the standard benchmarks for semi-supervised learning. We experimented with labeled sets of various sizes, in particular with fewer annotated samples than previous papers (10 labeled samples for CIFAR-10, 200 for CIFAR-100, and 20 for STL-10).

4.1.2 Baseline

We compare our method with Flexmatch, a state-of-the-art semi-supervised method, both at the standard number of training iterations and with fewer training iterations. To compare the effect of labeled-sample selection, we use as baselines Coreset (K-center greedy)[31], K-medoids[31], and the stratified random sampling commonly used in semi-supervised learning. We also compare with linear evaluation, which trains a linear classifier on top of the frozen self-supervised encoder[12].

4.1.3 Implementation details

For a fair comparison, we use hyper-parameters and network architectures similar to most semi-supervised learning algorithms[42]: SGD with momentum 0.9, initial learning rate 0.03, loss weight λ of 1, unlabeled ratio μ of 7, batch size 64, and a cosine annealing learning rate schedule. Network architectures: WRN-28-2[39] for CIFAR-10, WRN-28-8 for CIFAR-100 and WRN-37-2 for STL-10. The standard semi-supervised algorithm is trained for 1,048,576 iterations; in the remaining experiments comparing active learning strategies, the number of training iterations is 192,288 (for CIFAR-10 with 10 labels) or 335,648 (otherwise). The switching point is iteration 6900. To avoid a heavy computational burden, we adopt Simsiam[12] for self-supervised learning with the same backbone as the semi-supervised stage, following Simsiam's hyper-parameter settings; the resulting network weights initialize the encoder for semi-supervised training. Clustering is run 6 times in both the active sampling strategy and label propagation. For active sampling, the number of clusters equals the number of selected samples. For label propagation, clustering uses constrained seed K-means[5] with the number of clusters set to 10, 20, 30, 40, 50 and 60, respectively. The linear layer used in Sec. 3.2 has the same dimension as the final layer of the backbone, and feature fine-tuning runs for 40 epochs.

4.2 Main results

The experimental results are shown in Table 1. We report the best model accuracy, following [42]. Our method consistently outperforms the other active learning strategies and, in most cases, even standard semi-supervised learning (which uses many more training iterations than ours). When the number of labeled samples is close to the number of true classes, the random baseline beats the existing active sampling strategies K-medoids[31] and Coreset-greedy[31], because those strategies cannot cover most classes in the dataset. With enough labeled samples, our method also matches the accuracy of standard semi-supervised training while taking only about 1/3 of the training time. Furthermore, on the more complex STL-10 dataset, the accuracy of semi-supervised training is very close to linear evaluation[12], possibly because STL-10 contains images from categories outside the classification task, which affects the semi-supervised algorithm more than it affects linear evaluation.

Dataset CIFAR-10 CIFAR-100 STL-10
Size of Labeled Set 10 40 200 400 20 40
Sampling Training
Random Early-stop Flexmatch 58.17 82.81 38.84 58.42 51.14 63.33
K-medoids Early-stop Flexmatch 47.26 87.42 43.39 59.10 51.03 68.29
Coreset Early-stop Flexmatch 31.92 86.19 27.59 47.77 45.75 51.41
Proposed Early-stop Proposed 84.43 94.25 51.22 61.07 61.39 70.26
Random Standard Flexmatch 66.07 95.03* 38.05 60.06* 54.30 70.85*
- Supervised 95.38* 80.70* -
Random Linear evaluation 42.26 64.01 34.97 43.05 55.92 66.69
Table 1: Comparison of active sampling strategies, semi-supervised learning, linear evaluation and the fully-supervised baseline, where early-stop denotes training for fewer iterations as in Sec. 4.1 and standard denotes training for 1,048,576 iterations. All results are averaged over 3 runs. * denotes results from TorchSSL[42]. The best results are shown in red and the second best in blue

4.2.1 Training Cost

We compare the training cost from two aspects: the number of training iterations and the actual running time. The computational cost of each self-supervised and semi-supervised iteration is similar when adopting Simsiam with the hyper-parameter settings of Sec. 4.1; the main difference lies in the loss function, so the total number of training iterations is a good approximate metric for training cost, independent of the specific classification task. For an exact comparison, we also report actual running time on a single RTX 3090 GPU. Because training on STL-10 requires multiple GPUs, whose implementation may introduce additional runtime differences, we only report results for the two datasets that run on a single GPU. As shown in Table 2, our method is about 3 times faster than standard semi-supervised learning in most cases.

# Iteration Running Time / hours
Dataset CIFAR-10 CIFAR-100 CIFAR-10 CIFAR-100
Size of Labeled Set 10 40 200/400 10 40 200/400
Flexmatch 1048576 98.82 98.82 353.69
Proposed 192288 335648 21.70 35.21 121.04
Speed up 5.45 3.12 4.55 2.81 2.92
Table 2: Training cost comparison between our method and standard semi-supervised learning, in practical running time on a single RTX 3090 GPU and in number of training iterations

4.3 Model Detailed Analysis

We further analyze our approach from three aspects: ablation studies for the whole framework, ablation studies for our active sampling strategy and the detailed analysis on pseudo-label propagation.

4.3.1 Ablation Experiment of Framework

Ablation experiments were performed on CIFAR-10. We evaluate three components of our method: active labeled-set selection, prior pseudo-label warm-up, and re-clustering to update the prior pseudo-labels. The results are shown in Table 3. (1) Actively selecting annotated samples effectively improves semi-supervised performance when the amount of annotation is small. (2) Guiding semi-supervised training with the priors early on improves model performance regardless of whether the labeled samples are selected by the proposed strategy, confirming our claim that a better starting point improves semi-supervised training. (3) The best way to use the priors late in training depends on the number of annotations: when it exceeds the number of true classes, prior guidance is unnecessary in the later stage; otherwise, the continuously updated priors should be retained.

with 10 labels
Ablation Accuracy Ablation Accuracy
Random 58.17±13.01 Active 76.65±5.29
Random+Prior+Remove 68.01±15.96 Active+Prior+Remove 82.95±5.45
Random+Prior+Update 73.80±16.75 Active+Prior+Update 84.43±5.19
with 40 labels
Ablation Accuracy Ablation Accuracy
Active+Prior+Remove 94.25±0.43 Active+Prior+Update 90.51±0.99
Table 3: Ablation study on CIFAR-10. Accuracy is averaged over 3 runs

4.3.2 Ablation Study of Active Sampling Strategy

We compare the effects of the components of our proposed active learning strategy, with ablation experiments on CIFAR-10. Here, K-medoids means clustering only once, and multi-clustering means clustering six times, as described in Sec. 4.1. As shown in Table 4, fine-tuning the features leads to significant improvements: better class coverage and more accurate pseudo-labels. Multi-clustering and feature fine-tuning bring the greatest benefit when annotations are fewest.

                                             Accuracy of prior pseudo-labels   Class coverage
# Labels                                     10       40                       10     40
K-medoids on self-supervised features        42.50    66.16                    7.2    10
Multi-cluster on self-supervised features    61.94    72.54                    8.5    10
K-medoids on fine-tuned features             64.70    69.40                    9.2    10
Multi-cluster on fine-tuned features         71.41    72.63                    9.4    10

Table 4: Ablation study of the proposed active learning strategy on CIFAR-10. Accuracy reported is the average over 3 runs
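To make the selection mechanism compared above concrete, the sketch below picks a labeled set by clustering (self-supervised) features and annotating the sample nearest each cluster centre, optionally repeating the clustering several times and accumulating medoid votes. This is a simplified stand-in, not our exact procedure from Sec. 4.1: it approximates K-medoids by plain k-means plus nearest-sample medoids, and the function name is illustrative.

```python
import numpy as np

def select_labeled_set(features, budget, n_rounds=1, seed=0):
    """Pick `budget` samples to annotate: cluster the feature space and
    take the real sample closest to each cluster centre (a medoid).
    With n_rounds > 1, clustering is repeated and medoid votes are
    accumulated, loosely mimicking the multi-clustering strategy."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(features))
    for _ in range(n_rounds):
        # simple k-means (Lloyd's algorithm), randomly initialised
        centres = features[rng.choice(len(features), budget, replace=False)]
        for _ in range(20):
            d = np.linalg.norm(features[:, None] - centres[None], axis=2)
            assign = d.argmin(1)
            for k in range(budget):
                members = features[assign == k]
                if len(members):
                    centres[k] = members.mean(0)
        # medoid = actual sample nearest to each converged centre
        d = np.linalg.norm(features[:, None] - centres[None], axis=2)
        for k in range(budget):
            votes[d[:, k].argmin()] += 1
    # samples selected most often across rounds get annotated
    return np.argsort(-votes)[:budget]
```

On well-clustered features this tends to cover the classes even with a budget close to the number of classes, which is the class-coverage effect reported in Table 4.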

We experiment with the effect of different numbers of clustering rounds on the samples selected by our proposed active learning strategy. The experiment is implemented on CIFAR-10 with 10 labels. As shown in Fig. 4, our strategy is robust to the number of clustering rounds: for any number of rounds, it selects samples that cover all classes well, even in settings with very few annotations. The number of rounds mainly affects the accuracy of the prior pseudo-labels: more rounds yield a smaller variance and slightly more accurate pseudo-labels.

Figure 4: The effect of the number of clustering rounds in the active learning strategy on the generated prior pseudo-labels. (a) Accuracy of prior pseudo-labels; (b) class coverage. Class coverage is robust to the number of rounds, while more rounds yield more accurate pseudo-labels

4.3.3 Prior Pseudo-label Propagation

For prior pseudo-label generation, we compare our method with LLGC [43], a typical baseline for label spreading, with the hyper-parameters of LLGC following [20]. We also compare, in Table 5, the effect of selecting labeled samples with different active learning methods on the accuracy of the pseudo-labels. The results confirm that the choice of labeled-sample selection strategy has a large impact on pseudo-label accuracy: the samples selected by our active learning strategy yield more accurate pseudo-labels under every label propagation method. Our pseudo-label propagation method is much better than LLGC when the number of labeled samples is close to the number of true classes, but slightly weaker than LLGC when more labels are available.

Dataset                                   CIFAR-10                 CIFAR-100
Label propagation    Sampling strategy    10      20      40       400
LLGC                 Random               53.42   59.91   59.59    46.55
LLGC                 Coreset              31.57   56.06   74.47    43.60
LLGC                 K-medoids            45.53   62.28   71.51    49.16
LLGC                 Proposed             62.94   71.61   75.50    51.50
Proposed             Random               52.12   60.32   69.40    42.60
Proposed             Coreset              37.43   59.56   73.82    38.10
Proposed             K-medoids            44.90   62.12   49.62    42.90
Proposed             Proposed             71.41   72.63   74.08    47.60

Table 5: Accuracy of prior pseudo-labels under different active sampling strategies and label propagation methods. Accuracy reported is the average over 3 runs
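For reference, the LLGC baseline of Zhou et al. [43] spreads the few ground-truth labels over a normalised affinity graph by iterating F ← αSF + (1−α)Y. A minimal NumPy sketch follows; the Gaussian-kernel sigma, alpha, and iteration count here are illustrative defaults, not the hyper-parameters taken from [20]:

```python
import numpy as np

def llgc(features, labeled_idx, labels, n_classes,
         sigma=1.0, alpha=0.99, n_iter=50):
    """Learning with Local and Global Consistency (Zhou et al., 2003):
    propagate seed labels over a Gaussian affinity graph to produce
    prior pseudo-labels for every sample."""
    n = len(features)
    # Gaussian affinity matrix W with zero diagonal
    d2 = ((features[:, None] - features[None]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # symmetric normalisation S = D^{-1/2} W D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(1))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # one-hot seed matrix Y for the labeled samples
    Y = np.zeros((n, n_classes))
    Y[labeled_idx, labels] = 1.0
    # iterate F <- alpha * S F + (1 - alpha) * Y
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * S @ F + (1 - alpha) * Y
    return F.argmax(1)
```

The same F ← αSF + (1−α)Y update underlies scikit-learn's LabelSpreading, which can be used as a drop-in alternative to this hand-rolled version.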

Furthermore, we study the impact of the numbers of clusters K used in our label propagation. The experiments consist of three settings with different values of K, as shown in Table 6. Expected calibration error (ECE) [28] describes how well the confidence of the prior pseudo-labels is calibrated to their true accuracy; smaller values indicate less miscalibration. The results confirm that using different values of K across clustering rounds in label propagation is a good compromise between accuracy and calibration.

                           ECE                       Accuracy of prior pseudo-labels
# Labels                   10      20      40        10      20      40
K=[10,10,10,10,10,10]      0.265   0.161   0.114     71.22   72.92   73.06
K=[10,20,30,40,50,60]      0.237   0.087   0.069     71.41   72.63   74.08
K=[10,60,60,60,60,60]      0.214   0.132   0.080     70.83   70.46   74.03

Table 6: Ablation study of K on CIFAR-10. Accuracy of prior pseudo-labels and expected calibration error (ECE) reported are the average over 3 runs
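The ECE values reported in Table 6 follow the standard binning scheme of Naeini et al. [28]: predictions are grouped into confidence bins, and the per-bin gap between mean confidence and empirical accuracy is averaged, weighted by bin size. The sketch below is our own minimal implementation with equal-width bins:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: bin predictions by confidence and accumulate the
    |accuracy - confidence| gap per bin, weighted by the fraction
    of samples in that bin. `correct` is a 0/1 array."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        # half-open bins (lo, hi]; confidence exactly 0 is ignored
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece
```

A perfectly calibrated model (e.g. 75% accuracy at 0.75 confidence) yields ECE near 0, while an overconfident one accumulates a large gap.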

5 Conclusions

In this paper, we show that prior pseudo-labels can serve as a good intermediate step to transfer information from self-supervised features to improve semi-supervised training. We also show that a single-shot active learning strategy can enhance this prior. Semi-supervised training guided by this prior can greatly improve the performance of the model with few annotations while reducing computational cost.


  • [1] Inigo Alonso, Matan Yuval, Gal Eyal, Tali Treibitz, and Ana C Murillo. Coralseg: Learning coral segmentation from sparse annotations. Journal of Field Robotics, 36:1456–1477, 2019.
  • [2] Jordan T. Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. In International Conference on Learning Representations, 2020.
  • [3] Haoping Bai, Meng Cao, Ping Huang, and Jiulong Shan. Self-supervised semi-supervised learning for data labeling and quality evaluation. arXiv preprint arXiv:2111.10932, 2021.
  • [4] Maria-Florina Balcan, Andrei Broder, and Tong Zhang. Margin based active learning. In International Conference on Computational Learning Theory, pages 35–50. Springer, 2007.
  • [5] Sugato Basu, Arindam Banerjee, and Raymond Mooney. Semi-supervised clustering by seeding. In Proceedings of the 19th International Conference on Machine Learning. Citeseer, 2002.

  • [6] David Berthelot, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, and Colin Raffel. Remixmatch: Semi-supervised learning with distribution matching and augmentation anchoring. In International Conference on Learning Representations, 2020.
  • [7] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. Mixmatch: A holistic approach to semi-supervised learning. Advances in Neural Information Processing Systems, 32, 2019.
  • [8] Michael Bewley, Ariell Friedman, Renata Ferrari, Nicole Hill, Renae Hovey, Neville Barrett, Ezequiel M Marzinelli, Oscar Pizarro, Will Figueira, Lisa Meyer, et al. Australian sea-floor survey data, with images and expert annotations. Scientific data, 2:1–13, 2015.
  • [9] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
  • [10] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. Big self-supervised models are strong semi-supervised learners. Advances in neural information processing systems, 33:22243–22255, 2020.
  • [11] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
  • [12] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15750–15758, 2021.
  • [13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
  • [14] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep bayesian active learning with image data. In International Conference on Machine Learning, pages 1183–1192. PMLR, 2017.
  • [15] Mingfei Gao, Zizhao Zhang, Guo Yu, Sercan Ö Arık, Larry S Davis, and Tomas Pfister. Consistency-based semi-supervised active learning: Towards minimizing labeling cost. In European Conference on Computer Vision, pages 510–526. Springer, 2020.
  • [16] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations, 2018.
  • [17] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271–21284, 2020.
  • [18] Jiannan Guo, Haochen Shi, Yangyang Kang, Kun Kuang, Siliang Tang, Zhuoren Jiang, Changlong Sun, Fei Wu, and Yueting Zhuang. Semi-supervised active learning for semi-supervised models: exploit adversarial examples with graph-based virtual labels. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2896–2905, 2021.
  • [19] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
  • [20] Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondrej Chum. Label propagation for deep semi-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5070–5079, 2019.
  • [21] Krishnateja Killamsetty, S Durga, Ganesh Ramakrishnan, Abir De, and Rishabh Iyer. Grad-match: Gradient matching based data subset selection for efficient deep model training. In International Conference on Machine Learning, pages 5464–5474. PMLR, 2021.
  • [22] Krishnateja Killamsetty, Xujiang Zhao, Feng Chen, and Rishabh Iyer. Retrieve: Coreset selection for efficient and robust semi-supervised learning. Advances in Neural Information Processing Systems, 34, 2021.
  • [23] Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, page 896, 2013.
  • [24] David D Lewis and Jason Catlett. Heterogeneous uncertainty sampling for supervised learning. In Machine learning proceedings 1994, pages 148–156. Elsevier, 1994.
  • [25] Rafid Mahmood, Sanja Fidler, and Marc T Law. Low budget active learning via wasserstein distance: An integer programming approach. arXiv preprint arXiv:2106.02968, 2021.
  • [26] Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models. In International Conference on Machine Learning, pages 6950–6960. PMLR, 2020.
  • [27] Sudhanshu Mittal, Maxim Tatarchenko, Özgün Çiçek, and Thomas Brox. Parting with illusions about deep active learning. arXiv preprint arXiv:1912.05361, 2019.
  • [28] Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
  • [29] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European conference on computer vision, pages 69–84. Springer, 2016.
  • [30] Mamshad Nayeem Rizve, Kevin Duarte, Yogesh S Rawat, and Mubarak Shah. In defense of pseudo-labeling: An uncertainty-aware pseudo-label selection framework for semi-supervised learning. In International Conference on Learning Representations, 2021.
  • [31] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.
  • [32] Samarth Sinha, Sayna Ebrahimi, and Trevor Darrell. Variational adversarial active learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5972–5981, 2019.
  • [33] Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in Neural Information Processing Systems, 33:596–608, 2020.
  • [34] Toan Tran, Thanh-Toan Do, Ian Reid, and Gustavo Carneiro. Bayesian generative active deep learning. In International Conference on Machine Learning, pages 6295–6304. PMLR, 2019.
  • [35] Xudong Wang, Long Lian, and Stella X Yu. Unsupervised data selection for data-centric semi-supervised learning. arXiv preprint arXiv:2110.03006, 2021.
  • [36] Qizhe Xie, Zihang Dai, Eduard Hovy, Thang Luong, and Quoc Le. Unsupervised data augmentation for consistency training. Advances in Neural Information Processing Systems, 33:6256–6268, 2020.
  • [37] Lin Yang, Yizhe Zhang, Jianxu Chen, Siyuan Zhang, and Danny Z Chen. Suggestive annotation: A deep active learning framework for biomedical image segmentation. In International conference on medical image computing and computer-assisted intervention, pages 399–407. Springer, 2017.
  • [38] Yazhou Yang and Marco Loog. Single shot active learning using pseudo annotators. Pattern Recognit., 89:22–31, 2019.
  • [39] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference 2016. British Machine Vision Association, 2016.
  • [40] Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4l: Self-supervised semi-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1476–1485, 2019.
  • [41] Beichen Zhang, Liang Li, Shijie Yang, Shuhui Wang, Zheng-Jun Zha, and Qingming Huang. State-relabeling adversarial active learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8756–8765, 2020.
  • [42] Bowen Zhang, Yidong Wang, Wenxin Hou, Hao Wu, Jindong Wang, Manabu Okumura, and Takahiro Shinozaki. Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling. Advances in Neural Information Processing Systems, 34, 2021.
  • [43] Dengyong Zhou, Olivier Bousquet, Thomas Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. Advances in neural information processing systems, 16, 2003.