
Progressive Representative Labeling for Deep Semi-Supervised Learning

by   Xiaopeng Yan, et al.

Deep semi-supervised learning (SSL) has attracted significant attention in recent years as a way to leverage a huge amount of unlabeled data to improve the performance of deep learning with limited labeled data. Pseudo-labeling is a popular approach to expand the labeled dataset. However, whether there is a more effective way of labeling remains an open problem. In this paper, we propose to label only the most representative samples to expand the labeled set. Representative samples, selected by the indegree of their corresponding nodes on a directed k-nearest neighbor (kNN) graph, lie in the k-nearest neighborhood of many other samples. We design a graph neural network (GNN) labeler to label them in a progressive learning manner. Aided by the progressive GNN labeler, our deep SSL approach outperforms state-of-the-art methods on several popular SSL benchmarks including CIFAR-10, SVHN, and ILSVRC-2012. Notably, we achieve 72.1% top-1 accuracy on the challenging ImageNet benchmark with only 10% labeled data.



1 Introduction

Deep neural networks (DNNs) have been dominating the field of computer vision and have even surpassed human-level performance for visual recognition (Simonyan and Zisserman, 2014; Deng et al., 2009; He et al., 2016). State-of-the-art visual recognition models for a wide range of tasks rely on supervised training, which requires large-scale human-labeled data. However, annotating data is expensive and sometimes involves expert knowledge. The expensive human annotation hinders the further development of data-hungry DNNs. Alternatively, semi-supervised learning (SSL) (Zhu, 2005) leverages unlabeled data to improve a model’s performance when only limited labeled data are available. As collecting large-scale unlabeled data is more practical and cheaper than collecting labeled data, deep SSL (Sajjadi et al., 2016; Xie et al., 2019a; Berthelot et al., 2019; Laine and Aila, 2016; Miyato et al., 2018) has been an emerging research direction.

Pseudo-labeling is a simple but effective approach in deep SSL. Previous approaches train an inductive model (e.g., a DNN (Lee, 2013; Yalniz et al., 2019; Rosenberg et al., 2005)) or a transductive model (e.g., label propagation (Iscen et al., 2019; Zhuang et al., 2019)) on the labeled set and pseudo-label the entire unlabeled set. We argue that these approaches have two limitations. 1) For sampling, most of these methods filter unlabeled data by a hand-crafted confidence rule. As shown in Figure 3(a), unlabeled samples with high confidence are more likely to be close to the labeled data. Unfortunately, these samples are not representative of the entire data distribution, so such a sampling strategy is not the optimal way to capture the intrinsic structure of the whole unlabeled set. 2) For labeling, these approaches diffuse labels from the labeled data to the unlabeled data in a one-step manner. However, it is extremely challenging to expand the labeling to the entire unlabeled data space at once, especially when the labeled samples are scarce. Thus, an effective sampling and labeling approach for utilizing unlabeled data in deep SSL is still awaiting exploration.

In this paper, we propose an indegree sampler to select the most representative samples for deep SSL, as shown in Figure 3(b). As representative samples should be contiguous to as many samples as possible in the feature space, a directed k-nearest-neighbor (kNN) graph is constructed over all samples, and the samples (i.e., nodes in the graph) that rank highest by indegree are selected as representatives. With the indegree sampler, we can select the samples in the high-density region to capture the structure of the sample space. The boundary assumption (Chapelle and Zien, 2005) states that the decision boundary should not cross the high-density region of a cluster. In other words, nodes lying in the high-density region are often reliable and are representative of the cluster. After sampling, we can apply an SSL approach to label the representatives.

(a) Sample selection by confidence
(b) Sample selection by indegree
Figure 3: Samples selected by different samplers. Confidence sampler selects samples by ranking their confidence predicted by DNN, while indegree sampler sorts their indegrees on a constructed directed kNN graph. ‘Red’ nodes represent labeled samples and ‘green’ nodes represent selected samples. We can see that confidence sampler selects samples close to the labeled data, while indegree sampler selects representatives capturing the structure of the whole dataset.

For labeling, we employ graph neural networks (GNNs) (Kipf and Welling, 2016) for their popularity in graph-based SSL. GNNs nicely integrate feature extraction and graph topology in their design. To effectively propagate labels from the very few human-labeled samples, we train a GNN to label representatives and progressively expand the labeled training set. In detail, the following steps are performed repeatedly: First, representatives are selected by sorting the indegrees among all remaining unlabeled samples. Second, a GNN labeler is trained on the labeled set and makes predictions on the representatives. Finally, the representatives with high confidence are assigned hard labels and added to the labeled set. For simplicity, we use two-layer simplified graph convolutional networks (SGC) (Wu et al., 2019) as the GNN labeler. To save computational cost, we do not retrain the CNN feature extractor with the new labeled set, although doing so may further improve the performance of deep SSL.

To demonstrate the effectiveness of the proposed progressive representative labeling (PRL) approach, we apply state-of-the-art deep SSL methods (e.g., consistency regularization) on the labeled set, the pseudo-labeled representative set, and the remaining unlabeled set generated by PRL. Our deep SSL framework includes three stages: supervised DNN training, PRL (which is lightweight compared to the DNN training), and semi-supervised DNN finetuning.

Our main contributions are threefold:

1) We propose to pseudo-label representative samples for expanding the labeled set in deep SSL. Indegrees on a directed kNN graph are used for representative selection.

2) We propose a GNN to label these representatives in a progressive learning manner.

3) We demonstrate the effectiveness of the PRL approach with extensive experiments on several SSL benchmarks, including CIFAR-10, SVHN, and ILSVRC-2012. Notably, we achieve 72.1% top-1 accuracy, surpassing the previous best result by 3.3%, on the challenging ImageNet benchmark with only 10% labeled data.

2 Related Work

Semi-supervised learning (SSL) is one of the classic problems in machine learning (Zhu, 2005). This section reviews the literature of deep SSL, an emerging topic in the deep learning era. The research of deep SSL can be divided into two major streams: pseudo-labeling, and regularization.

2.1 Pseudo-labeling for Deep SSL

Pseudo-labeling methods (Lee, 2013; Yalniz et al., 2019; Rosenberg et al., 2005) aim to take advantage of unlabeled data by assigning predicted labels to them. Lee (2013) infers pseudo-labels for unlabeled data by picking the class with the largest predicted probability and then fine-tunes the network with a cross-entropy loss. Simply selecting the class with the largest probability easily introduces noisy labels. To avoid this, Yalniz et al. (2019) propose a data sampling strategy to help select reliable samples. The sampler first ranks the confidence within each individual class and then chooses the top-k samples for each class. Xie et al. (2019b) infer noisy labels for unlabeled data and train a student model on them together with the labeled data. These methods do not take into account the importance of representative samples in SSL. In contrast, our PRL approach pseudo-labels only the most representative samples among the unlabeled data, selected as those with the largest indegree on a directed kNN graph that models the structure of the data space.

2.2 Regularization for Deep SSL

Regularization-based approaches, which optimize a heuristically-motivated objective, have been successful in deep SSL. Consistency regularization enforces that the model’s output remains unchanged when the input is perturbed (Sajjadi et al., 2016; Xie et al., 2019a; Laine and Aila, 2016). Entropy minimization (Grandvalet and Bengio, 2005; Miyato et al., 2018) encourages the model’s output distribution to have low entropy (i.e., to make “high-confidence” predictions) on unlabeled data. MixMatch (Berthelot et al., 2019) also implicitly achieves entropy minimization through the use of a “sharpening” function on the target distribution for unlabeled data. Our PRL approach is complementary to regularization-based approaches, and we demonstrate its effectiveness with the simple consistency regularization of Xie et al. (2019a).

2.3 Graph Neural Networks for Graph-based SSL

Graph neural networks (GNNs) generalize convolutional neural networks to the graph domain (Kipf and Welling, 2016; Li et al., 2018; Wu et al., 2019; Veličković et al., 2017). The GNN model can naturally be applied to graph-based SSL, as it combines graph structure and node features in the convolution. In the GNN model, the features of unlabeled samples are mixed with those of nearby labeled samples and propagated over the graph through multiple layers. In this paper, we choose the simplified graph convolutional network (Wu et al., 2019) for its computational efficiency. Note that other GNNs can also be considered, such as graph attention networks (Veličković et al., 2017), but this is not the focus of this work.

2.4 Label propagation for Deep SSL

Recent approaches (Iscen et al., 2019; Yang et al., 2016; Douze et al., 2018; Stretcu et al., 2019; Thekumparampil et al., 2018; Zhou et al., 2004) revisit the label propagation algorithm (Iscen et al., 2019) to leverage unlabeled data. The goal of label propagation is to extend the labeled data by diffusing the limited labels to the unlabeled data. Douze et al. (2018) employ label propagation on an approximate k-nearest neighbor graph for few-shot learning. Liu and Chang (2009) classify test images, which requires constructing the graph on the entire dataset. Luo et al. (2018) propose a graph as a similarity measure with respect to predicted features. Our work is different in that we inject a progressive scheme to train a GNN labeler, enlarging the labeled set with selected pseudo-labeled data that have large indegree and high confidence and are therefore representative and reliable in the sample space.

3 Progressive Representative Labeling

In deep semi-supervised learning (SSL), the learning algorithm has access to a small labeled set along with a large unlabeled set. The goal of our approach is to sample a representative unlabeled subset from the unlabeled set, such that the selected samples cover as much of the entire data space as possible (as shown in Figure 3(b)). To this end, we build a directed kNN graph over all data and characterize the density around each sample by its indegree. Moreover, we propose a progressive representative labeling (PRL) approach to pseudo-label representative samples for expanding the labeled set, as shown in Figure 4. We assign hard labels to the samples in the representative set and combine an advanced consistency training loss with the cross-entropy loss on the labeled set, the representative set, and the remaining non-representative unlabeled set.

Figure 4: Overview of the proposed progressive representative labeling (PRL) based on the indegree sampler. The grey line shows the progressive process of collecting representative unlabeled samples. Nodes in the directed kNN graph are sorted according to their indegrees. Top-ranked nodes (samples) are added to the representative set.

3.1 Representative Selection

The boundary assumption (Chapelle and Zien, 2005) indicates that the decision boundary is more likely to lie in the low-density region of a cluster, which means that nodes lying in the high-density region are often reliable. In addition, high-density nodes are more representative of the cluster than those in the sparse region. Motivated by this observation, we propose a novel indegree sampler to select representative unlabeled data. Concretely, a directed kNN graph $G = (V, A, X)$ is built based on the features of all labeled and unlabeled samples, where $V$ are the nodes corresponding to samples, $A$ is an adjacency matrix encoding the graph edges, and $X$ are the node features. Every edge points from a node to one of its top-k nearest neighbors, so a pair of nodes can be connected in one direction or in both. The edge construction of $G$ can be expressed through the entries of $A$ as

$$A_{ij} = \begin{cases} 1, & v_j \in \mathcal{N}_k(v_i), \\ 0, & \text{otherwise}, \end{cases}$$

where $\mathcal{N}_k(v_i)$ is the set of k-nearest neighbors of node $v_i$. The indegree of each node is the number of other nodes that count it among their neighbors, which is given by

$$d_j = \sum_{i} A_{ij}.$$
All nodes are sorted according to their indegrees. As shown in Figure 3(b), high-indegree samples are among the kNN of many other samples and lie in the high-density region of the data space. Therefore, the indegree sampler helps select the most representative samples, which capture the intrinsic structure of the dataset. After selecting the nodes with large indegree, we employ a GNN labeler to pseudo-label those nodes. As shown in Figure 3, the samples selected by the indegree sampler cover the whole data space well, while the regular confidence thresholding sampler focuses merely on the samples around the labeled data.
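The indegree computation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses brute-force squared Euclidean distances on toy 2-D points (the paper works with CNN features and Faiss), and the function names `knn_indices`, `indegrees`, and `select_representatives` are hypothetical.

```python
def knn_indices(features, k):
    """For each point, return the indices of its k nearest neighbors (itself excluded)."""
    neighbors = []
    for i, fi in enumerate(features):
        dists = []
        for j, fj in enumerate(features):
            if i == j:
                continue
            # Squared Euclidean distance; ties are broken by the smaller index.
            dists.append((sum((a - b) ** 2 for a, b in zip(fi, fj)), j))
        dists.sort()
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


def indegrees(neighbors):
    """Count, for every node j, how many nodes i list j among their k neighbors."""
    deg = [0] * len(neighbors)
    for nbrs in neighbors:          # edge i -> j for every j in nbrs
        for j in nbrs:
            deg[j] += 1
    return deg


def select_representatives(features, k, num_select, exclude=()):
    """Return the num_select highest-indegree nodes that are not in `exclude`."""
    deg = indegrees(knn_indices(features, k))
    excluded = set(exclude)
    candidates = [i for i in range(len(features)) if i not in excluded]
    candidates.sort(key=lambda i: (-deg[i], i))
    return candidates[:num_select]


# Toy data: four corners of a small square, its center, and one far outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05), (5.0, 5.0)]
deg = indegrees(knn_indices(points, k=2))
top = select_representatives(points, k=2, num_select=1)
```

In this toy example the center of the dense square is a neighbor of every other point and is selected first, while the isolated outlier is nobody's neighbor and receives indegree zero.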

1:  Build a kNN graph with features extracted by a deep feature extractor such as a CNN;
2:  Based on the graph, train the GNN labeler with the labeled set;
3:  Initialize the running representative unlabeled set as empty;
4:  while the size of the running representative set is below the target do
5:     Calculate indegrees for the nodes of the graph;
6:     Sort the nodes according to their indegrees;
7:     Excluding samples already in the running representative set, add the selected samples to it;
8:     Finetune the GNN labeler, update the node features, update the kNN graph;
9:  end while
10:  Pseudo-label the representative set with the GNN labeler.
Algorithm 1 Progressive Representative Labeling
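The loop of Algorithm 1 can be sketched on toy data as follows. Several parts are assumptions made only for illustration: a nearest-centroid classifier stands in for the GNN labeler, the features and the kNN graph are kept fixed across iterations (the paper updates both), and the cumulative fractions, threshold, and all function names (`indegree_rank`, `fit_centroid_labeler`, `progressive_labeling`) are hypothetical.

```python
def indegree_rank(features, k, candidates):
    """Rank candidate nodes by indegree on a directed kNN graph over all points."""
    n = len(features)
    indeg = [0] * n
    for i in range(n):
        order = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(features[i], features[j])),
        )
        for j in order[:k]:
            indeg[j] += 1
    return sorted(candidates, key=lambda j: (-indeg[j], j))


def fit_centroid_labeler(features, labeled):
    """Stand-in for the GNN labeler: returns predict(i) -> (label, confidence)."""
    sums, counts = {}, {}
    for i, y in labeled.items():
        acc = sums.setdefault(y, [0.0] * len(features[i]))
        for d, v in enumerate(features[i]):
            acc[d] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}

    def predict(i):
        # Inverse-distance scores, normalized so that they sum to one.
        scores = {
            y: 1.0 / (sum((a - b) ** 2 for a, b in zip(features[i], c)) + 1e-9)
            for y, c in centroids.items()
        }
        best = max(scores, key=scores.get)
        return best, scores[best] / sum(scores.values())

    return predict


def progressive_labeling(features, labeled, k, fractions, threshold):
    """Grow a pseudo-labeled representative set over several iterations."""
    labeled = dict(labeled)                      # expanded with pseudo-labels
    num_unlabeled = len(features) - len(labeled)
    representatives = {}
    for fraction in fractions:                   # cumulative targets per iteration
        quota = int(fraction * num_unlabeled) - len(representatives)
        predict = fit_centroid_labeler(features, labeled)   # retrain on expanded set
        remaining = [i for i in range(len(features)) if i not in labeled]
        for i in indegree_rank(features, k, remaining)[:max(quota, 0)]:
            label, confidence = predict(i)
            if confidence >= threshold:          # keep only confident predictions
                representatives[i] = label
                labeled[i] = label
    return representatives


# Two toy clusters with one labeled sample each (indices 0 and 5).
features = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2), (0.1, 0.1),
            (3.0, 3.0), (3.2, 3.0), (3.0, 3.2), (3.2, 3.2), (3.1, 3.1)]
reps = progressive_labeling(features, {0: "a", 5: "b"}, k=2,
                            fractions=(0.3, 0.4, 0.5), threshold=0.9)
```

Each pass retrains the labeler on the labeled set plus everything pseudo-labeled so far, so later and harder samples are predicted by a labeler that has already seen the dense cluster centers.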

3.2 Representative Labeling with Progressive GNN

As shown in Algorithm 1, representative labeling is designed in a progressive learning manner. In the first progressive step, the selected representative samples are expected to be easy samples with reliable pseudo-labels. These reliable pseudo-labeled samples in turn refine the node features and the kNN graph, and thus help mine hard representative unlabeled samples.

GNN labeler

Graph neural networks (GNNs) (Scarselli et al., 2008) have been widely used in SSL for their strength in modeling graph-based relations. In this work, we employ a GNN (Scarselli et al., 2008) on the kNN graph to predict pseudo-labels for representative samples. We adopt the following steps to label representative samples. 1) First, we build the directed graph using the features of all labeled and unlabeled samples. 2) Second, the GNN is trained under the supervision of the labeled set with a cross-entropy loss. 3) Finally, we assign hard labels to the representative samples based on the GNN prediction and filter out the samples whose confidence is lower than a threshold.

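Specifically, we employ a two-layer Simplified Graph Convolutional (SGC) network (Wu et al., 2019) as the GNN labeler. The SGC network takes the node features $X$ and the adjacency matrix $A$ as input and outputs a new feature for each node. The forward computation in each layer of the SGC network is formulated as:

$$H^{(l+1)} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} \Theta^{(l)},$$

where $\tilde{A} = A + I$, and $\tilde{D}$ is a diagonal matrix with diagonal entries $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. $H^{(l)}$ and $H^{(l+1)}$ are the input and output of the $l$-th SGC layer, and $\Theta^{(l)}$ are learnable parameters.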
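A layer of this kind, i.e., normalized neighborhood averaging followed by a linear map with no nonlinearity in between, can be sketched in plain Python. The sketch assumes the standard normalization with self-loops and row-sum degrees; the toy graph, the identity weights, and the helper names (`normalized_adjacency`, `matmul`, `sgc_forward`) are hypothetical and chosen only to keep the example small.

```python
import math


def normalized_adjacency(adj):
    """Return S = D^{-1/2} (A + I) D^{-1/2}, with D the row-sum degree of A + I."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sgc_forward(adj, features, weights):
    """Apply one propagate-then-project step per weight matrix, then a softmax."""
    s = normalized_adjacency(adj)
    h = features
    for w in weights:               # one entry per layer, no nonlinearity in between
        h = matmul(matmul(s, h), w)
    return [softmax(row) for row in h]


# Toy path graph 0 - 1 - 2 with 2-D features and identity weights for both layers.
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
features = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
s = normalized_adjacency(adj)
probs = sgc_forward(adj, features, [identity, identity])
```

With two layers, the middle node receives equal evidence from both ends of the path and ends up with a uniform prediction, while each end node keeps a preference for its own feature dimension.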

Progressive GNN

The above shows the first labeling process by GNN. We then plug the GNN into a progressive process. Concretely, we denote the GNN labeler in iteration as and the input feature for node is computed as:


where is a CNN model trained on , is the image of node . Then we train the GNN on the expanded set, i.e., the labeled set together with the representative set collected from iteration 0 up to the current iteration. A new set is then selected from the remaining unlabeled set. We repeat the procedure several times until, after the final progressive iteration, the size of the collected representative set reaches its target. Instead of labeling all samples simultaneously, the GNN gradually enlarges the labeled set by adding nodes from the high-density (certain) region to the low-density (uncertain) region in a progressive learning manner. Meanwhile, inferring the low-density nodes becomes more accurate as high-density nodes are gradually added to train the GNN.

3.3 Our Deep SSL Framework

The pipeline of our framework is shown in Algorithm 2. First, we train a DNN model as a feature extractor on the labeled set. Second, our representative labeling approach utilizes a progressive GNN to collect and pseudo-label a representative set. Finally, we finetune the model with a consistency loss along with the cross-entropy loss, using the labeled set, the pseudo-labeled representative set, and the remaining unlabeled set. The objective function in this phase combines the cross-entropy loss on the labeled and pseudo-labeled samples with a consistency loss between $p^{w}(x)$ and $p^{s}(x)$, the probability vectors predicted for a sample $x$ under a weak augmentation (e.g., random crop) and a strong augmentation (e.g., random augment), respectively. $\mathrm{CE}$ and $\mathrm{KL}$ denote the cross-entropy loss and the Kullback-Leibler divergence loss, defined as follows:

$$\mathrm{CE}(p, q) = -\sum_{c} p_c \log q_c, \qquad \mathrm{KL}(p \,\|\, q) = \sum_{c} p_c \log \frac{p_c}{q_c},$$

where $p_c$ and $q_c$ are the probabilities of the $c$-th category. In fact, PRL is a lightweight component compared to the initial DNN training and the final model finetuning, and it can be plugged into any semi-supervised framework.
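The two loss terms used in the finetuning stage can be written out directly. This is a hedged sketch rather than the paper's exact objective: the relative weight `weight`, the averaging over each batch, the direction of the KL term (weak-augmentation prediction as the target), and the function names are all assumptions made for illustration.

```python
import math


def cross_entropy(target, pred, eps=1e-12):
    """CE(target, pred) = -sum_c target_c * log(pred_c)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_c p_c * log(p_c / q_c), skipping zero-probability terms."""
    return sum(pc * math.log((pc + eps) / (qc + eps)) for pc, qc in zip(p, q) if pc > 0)


def finetune_loss(labeled_batch, unlabeled_batch, weight=1.0):
    """Supervised CE on (pseudo-)labeled samples plus a weighted consistency term.

    labeled_batch:   list of (one_hot_label, predicted_probs)
    unlabeled_batch: list of (probs_weak_aug, probs_strong_aug)
    """
    supervised = sum(cross_entropy(y, p) for y, p in labeled_batch) / max(len(labeled_batch), 1)
    consistency = sum(kl_divergence(pw, ps) for pw, ps in unlabeled_batch) / max(len(unlabeled_batch), 1)
    return supervised + weight * consistency


# One labeled sample with a maximally uncertain prediction, one unlabeled sample
# whose weak and strong views agree exactly (zero consistency penalty).
loss_agree = finetune_loss([([1.0, 0.0], [0.5, 0.5])], [([0.7, 0.3], [0.7, 0.3])])
loss_disagree = finetune_loss([([1.0, 0.0], [0.5, 0.5])], [([0.7, 0.3], [0.3, 0.7])])
```

When the two augmented views agree, only the supervised term remains; any disagreement between the views increases the loss.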

1:  Train an initial CNN with the labeled data, as the feature extractor;
2:  Build a directed kNN graph with the features extracted on the labeled and unlabeled data;
3:  Collect the representative unlabeled set using the proposed indegree sampler and progressively train a GNN labeler;
4:  Pseudo-label the representative set with the GNN labeler;
5:  Finetune the CNN model with the labeled set, the representative set, and the remaining unlabeled set, using the objective function of Section 3.3
Algorithm 2 Deep SSL framework

4 Experiments

4.1 Datasets

In the semi-supervised learning setting, the entire training set will be split into two parts. A small portion of training images are treated as labeled data and the rest are as unlabeled data. We conduct experiments on the following three standard semi-supervised image classification benchmarks:

  • For CIFAR-10 (Krizhevsky and Hinton, 2010), we provide both settings of 1% labeled set and 10% labeled set. Under 1% labeled setting, we have 250 labeled data and 49,750 unlabeled data, with each class having only 25 labeled data. For 10% labeled setting, the number of labeled and unlabeled data reaches 4,000 and 46,000, respectively.

  • Street View House Numbers (SVHN) (Netzer et al., 2011) uses settings similar to those of CIFAR-10: we have 250 labeled data and 72,007 unlabeled data in the setting of SVHN (250), and 1,000 labeled with 71,257 unlabeled data in the setting of SVHN (1000).

  • ILSVRC-2012 (Deng et al., 2009) is adapted to the semi-supervised learning setting following previous work (Zhai et al., 2019; Yalniz et al., 2019). We use 10% of the labeled data (roughly 128,000 samples) of the ImageNet (Deng et al., 2009) dataset and use the remaining samples as unlabeled data.

4.2 Implementation Details

Following previous works (Xie et al., 2019a; Berthelot et al., 2019; Tarvainen and Valpola, 2017; Miyato et al., 2018), we employ Wide-ResNet-28-2 (Zagoruyko and Komodakis, 2016) as the base model for CIFAR-10 (Krizhevsky and Hinton, 2010) and SVHN (Netzer et al., 2011), and ResNet-50 (He et al., 2016) for ImageNet (Deng et al., 2009). In the progressive representative labeling (PRL), we consider k=5 neighbors for each sample to construct the kNN graph, and we stack two SGC layers as the SGC labeler, where the dimension of the hidden state is set to 64. Besides, we progressively train the PRL for three iterations, in which the indegree sampler selects 30%, 40%, and 50% of the samples as representatives, respectively, in order of decreasing indegree and subject to a confidence threshold. In the next section, we analyze the progressive iterations in detail. For fair comparison, we use the same training protocols, including data preprocessing, learning rate schedule, and optimizer, across all SSL methods. Specifically, we implement our progressive representative labeling approach in PyTorch (Paszke et al., 2017), running on a TITAN X GPU. With one GPU, our PRL costs only half an hour using the Faiss (Johnson et al., 2019) tools for billion-scale nearest neighbour search, which is negligible compared to finetuning the CNN model, which takes nearly 24 hours on 8 GPUs.

|  | Supervised | Pseudo-labels | VAT | VAT-EM |  | LLP | UDA (w/ Aug) | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Top-1 Accuracy | - | - | - | - | - | - | 68.78 | 72.08 |
| Top-5 Accuracy | 80.43 | 82.41 | 82.78 | 83.39 | 83.83 | 88.53 | 88.80 | 90.75 |

Table 1: Top-1 and top-5 accuracy on the ImageNet (Deng et al., 2009) validation set with only 10% labeled data
| Method | CIFAR-10, 250 labels | CIFAR-10, 4000 labels | SVHN, 250 labels | SVHN, 1000 labels |
| --- | --- | --- | --- | --- |
| Pseudo-Label (Lee, 2013) | 49.78 ± 0.43 | 16.09 ± 0.28 | 20.21 ± 1.09 | 7.62 ± 0.29 |
| Π Model (Laine and Aila, 2016) | 54.26 ± 3.79 | 14.01 ± 0.38 | 18.96 ± 1.92 | 7.54 ± 0.36 |
| Mean Teacher (Tarvainen and Valpola, 2017) | 47.32 ± 4.71 | 10.36 ± 0.25 | 6.45 ± 2.23 | 3.75 ± 0.10 |
| VAT (Miyato et al., 2018) | 36.03 ± 2.82 | 13.86 ± 0.27 | 8.41 ± 1.01 | 5.63 ± 0.20 |
| VAT+EntMin (Miyato et al., 2018) | 36.32 ± 2.13 | 13.13 ± 0.39 | 8.15 ± 0.97 | 5.35 ± 0.19 |
| ICT (Verma et al., 2019) | 13.28 ± 0.42 | 7.66 ± 0.17 | 4.31 ± 0.17 | 3.53 ± 0.07 |
| MixMatch (Berthelot et al., 2019) | 11.08 ± 0.87 | 6.24 ± 0.10 | 3.78 ± 0.26 | 3.27 ± 0.31 |
| UDA (Xie et al., 2019a) | 8.82 ± 1.08 | 5.29 ± 0.25 | 5.69 ± 2.76 | 2.67 ± 0.10 |
| UDA + Ours (PRL) | 7.54 ± 0.35 | 5.12 ± 0.09 | 4.73 ± 0.67 | 2.39 ± 0.11 |

Table 2: Test error rates (%) of semi-supervised learning methods on CIFAR-10 (Krizhevsky and Hinton, 2010) and SVHN (Netzer et al., 2011) over 5 different folds, using the Wide ResNet-28-2 (Zagoruyko and Komodakis, 2016) network with different label ratios

4.3 Comparison with State-of-the-Art

Performance on ImageNet

We first evaluate our method on the large and complex ImageNet dataset, using 10% of its labeled data and treating the rest as unlabeled data. We use the ResNet-50 (He et al., 2016) network as our base model to extract initial features. All the competitors in Table 1 also use ResNet-50 as the backbone. Our approach achieves new state-of-the-art performance on ImageNet (Deng et al., 2009), with a top-1 accuracy of 72.08% and a top-5 accuracy of 90.75%. Note that UDA (Xie et al., 2019a) applies a similar consistency regularization with weak and strong augmentation techniques and obtains the previous leading performance. We exceed UDA (Xie et al., 2019a) by a margin of 3.30% and 1.95% in top-1 and top-5 accuracy, respectively. Compared to the large-scale label propagation method LLP (Zhuang et al., 2019), our approach achieves a significant improvement of 2.22% in top-5 accuracy.

Performance on CIFAR-10 and SVHN

Following the standard settings, we conduct experiments on 5 different folds of labeled data and report the mean test error rate with the variance. We apply Wide-ResNet-28-2 (Zagoruyko and Komodakis, 2016) in our experiments. As shown in Table 2, aided by the progressive GNN labeler, we achieve state-of-the-art performance compared with UDA (Xie et al., 2019a) on the CIFAR-10 and SVHN benchmarks with different label ratios. For example, with 250 labels on CIFAR-10 and SVHN, we obtain test error rates of 7.54% and 4.73%, which are better than UDA (Xie et al., 2019a) by 1.28% and 0.96%, respectively. These results demonstrate the effectiveness of our PRL approach.

5 Ablation Study

5.1 Effect of indegree sampler

In order to analyze the effect of our proposed indegree sampler, we compare it with three competitors: (1) a sampler-free method (denoted as None), which does not select a portion of the unlabeled data for training but instead assigns a hard label to every unlabeled sample according to its maximum prediction; (2) confidence thresholding (denoted as Conf. Thres.), which ranks all samples by their maximum predicted probabilities and selects the samples whose maximum probability is larger than a given threshold; and (3) class-wise confidence top-k (denoted as Class-wise Conf. Top-k), which ranks the samples of each class independently by their predicted probability for that class and selects the same number k of samples for each class (Yalniz et al., 2019). For fair comparison, we exclude the influence of the number of samples by fixing the total number of samples selected by the different samplers. As shown in Table 3, the indegree sampler consistently achieves better performance than all other samplers. In particular, when we use our progressive GNN as the labeler, the indegree sampler significantly boosts the performance: it outperforms the class-wise sampler by 1.2% (72.1% vs. 70.9%) and the confidence thresholding sampler by 2.7%. This indicates that our proposed sampler is more effective at finding representative and reliable samples.
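The two confidence-based baselines can be stated compactly in code. The following is a minimal sketch on a toy matrix of predicted class probabilities; the function names, the threshold value, and the toy predictions are hypothetical and not taken from the paper's experiments.

```python
def confidence_threshold_sampler(probs, threshold):
    """Select every sample whose maximum class probability exceeds the threshold."""
    return [i for i, p in enumerate(probs) if max(p) > threshold]


def classwise_topk_sampler(probs, k):
    """For each class, select the k samples with the highest probability for that class."""
    num_classes = len(probs[0])
    selected = {}
    for c in range(num_classes):
        ranked = sorted(range(len(probs)), key=lambda i: (-probs[i][c], i))
        selected[c] = ranked[:k]
    return selected


# Toy predictions for five samples and two classes.
probs = [[0.90, 0.10], [0.60, 0.40], [0.20, 0.80], [0.45, 0.55], [0.95, 0.05]]
by_threshold = confidence_threshold_sampler(probs, threshold=0.7)
by_class = classwise_topk_sampler(probs, k=2)
```

The threshold rule may return very different numbers of samples per class, whereas the class-wise rule always returns k per class, even when that means accepting a low-confidence sample such as the fourth one here.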

| Labeler | None | Conf. Thres. | Class-wise Conf. Top-k | Ours (Indegree) |
| --- | --- | --- | --- | --- |
| CNN | 68.5 | 68.3 | 70.2 | 70.3 |
| Ours | 70.1 | 69.4 | 70.9 | 72.1 |

Table 3: Comparison between different samplers (columns) on top-1 accuracy, trained on 10%-labeled ImageNet
| Sampler | CNN | Label Propagation | GNN | Progressive GNN |
| --- | --- | --- | --- | --- |
| Ours (Indegree) | 70.3 | 70.8 | 71.2 | 72.1 |

Table 4: Comparison between different labelers (columns) on top-1 accuracy, trained on 10%-labeled ImageNet
Figure 5: Visualization of examples predicted by baseline CNN model and our GNN model. We show the max probability and its corresponding class. ‘GT’ refers to the ground-truth label. It can be shown that our GNN model helps improve the under-confident examples and correct the wrong predictions. Best viewed in color.
Figure 6: Progressive learning with iterations using different samplers

5.2 Effect of GNN labeler

Since we have only a small amount of labeled data for base model training, the generalization ability of the model may not be reliable enough to guarantee good performance on the unlabeled data. The pseudo-labels predicted by the base model are sometimes ambiguous or under-confident, especially for the samples located near the decision boundary. Thus, the GNN labeler is proposed to incorporate neighbor information and improve the labeling process. To verify the effectiveness of our GNN labeler, we compare it with two labeling methods: 1) a CNN labeler, which directly uses the one-hot labels predicted by the CNN model, and 2) Label Propagation (Iscen et al., 2019), which propagates labels from labeled data to unlabeled data via a constructed graph. We also compare the initial GNN labeler without the progressive scheme against our complete Progressive GNN labeler. All the comparisons are based on the indegree sampler, which selects representative unlabeled samples for labeling.

The performance is reported in Table 4. Compared to the CNN labeler, the GNN labeler improves top-1 accuracy by at least 0.9%, with or without the progressive operation. A possible reason is that the graph-based labeler can refine the features and avoid introducing noisy samples. Label propagation also improves the performance to 70.8% compared to the CNN labeler, which further demonstrates the power of neighbor messages. Nevertheless, the GNN still holds a 0.4% advantage over label propagation, and the margin grows to 1.3% with the progressive operation (72.1% vs. 70.8%), a total gain of 1.8% over the CNN labeler, indicating the effectiveness of our labeler component.
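The margins quoted above follow directly from the entries of Table 4, as the short check below recomputes. The dictionary simply copies the four top-1 accuracies from the table; the helper name `margin` is hypothetical.

```python
# Top-1 accuracies (%) copied from Table 4 (indegree sampler, 10%-labeled ImageNet).
table4 = {"CNN": 70.3, "Label Propagation": 70.8, "GNN": 71.2, "Progressive GNN": 72.1}


def margin(a, b):
    """Accuracy gap between labelers a and b, rounded to one decimal place."""
    return round(table4[a] - table4[b], 1)


gnn_over_cnn = margin("GNN", "CNN")
gnn_over_lp = margin("GNN", "Label Propagation")
progressive_over_lp = margin("Progressive GNN", "Label Propagation")
progressive_over_cnn = margin("Progressive GNN", "CNN")
```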

Furthermore, Figure 5 shows that our GNN labeler boosts the performance in two ways. First, the GNN helps increase the confidence of under-confident samples. Second, the GNN helps correct wrong predictions, as presented in the second row of the figure.

5.3 Effect of progressive learning

We propose a progressive learning scheme in Section 3. Using the indegree sampler, we feed the representative samples to regularize the GNN labeler so that the labeler can provide high-quality features to the sampler in a mutually promoting fashion. This interplay between labeler and sampler is repeated for several iterations. To illustrate the gain of progressive learning, we conduct an ablation study on the number of progressive iterations. For fair comparison, the number of representative samples used to expand the labeled set is required to be the same. At the first iteration, the labeler is trained on the labeled data together with the 30% representative samples selected by the indegree sampler. Every further step adds another 10% of the unlabeled data using the indegree sampler and labeler. The results are presented in Figure 6. The accuracy rises as the number of progressive iterations increases and peaks at the third iteration, reaching 72.1% with 30%, 40%, and 50% of the unlabeled data sampled to expand the labeled set. A similar trend can be observed when using the simpler confidence thresholding sampler, but the large margin between the two curves again shows the superiority of our GNN labeler. Based on this peak, we set the number of progressive iterations to three by default in our experiments. The results verify the effectiveness of the progressive learning scheme.

     Labeler Sampler Finetuning Top-1
     ✗ 60.3
     ✗ 65.1
     ✓ 68.5
     ✓ 71.3
Table 5: Analysis of the components in our learning pipeline. ‘finetuning’ means using the consistency regularization along with the cross-entropy loss on unlabeled data and expanded labeled set. In these experiments, we use ‘class-wise’ sampler to select pseudo-labeled data and SGC model as the labeler

5.4 Benefits of learning pipeline

We further analyse the contributions of the different components in the learning pipeline. In Table 5, for the CNN baseline, we directly apply finetuning on the labeled data and unlabeled data. ‘Finetuning’ means using the consistency regularization along with the cross-entropy loss on the non-representative unlabeled data and the expanded labeled set. As shown, we improve top-1 accuracy by 4.8% over the baseline CNN by applying the ‘class-wise’ sampler to select pseudo-labeled samples and then finetuning. Equipped with an SGC labeler, a further gain of 6.2% is obtained. This result indicates that, after applying the SGC labeler, more representative samples are pseudo-labeled and added to the labeled set, which helps improve the generalization of the model.

6 Conclusion

Pseudo-labeling is a simple yet effective method for SSL. We propose a progressive representative labeling scheme to determine which part of the unlabeled data should be pseudo-labeled and how to perform the pseudo-labeling. The indegree sampler is designed to select representative samples and has been verified to perform better than the confidence-thresholding sampler and the top-k confidence-ranking sampler. A GNN labeler cooperates with the progressive indegree sampler to refine the features and improve the pseudo-labeling ability. In particular, progressively updating the GNN labeler is much more efficient than progressively updating a CNN labeler. In the experiments, we demonstrate the effectiveness of the individual components of our representative labeling, including the indegree sampler, the GNN labeler, and the progressive learning manner. In particular, our sampler based on kNN indegree makes an essential contribution to our pipeline by selecting representative unlabeled samples. In addition, our progressive representative labeling approach is orthogonal to consistency training approaches such as UDA (Xie et al., 2019a). Extensive experiments on deep SSL benchmarks show state-of-the-art performance and demonstrate the effectiveness of our method.


References

  • D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. A. Raffel (2019) Mixmatch: a holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems, pp. 5050–5060. Cited by: §1, §2.2, §4.2, Table 2.
  • O. Chapelle and A. Zien (2005) Semi-supervised classification by low density separation. In AISTATS, Vol. 2005, pp. 57–64. Cited by: §1, §3.1.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §1, 3rd item, §4.2, §4.3, Table 1.
  • M. Douze, A. Szlam, B. Hariharan, and H. Jégou (2018) Low-shot learning with large-scale diffusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3349–3358. Cited by: §2.4.
  • Y. Grandvalet and Y. Bengio (2005) Semi-supervised learning by entropy minimization. In Advances in neural information processing systems, pp. 529–536. Cited by: §2.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §4.2, §4.3.
  • A. Iscen, G. Tolias, Y. Avrithis, and O. Chum (2019) Label propagation for deep semi-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5070–5079. Cited by: §1, §2.4, §5.2.
  • J. Johnson, M. Douze, and H. Jégou (2019) Billion-scale similarity search with gpus. IEEE Transactions on Big Data. Cited by: §4.2.
  • T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1, §2.3.
  • A. Krizhevsky and G. Hinton (2010) Convolutional deep belief networks on cifar-10. Unpublished manuscript 40 (7), pp. 1–9. Cited by: 1st item, §4.2, Table 2.
  • S. Laine and T. Aila (2016) Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242. Cited by: §1, §2.2, Table 2.
  • D. Lee (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, Vol. 3, pp. 2. Cited by: §1, §2.1, Table 2.
  • Q. Li, Z. Han, and X. Wu (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2.3.
  • W. Liu and S. Chang (2009) Robust multi-class transductive learning with graphs. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 381–388. Cited by: §2.4.
  • Y. Luo, J. Zhu, M. Li, Y. Ren, and B. Zhang (2018) Smooth neighbors on teacher graphs for semi-supervised learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8896–8905. Cited by: §2.4.
  • T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence 41 (8), pp. 1979–1993. Cited by: §1, §2.2, §4.2, Table 2.
  • Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. Cited by: 2nd item, §4.2, Table 2.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.2.
  • C. Rosenberg, M. Hebert, and H. Schneiderman (2005) Semi-supervised self-training of object detection models. WACV/MOTION 2. Cited by: §1, §2.1.
  • M. Sajjadi, M. Javanmardi, and T. Tasdizen (2016) Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in neural information processing systems, pp. 1163–1171. Cited by: §1, §2.2.
  • F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2008) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80. Cited by: §3.2.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1.
  • O. Stretcu, K. Viswanathan, D. Movshovitz-Attias, E. Platanios, S. Ravi, and A. Tomkins (2019) Graph agreement models for semi-supervised learning. In Advances in Neural Information Processing Systems, pp. 8710–8720. Cited by: §2.4.
  • A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pp. 1195–1204. Cited by: §4.2, Table 2.
  • K. K. Thekumparampil, C. Wang, S. Oh, and L. Li (2018) Attention-based graph neural network for semi-supervised learning. arXiv preprint arXiv:1803.03735. Cited by: §2.4.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §2.3.
  • V. Verma, A. Lamb, J. Kannala, Y. Bengio, and D. Lopez-Paz (2019) Interpolation consistency training for semi-supervised learning. arXiv preprint arXiv:1903.03825. Cited by: Table 2.
  • F. Wu, T. Zhang, A. H. d. Souza Jr, C. Fifty, T. Yu, and K. Q. Weinberger (2019) Simplifying graph convolutional networks. arXiv preprint arXiv:1902.07153. Cited by: §1, §2.3, §3.2.
  • Q. Xie, Z. Dai, E. Hovy, M. Luong, and Q. V. Le (2019a) Unsupervised data augmentation for consistency training. Cited by: §1, §2.2, §4.2, §4.3, §4.3, Table 2, §6.
  • Q. Xie, E. Hovy, M. Luong, and Q. V. Le (2019b) Self-training with noisy student improves imagenet classification. arXiv preprint arXiv:1911.04252. Cited by: §2.1.
  • I. Z. Yalniz, H. Jégou, K. Chen, M. Paluri, and D. Mahajan (2019) Billion-scale semi-supervised learning for image classification. arXiv preprint arXiv:1905.00546. Cited by: §1, §2.1, 3rd item, §5.1.
  • Z. Yang, W. W. Cohen, and R. Salakhutdinov (2016) Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861. Cited by: §2.4.
  • S. Zagoruyko and N. Komodakis (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: §4.2, §4.3, Table 2.
  • X. Zhai, A. Oliver, A. Kolesnikov, and L. Beyer (2019) S4l: self-supervised semi-supervised learning. In Proceedings of the IEEE international conference on computer vision, pp. 1476–1485. Cited by: 3rd item.
  • D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf (2004) Learning with local and global consistency. In Advances in neural information processing systems, pp. 321–328. Cited by: §2.4.
  • X. J. Zhu (2005) Semi-supervised learning literature survey. Technical report University of Wisconsin-Madison Department of Computer Sciences. Cited by: §1, §2.
  • C. Zhuang, X. Ding, D. Murli, and D. Yamins (2019) Local label propagation for large-scale semi-supervised learning. arXiv preprint arXiv:1905.11581. Cited by: §1, §4.3.