A Simple Semi-Supervised Learning Framework for Object Detection

05/10/2020 ∙ by Kihyuk Sohn, et al. ∙ Google

Semi-supervised learning (SSL) has promising potential for improving the predictive performance of machine learning models using unlabeled data. There has been remarkable progress, but the scope of demonstration in SSL has been limited to image classification tasks. In this paper, we propose STAC, a simple yet effective SSL framework for visual object detection along with a data augmentation strategy. STAC deploys highly confident pseudo labels of localized objects from an unlabeled image and updates the model by enforcing consistency via strong augmentations. We propose new experimental protocols to evaluate the performance of semi-supervised object detection using MS-COCO and demonstrate the efficacy of STAC on both MS-COCO and VOC07. On VOC07, STAC improves the AP^0.5 from 76.30 to 79.08; on MS-COCO, STAC demonstrates 2x higher data efficiency by achieving 24.38 mAP using only 5% labeled data, outperforming the supervised baseline that marks 23.86 mAP with 10% labeled data. The code is available at <https://github.com/google-research/ssl_detection/>.




1 Introduction

Semi-supervised learning (SSL) has received growing attention in recent years as it provides a means of using unlabeled data to improve model performance when large-scale annotated data is not available. A popular class of SSL methods is based on "consistency-based self-training" [27, 38, 25, 44, 54, 35, 4, 58, 3, 59, 49]. The key idea is to first generate artificial labels for the unlabeled data and then train the model to predict these artificial labels when fed the unlabeled data under semantics-preserving stochastic augmentations. The artificial label can be either a one-hot prediction (hard) or the model's predictive distribution (soft). The other pillar of SSL's success comes from advances in data augmentation. Data augmentation improves the robustness of deep neural networks [48, 24] and has been shown to be particularly effective for consistency-based self-training [58, 3, 59, 49]. Augmentation strategies span from manual combinations of basic image transformations, such as rotation, translation, flipping, or color jittering, to neural image synthesis [68, 21, 62] and policies learned by reinforcement learning [6, 69]. Lately, complex data augmentation strategies, such as RandAugment [7] or CTAugment [3], have proven powerful for SSL on image classification [58, 3, 49, 59].

Despite this remarkable progress, SSL methods have mostly been applied to image classification, whose labeling cost is relatively cheap compared to other important problems in computer vision, such as object detection. Due to its expensive labeling cost, object detection demands a higher level of label efficiency, necessitating the development of strong SSL methods. On the other hand, the majority of existing work on object detection has focused on training stronger [47, 28, 29] and faster [15, 41, 8] detectors given a sufficient amount of annotated data, such as MS-COCO [30]. The few existing works on SSL for object detection [53, 34, 43] rely on additional context, such as categorical similarities of objects or temporal consistency in video.

Figure 2: The proposed SSL framework for object detection. We generate pseudo labels (i.e., bounding boxes and their class labels) for unlabeled data using test-time inference, including non-maximum suppression (NMS) [14], of the teacher model trained with labeled data. We compute the unsupervised loss with respect to pseudo labels whose confidence scores are above a threshold τ. Strong augmentations are applied for augmentation consistency during model training. Target boxes are augmented when global geometric transformations are used.

In this work, we leverage the lessons learned from deep SSL on image classification to tackle SSL for object detection. To this end, we propose an SSL framework for object detection that combines self-training (via pseudo labels) [46, 33] and consistency regularization based on strong data augmentations [6, 7, 69]. Inspired by the framework of Noisy-Student [59], our system contains two stages of training. In the first stage, we train an object detector (e.g., Faster RCNN [41]) using all labeled data until convergence. The trained detector is then used to predict bounding boxes and class labels of localized objects in unlabeled images, as shown in Figure 2. Then, we apply confidence-based filtering to each predicted box (after non-maximum suppression) with a high threshold value to obtain pseudo labels with high precision, inspired by the design of FixMatch [49]. In the second stage, strong data augmentations are applied to each unlabeled image, and the model is trained on labeled data together with unlabeled data and the corresponding pseudo labels generated in the first stage. Encouraged by RandAugment [7] and its successful adaptation to SSL [58, 49] and object detection [69], we design our augmentation strategy for object detection, which consists of global color transformations, global or box-level [69] geometric transformations, and Cutout [10].

We test the efficacy of STAC on public datasets: MS-COCO [30] and PASCAL VOC [13]. We design new experimental protocols using the MS-COCO dataset to evaluate the semi-supervised performance of object detection. We use 1, 2, 5, and 10% of the labeled data as labeled sets and the remainder as unlabeled sets to evaluate the effectiveness of SSL methods in the low-label regime. In addition, following [37, 52], we evaluate using all labeled data as the labeled set and the additional unlabeled data provided by MS-COCO as the unlabeled set. Following [23], we use the trainval set of VOC07 as the labeled set and that of VOC12, with or without unlabeled data of MS-COCO, as unlabeled sets. Despite its simplicity, STAC brings significant gains in mAP: 18.47 to 24.38 on the 5% protocol, 23.86 to 28.64 on the 10% protocol as in Figure 1, and 42.60 to 46.01 on PASCAL VOC.

The contribution of this paper is as follows:

  1. We develop STAC, an SSL framework for object detection that seamlessly extends the class of state-of-the-art SSL methods for classification based on self-training and augmentation-driven consistency regularization.

  2. STAC is simple and introduces only two new hyperparameters: the confidence threshold τ and the unsupervised loss weight λ_u, which do not require extensive additional tuning effort.

  3. We propose new experimental protocols for SSL object detection using MS-COCO and demonstrate the efficacy of STAC on MS-COCO and PASCAL VOC in the Faster RCNN framework.

2 Related Work

Object detection is a fundamental computer vision task and has been extensively studied in the literature [14, 15, 41, 17, 28, 5, 39, 40, 31, 29]. Popular object detection frameworks include Region-based CNN (RCNN) [14, 15, 41, 17, 28], YOLO [39, 40], SSD [31], etc. [26, 51, 11]. The progress made by existing works is mainly on training a stronger or faster object detector given a sufficient amount of annotated data. There is growing interest in improving detectors using unlabeled training data through a semi-supervised object detection framework [53, 34]. Before deep learning, this idea was explored by [42]. Recently, [23] proposed a consistency-based semi-supervised object detection method, which enforces consistent predictions between an unlabeled image and its flipped counterpart. Their method requires a more sophisticated Jensen-Shannon divergence to compute the consistency regularization. Similar ideas to consistency regularization have also been studied in active learning settings for object detection [55]. [52] introduces a self-supervised proposal learning module to learn context-aware and noise-robust proposal features from unlabeled data. [37] proposes data distillation, which generates labels by ensembling predictions over multiple transformations of unlabeled data. We argue that stronger semi-supervised detectors require further investigation of unsupervised objectives and data augmentations.

Semi-supervised learning (SSL) for image classification has improved dramatically in recent years. Consistency regularization has become one of the most popular approaches among recent methods [2, 45, 25, 63, 58] and inspired [23] for object detection. The idea is to force the model to generate consistent predictions across label-preserving data augmentations. Some exemplars include Mean-Teacher [54], UDA [58], and MixMatch [4]. Another popular class of SSL is pseudo labeling [27, 2], which can be viewed as a hard version of consistency regularization: the model performs self-training by generating pseudo labels for unlabeled data and then training on randomly augmented unlabeled data to match the respective pseudo labels (i.e., being consistent in predictions on the same unlabeled example). How pseudo labels are used is critical to the success of SSL. For instance, Noisy-Student [59] demonstrates an iterative teacher-student framework that repeats the process of label assignment using a teacher model and then training a larger student model. This method achieves state-of-the-art performance on ImageNet classification by leveraging extra unlabeled images in the wild. FixMatch [49] demonstrates a simple algorithm that outperforms previous approaches and establishes state-of-the-art performance, especially in diverse small-labeled-data regimes. The key idea behind FixMatch is matching the prediction on strongly augmented unlabeled data to the pseudo label of the weakly augmented counterpart when the model confidence on the weakly augmented one is high. In light of the success of these methods, this paper exploits the effective usage of pseudo labels and pseudo boxes, as well as data augmentations, to improve object detectors.

Data augmentations are critical to improving model generalization and robustness [6, 7, 19, 69, 64, 67, 10, 12, 20], and they have gradually become a major impetus for semi-supervised learning [4, 3, 58, 49]. Finding appropriate color and geometric transformations of the input space has been shown to be critical for improving generalization [6, 19]. However, most augmentations have mainly been studied in image classification. The complexity of data augmentation for object detection is much higher than for image classification [69], since global geometric transformations of the data affect bounding box annotations. Some works have presented augmentation techniques for supervised object detection, such as MixUp [64, 65], CutMix [61], or augmentation strategy learning [69]. The recent consistency-based SSL object detection method [23] utilizes global horizontal flipping (a weak augmentation) to construct the consistency loss. To the best of our knowledge, the impact of intensive data augmentations on semi-supervised object detection has not been well studied.

3 Methodology

3.1 Background: Unsupervised Loss in SSL

Formulating an unsupervised loss that leverages unlabeled data is the key in SSL. Many advancements in SSL for classification rely on some form of consistency regularization [27, 38, 25, 44, 54, 35, 4, 58, 3, 59, 49]. Inspired by a comparison in [49], we provide a unified view of consistency regularization for image classification. For K-way classification, the consistency regularization is written as follows:

ℓ_u = Σ_{x ∈ X_u} w(x) · d(q(x), p_θ(y | A(x)))    (1)

where x is an image, q(x) maps x into a K-simplex, and w(x) maps x into a binary value. d(·, ·) measures a distance between two vectors; typical choices include the ℓ2 distance and cross entropy. Here, p_θ(y | A(x)) represents the prediction of the model parameterized by θ under a stochastic augmentation A, q(x) is the prediction target, and w(x) is the weight that determines the contribution of x to the loss. As an example, pseudo labeling [27] has the following configuration:

q(x) = ONE_HOT(argmax_y p_θ(y | x)),   w(x) = 1(max_y p_θ(y | x) ≥ τ)    (2)

We refer readers to the supplementary material for the configurations of a comprehensive list of SSL methods. State-of-the-art SSL algorithms, such as Unsupervised Data Augmentation (UDA) [58] or FixMatch [49], apply strong data augmentation A, such as RandAugment [58] or CTAugment [3], to the model prediction for improved robustness. Noisy-Student [59] applies diverse forms of stochastic noise to the model prediction, including input augmentations via RandAugment, and network augmentations via dropout [50] and stochastic depth [22]. While sharing similarities in the model prediction, these methods differ in how they generate the prediction target q, as detailed in Appendix 0.C. Besides the use of soft or hard targets, and different from (2) and many aforementioned algorithms, Noisy-Student employs a "teacher" network other than p_θ to generate the pseudo labels q. Note that the teacher network is independent of the model during training, which provides scalability and flexibility in the choice of network architectures or optimization, such as learning schedules.
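The pseudo-labeling configuration in Equation (2) can be made concrete with a minimal NumPy sketch of our own. The function name and the threshold value 0.9 are illustrative choices, not from the paper:

```python
import numpy as np

def pseudo_label_target(probs, tau=0.9):
    """Compute the hard target q(x) and binary weight w(x) of pseudo labeling.

    probs: (K,) predictive distribution p(y|x) over K classes.
    tau:   confidence threshold (illustrative value).
    """
    q = np.zeros_like(probs)
    q[np.argmax(probs)] = 1.0            # ONE_HOT(argmax_y p(y|x))
    w = float(np.max(probs) >= tau)      # keep only confident predictions
    return q, w

# A confident prediction contributes to the loss; an unconfident one is masked out.
q, w = pseudo_label_target(np.array([0.05, 0.92, 0.03]))
```

Here `w` acts exactly as the per-example mask in Equation (1): examples whose maximum predicted probability falls below τ contribute nothing to the unsupervised loss.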

3.2 STAC: SSL for Object Detection

We develop a novel SSL framework for object detection, called STAC, based on Self-Training (via pseudo labels) and Augmentation-driven Consistency regularization. First, we adopt the stage-wise training of Noisy-Student [59] for its scalability and flexibility. This involves at least two stages of training: in the first stage, we train a teacher model using all available labeled data, and in the second stage, we train STAC using both labeled and unlabeled data. Second, we use a high-valued threshold for confidence-based filtering, inspired by FixMatch [49], to control the quality of pseudo labels in object detection, which comprise bounding boxes and their class labels. The steps for training STAC are summarized as follows:

  1. Train a teacher model on available labeled images.

  2. Generate pseudo labels of unlabeled images (i.e., bounding boxes and their class labels) using the trained teacher model.

  3. Apply strong data augmentations to unlabeled images, and augment pseudo labels (i.e. bounding boxes) correspondingly when global geometric transformations are applied.

  4. Compute unsupervised loss and supervised loss to train a detector.
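The four steps above can be sketched at a high level as follows. This is a schematic of our own; `fake_teacher` and the data structures are hypothetical stand-ins, not the authors' implementation:

```python
def generate_pseudo_labels(teacher_predict, images, tau):
    """Step 2: keep only teacher detections whose confidence exceeds tau.

    teacher_predict(img) is assumed to return post-NMS detections as
    (box, label, score) tuples; the output keeps (box, label) pairs.
    """
    pseudo = []
    for img in images:
        detections = teacher_predict(img)
        pseudo.append([(box, label)
                       for (box, label, score) in detections
                       if score >= tau])
    return pseudo

# Toy usage with a fake teacher returning fixed detections for any image.
fake_teacher = lambda img: [((0, 0, 10, 10), "cat", 0.95),
                            ((5, 5, 8, 8), "dog", 0.30)]
pl = generate_pseudo_labels(fake_teacher, ["img0"], tau=0.9)
# only the confident 'cat' detection survives the threshold
```

Steps 3 and 4 then strongly augment each unlabeled image (transforming the kept boxes alongside it when the augmentation is geometric) and add the resulting unsupervised loss to the supervised loss.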

3.2.1 Training a Teacher Model.

We develop our formulation based on Faster RCNN [41], as it has been one of the most representative detection frameworks. Faster RCNN has a classifier (CLS) head and a region proposal network (RPN) head on top of a shared backbone network. Each head has two modules, namely a region classifier (e.g., a (K+1)-way classifier for the CLS head or a binary classifier for the RPN head) and a bounding box regressor. We present the supervised and unsupervised losses of the Faster RCNN for the RPN head for simplicity. The supervised loss is written as follows:

ℓ_s = Σ_j Σ_i [ ℓ_cls(p_i, q*_{i,j}) + q*_{i,j} · ℓ_reg(t_i, t*_j) ]    (3)

where j is an index of a ground-truth bounding box and i is an index of an anchor. p_i is the predictive probability of anchor i being positive, and t_i is the 4-dimensional coordinates of anchor i. q*_{i,j} is the binary label of anchor i with respect to box j, and t*_j is the ground-truth coordinates of box j. To train an RPN, q*_{i,j} needs to be determined for all (anchor, box) pairs. Note that we define a loss per box for presentation clarity, which is slightly different from that in [41].

3.2.2 Generating Pseudo Labels.

We perform test-time inference of the object detector from the teacher model to generate pseudo labels. That is, pseudo label generation involves not only the forward pass of the backbone, RPN, and CLS networks, but also post-processing such as non-maximum suppression (NMS). This is different from conventional approaches for classification, where the confidence score is computed from the raw predictive probability. We use the score of each returned bounding box after NMS, which aggregates the prediction probabilities of anchor boxes. Using box predictions after NMS has an advantage over using raw predictions (before NMS) since it removes repetitive detections. However, it does not filter out boxes at wrong locations, as visualized in Figure 2 and Figure 4(a).
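Since pseudo-label confidence is taken from post-NMS box scores, a minimal greedy NMS routine helps make the procedure concrete. The sketch below is our own simplification; real detectors apply NMS per class with framework-specific details:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes.

    Boxes are visited in decreasing score order; any remaining box that
    overlaps a kept box by more than iou_thresh is suppressed.
    """
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(int(i))
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep
```

After this step, the score of each kept box is compared against the threshold τ to decide whether it becomes a pseudo label.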

3.2.3 Unsupervised Loss.

Given an unlabeled image and a set of predicted bounding boxes with their confidence scores, we determine q̂_{i,j}, the binary label of anchor i with respect to pseudo box j, for all (anchor, box) pairs. Let t̂_j be the coordinates of pseudo box j. The unsupervised loss of STAC is written as follows:

ℓ_u = Σ_j w_j Σ_i [ ℓ_cls(p_i, q̂_{i,j}) + q̂_{i,j} · ℓ_reg(t_i, t̂_j) ]    (4)

where w_j = 1 if the confidence score of predicted box j is higher than the threshold value τ and w_j = 0 otherwise. Decomposing the loss formulation of the Faster RCNN into a sum of losses over individual boxes makes the conversion from classification (Equation (1)) to detection (Equation (4)) much more transparent. Also note that the unsupervised loss is masked per box instead of per image, which is well aligned with our intuition. Overall, the RPN is trained by jointly minimizing the two losses as follows:

ℓ = ℓ_s + λ_u · ℓ_u    (5)

where the unsupervised loss is computed on strongly augmented unlabeled images A(x). Since some transformation operations do not leave box coordinates invariant (e.g., global geometric transformations [69]), the augmentation operator A is applied to the pseudo box coordinates as well.

The loss formulation of STAC introduces two hyperparameters, τ and λ_u. In the experiments, we find that a fixed setting of τ and λ_u works well. Note that the consistency-based SSL object detection method in [23] requires a sophisticated three-stage weighting schedule for its loss weight, including temporal ramp-up and ramp-down. On the contrary, our system demonstrates effective performance with a simple constant weighting schedule, because our framework enforces consistency using a strong data augmentation strategy.
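The per-box masking of Equation (4) and the constant-weight joint objective of Equation (5) can be sketched as follows. This is our own simplification in which the per-box losses are assumed to be precomputed scalars rather than actual RPN classification and regression terms:

```python
import numpy as np

def unsup_rpn_loss(per_box_losses, confidences, tau):
    """Per-box masked unsupervised loss: sum_j w_j * loss_j,
    where w_j = 1 iff pseudo box j's confidence exceeds the threshold tau.

    per_box_losses, confidences: arrays of shape (num_pseudo_boxes,).
    """
    w = (np.asarray(confidences) >= tau).astype(float)
    return float(np.sum(w * np.asarray(per_box_losses)))

def total_loss(sup_loss, unsup_loss, lam_u):
    """Joint objective: supervised loss plus a constant-weighted unsupervised loss."""
    return sup_loss + lam_u * unsup_loss
```

The key point the sketch captures is that masking happens per pseudo box, not per image: a single low-confidence box is dropped without discarding the other boxes in the same image.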

Figure 3: Visualization of different types of augmentation strategies. From left to right: original image, color transformation, global geometric transformation, box-level geometric transformation, and Cutout.

3.2.4 Data Augmentation Strategy

The key factor in the success of consistency-based SSL methods, such as UDA [58] or FixMatch [49], is a strong data augmentation. While the augmentation strategy for supervised and semi-supervised image classification has been extensively studied [6, 7, 3, 58, 49], not much effort has yet been made for object detection. We extend the RandAugment for object detection used in [6] with the augmentation search space recently proposed by [69] (e.g., box-level transformations) along with Cutout [10]. For completeness, we describe the list of transformation operations below. Each operation has a magnitude that decides the strength of the augmentation. (The ranges of magnitudes are chosen empirically without tuning.)

  1. Global color transformation (C): Color transformation operations in [7] and the suggested ranges of magnitude for each op are used.

  2. Global geometric transformation (G): Geometric transformation operations in [7], namely, x-y translation, rotation, and x-y shear, are used. (The translation range in percentage is [, ] of image widths or heights; the rotation and shear ranges are [, ] in degrees.)

  3. Box-level transformation [69] (B): Three transformation operations from the global geometric transformations are used, but with smaller magnitude ranges. (The translation range in percentage is [, ] of image widths or heights; the rotation and shear range is [, ] in degrees.)

For each image, we apply transformation operations in sequence as follows. First, we apply one of the operations sampled from C. Second, we apply one of the operations sampled from either G or B. Finally, we apply Cutout at multiple random locations of the whole image to prevent a trivial solution when it is applied exclusively inside the bounding boxes. (The number of Cutout regions is sampled from [1, 5], and the region size is sampled from [0%, 20%] of the short edge of the image.) We visualize transformed images under the aforementioned augmentation strategies in Figure 3.
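The sampling procedure above can be sketched as follows; the operation names are placeholders for the actual RandAugment and box-level transformation operations, not the paper's implementation:

```python
import random

# Stand-in operation pools; real ops come from the RandAugment [7] and
# box-level [69] search spaces described above.
COLOR_OPS = ["brightness", "contrast", "color_jitter"]
GLOBAL_GEO_OPS = ["translate", "rotate", "shear"]
BOX_GEO_OPS = ["box_translate", "box_rotate", "box_shear"]

def sample_augmentation(rng=random):
    """Sample the per-image augmentation sequence: C, then G or B, then Cutout."""
    seq = [rng.choice(COLOR_OPS)]                     # 1) one color op (C)
    pool = rng.choice([GLOBAL_GEO_OPS, BOX_GEO_OPS])  # 2) G or B, sampled uniformly
    seq.append(rng.choice(pool))
    n_cutout = rng.randint(1, 5)                      # 3) Cutout at 1-5 locations
    seq.extend(["cutout"] * n_cutout)
    return seq

random.seed(0)
seq = sample_augmentation()
```

When the sampled geometric operation is global (from G), the same transformation must also be applied to the pseudo box coordinates, as noted in Section 3.2.3.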

4 Experiments

We test the efficacy of our proposed method on MS-COCO [30], one of the most popular public benchmarks for object detection. MS-COCO contains more than 118k labeled images and 850k labeled object instances from 80 object categories for training. In addition, there are 123k unlabeled images that can be used for semi-supervised learning. We experiment with two SSL settings. First, we randomly sample 1, 2, 5, and 10% of the labeled training data as a labeled set and use the rest of the labeled training data as an unlabeled set. For these experiments, we create 5 data folds. The 1% protocol contains approximately 1.2k labeled images randomly selected from the labeled set of MS-COCO, the 2% protocol contains an additional 1.2k images, and the 5 and 10% protocol datasets are constructed in a similar way. Second, following [52], we use the entire labeled training data as a labeled set and the additional unlabeled data as an unlabeled set. Note that the first protocol tests the efficacy of STAC when only a few labeled examples are available, while the second protocol evaluates the potential to improve a state-of-the-art object detector with unlabeled data on top of already large-scale labeled data. We report the mAP over 80 classes.

We also test on PASCAL VOC [13] following [23]. The trainval set of VOC07, containing 5,011 images from 20 object categories, is used as labeled training data, and 11,540 images from the trainval set of VOC12 are used as unlabeled training data. The detection performance is evaluated on the test set of VOC07, and mAP at IoU=0.5 (AP^0.5) is reported in addition to the MS-COCO metric.

4.1 Implementation Details

Our implementation is based on the Faster RCNN and FPN library of Tensorpack [56]. We use a ResNet-50 [18] backbone for our object detector models. Unless otherwise stated, the network weights are initialized with the ImageNet-pretrained model (http://models.tensorpack.com/FasterRCNN/ImageNet-R50-AlignPadding.npz) at all stages of training.

Since the training of the object detector is quite involved, we keep the default learning settings for all our experiments other than the learning schedule. Most of our experiments are conducted using the quick learning schedule (see Section 5.1 for the definition of the different learning schedules), with an exception for the 100% MS-COCO protocol (https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN#results). We find that the model's performance benefits significantly from longer training when more labeled training data and more complex data augmentation strategies are used. STAC introduces two new hyperparameters: τ for the confidence threshold and λ_u for the unsupervised loss. We use fixed values of τ and λ_u for all experiments except for the 100% protocol of MS-COCO, where we lower the threshold to increase the recall of pseudo labels. We refer readers to Appendix 0.A for the complete learning settings.

Methods     | 1% COCO    | 2% COCO    | 5% COCO    | 10% COCO   | 100% COCO
Supervised  | 9.05±0.16  | 12.70±0.15 | 18.47±0.22 | 23.86±0.81 | 37.63
Supervised* | 9.83±0.23  | 14.28±0.22 | 21.18±0.20 | 26.18±0.12 | 39.48
STAC        | 13.97±0.35 | 18.25±0.25 | 24.38±0.12 | 28.64±0.21 | 39.21

Table 1: Comparison in mAPs for different methods on MS-COCO. We report the mean and standard deviation over 5 data folds for the 1, 2, 5, and 10% protocols. "Supervised" refers to models trained on labeled data only, which are then used to provide pseudo labels for STAC. We train STAC with the C+{B,G}+Cutout augmentation for unlabeled data. Models with * are trained with the same augmentation strategy, but only on labeled data. See Section 4.2 for more details.
Methods              | VOC07 | VOC07 (AP^0.5)
Supervised           | 42.60 | 76.30
Supervised*          | 43.40 | 78.21
STAC (+VOC12)        | 44.64 | 77.45
STAC (+VOC12 & COCO) | 46.01 | 79.08
[23] (+VOC12 & COCO) | -     | 75.1

Table 2: Comparison in mAPs for different methods on VOC07. We report both mAPs at IoU=0.5:0.95, the standard metric for MS-COCO, as well as at IoU=0.5 (AP^0.5), since AP^0.5 is a saturated metric as pointed out by [5]. For STAC, we follow [23] to have different levels of unlabeled sources, including VOC12 and the subset of MS-COCO data with the same classes as PASCAL VOC. (Note that the number from [23] is obtained using ResNet101 with R-FCN, while all results from our implementation use ResNet50 with FPN.)

4.2 Results

Since deep semi-supervised learning of visual object detectors has not been widely studied yet, we mainly compare STAC with supervised models (i.e., models trained with labeled data only) for various experimental protocols using different data augmentation strategies. Table 1 summarizes the results. For the 1, 2, 5, and 10% protocols, we train models with the quick learning schedule and report mAPs averaged over 5 data folds along with their standard deviation. For the 100% protocol, we employ the standard schedule with longer training and report a single mAP value for each model.

Firstly, we confirm the findings of [7] with varying amounts of labeled training data: RandAugment improves the supervised learning performance of a detector by a significant margin, 2.71 mAP at the 5% protocol, 2.32 mAP at the 10% protocol, and 1.85 mAP at the 100% protocol, over the supervised baselines with the default data augmentation of resizing and horizontal flipping.

STAC further improves the performance over these stronger supervised models. We find it to be particularly effective for protocols with small labeled training data, showing a 5.91 mAP improvement at the 5% protocol and 4.78 mAP at the 10% protocol. Interestingly, STAC proves to be at least 2× more data efficient than the baseline models for both the 5% protocol (24.36 for STAC vs. 23.86 for the supervised model with 10% labeled training data) and the 10% protocol (28.56 for STAC vs. 28.63 for the supervised model with 20% labeled training data). For the 100% protocol, STAC achieves 39.21 mAP. This improves upon the baseline (37.63 mAP), but falls short of the supervised model with strong data augmentation (39.48 mAP). We hypothesize that pseudo label training benefits from a larger amount of unlabeled data relative to the size of the labeled data, and we study its effectiveness with respect to the scale of unlabeled data in Section 5.

We have a similar finding in the experiments on PASCAL VOC. In Table 2, the mAP of the supervised models increases from 42.60 to 43.40, and AP^0.5 increases from 76.30 to 78.21. Large-scale unlabeled data from VOC12 and MS-COCO further improves the performance, achieving 46.01 mAP and 79.08 AP^0.5.

5 Ablation Study

We perform an ablation study on the key components of STAC. The study analyzes the impact on detector performance of 1) different data augmentation and learning schedule strategies, 2) different sizes of unlabeled sets, 3) the hyperparameters λ_u, the coefficient for the unsupervised loss, and τ, the confidence threshold, and 4) the quality of pseudo labels and their impact on the proposed STAC.

Augmentation                | Default | C     | C+{G,B} | C+{G,B}+Cutout
5% MS-COCO (quick)          | 18.67   | 20.13 | 20.78   | 21.16
10% MS-COCO (quick)         | 24.05   | 25.26 | 25.92   | 26.34
10% MS-COCO (standard)      | 19.74   | 21.40 | 24.24   | 24.65
100% MS-COCO (standard)     | 37.42   | 37.22 | 36.39   | 36.12
100% MS-COCO (standard, )   | 37.88   | 38.91 | 38.73   | 38.57
100% MS-COCO (standard, )   | 37.63   | 39.33 | 39.75   | 39.48

Table 3: mAPs of supervised models trained with different augmentations and learning schedules. "Default" refers to the default augmentation of horizontal image flip. We test on a single fold of the 5% and 10% protocols. See Section 5.1 for more details. Bold text indicates the best number in each row.

5.1 Data Augmentation and Learning Schedule

In this section, we evaluate the performance of supervised detector models with different data augmentation strategies and learning rate schedules while varying the amount of training data. We consider different combinations of augmentation modules, including the default augmentation of horizontal image flip, color only (C), color followed by geometric or box-level transforms (C+{G,B}), and the latter followed by Cutout (C+{G,B}+Cutout). For {G,B}, we sample randomly and uniformly between the geometric and box-level transform modules for each image. We consider different learning schedules, including quick, standard, and standard× (the standard setting with a multiple of the training length). While the number of weight updates is the same, the quick schedule uses lower-resolution images as input and a smaller batch size for training.

The summary results are provided in Table 3. With a small amount of labeled training data, we observe an increasing positive impact on detector performance from more complex (thus stronger) augmentation strategies. The trend holds with the standard schedule, but we find the quick schedule beneficial in the low-labeled data regime due to its fast training and fewer overfitting issues. On the other hand, we observe that the network significantly underfits under our augmentation strategies when all labeled data is used for training. For example, with 100% labeled data, we achieve an even lower mAP with the C+{G,B}+Cutout strategy (36.12) than with the default augmentations (37.42). We find that this issue can be alleviated by longer training. Moreover, while the performance with default augmentations saturates and starts to decrease with longer training, the models with strong data augmentation start to outperform, demonstrating their effectiveness for training with large-scale labeled data.

STAC contains two key components: self-training and strong data augmentation. We also verify the importance of data augmentation in the unsupervised loss, which is in line with recent findings in SSL for image classification [49]. We evaluate the performance of STAC with the default augmentation only (horizontal flip). On a single fold of the 10% protocol, we observe a good improvement in mAP over the baseline model (from 24.05 to 26.27), but the gain is not as significant as that of STAC (29.00). On the 100% protocol, we observe a slight decrease in performance when trained with self-training only (from 37.63 to 37.57), while STAC achieves 39.21 mAP.

5.2 Size of Unlabeled Data

While the importance of large-scale labeled data for supervised learning has been broadly studied and emphasized [9, 57, 30], the importance of the scale of unlabeled data for semi-supervised learning has often been overlooked [36]. In this study, we highlight the importance of large-scale unlabeled data in the context of semi-supervised object detector learning. We experiment with 5% and 10% labeled data of MS-COCO while varying the amount of unlabeled data from 1 to 8 times (1×, 2×, 4×, 8×) that of the labeled data.

The summary results are given in Table 4. While there is still an improvement in mAP when STAC is trained with a small amount of unlabeled data, the gain is less significant compared to that of the supervised model with strong data augmentation. We observe clearly from Table 4 that STAC benefits from larger amounts of unlabeled training data. We make a similar observation in the experiments on PASCAL VOC in Table 2, where STAC trained using the trainval set of VOC12 as unlabeled data achieves 77.45 AP^0.5, which is lower than that of the supervised model with strong augmentations (78.21). On the other hand, STAC trained with a large amount of unlabeled data, by combining VOC12 and MS-COCO, achieves 79.08 AP^0.5. This analysis may explain the slightly lower mAP of STAC on the 100% protocol of MS-COCO compared to the supervised model with strong data augmentation, since the size of the available unlabeled data is roughly the same as that of the labeled data.

Unlab. Size  | Sup.  | Sup.* | 1×    | 2×    | 4×    | 8×    | Full
5% MS-COCO   | 18.67 | 21.16 | 19.81 | 20.79 | 22.09 | 23.14 | 24.49
10% MS-COCO  | 24.05 | 26.34 | 25.38 | 26.52 | 27.33 | 27.95 | 29.00

Table 4: mAPs of STAC trained with varying amounts of unlabeled data. N× indicates that the amount of unlabeled data is N times larger than that of the labeled data. We test on a single fold of the 5% and 10% protocols.

5.3 Hyperparameters τ and λ_u

We study the impact of λ_u, the regularization coefficient for the unsupervised loss, and τ, the confidence threshold. Specifically, we test STAC with different values of λ_u and τ on a single fold of the 10% protocol. The summary results are provided in Figure 4. Firstly, the best performance of STAC is obtained with a moderate λ_u and a high τ. We observe that the performance of STAC deteriorates when λ_u is too large or too small, but it improves upon the strong baseline consistently for intermediate values of λ_u. When there is no confidence-based box filtering, the gain of STAC, if any, is marginal over the strong baseline. This is because many predicted boxes are in fact inaccurate, as shown in Figure 4(a). Using a larger value of τ yields pseudo box labels with higher precision (i.e., the remaining boxes whose confidence is higher than τ are accurate), as in Figure 4(e). However, if τ becomes too large, one gets a lower recall (e.g., the bounding box on the sofa in Figure 4(c) is filtered out in Figure 4(d)). Figure 4 shows that high precision (i.e., a larger value of τ) is preferred over high recall (i.e., a smaller value of τ) on the 10% protocol.
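The precision/recall tradeoff controlled by τ can be illustrated with a toy computation over a set of scored pseudo boxes. This is our own illustrative sketch; `is_correct` stands in for a ground-truth match that would come from annotation:

```python
import numpy as np

def precision_recall_at_tau(scores, is_correct, tau):
    """Precision and recall of the pseudo boxes kept at confidence threshold tau.

    scores:     detector confidences for each predicted box.
    is_correct: whether each box accurately localizes an object (hypothetical
                ground-truth match, used only for this illustration).
    """
    scores = np.asarray(scores)
    is_correct = np.asarray(is_correct, dtype=bool)
    kept = scores >= tau
    tp = np.sum(kept & is_correct)
    precision = tp / max(1, np.sum(kept))
    recall = tp / max(1, np.sum(is_correct))
    return float(precision), float(recall)

# Raising tau keeps fewer, more reliable boxes: precision rises, recall drops.
scores, correct = [0.95, 0.8, 0.6, 0.4], [1, 1, 0, 1]
p_hi, r_hi = precision_recall_at_tau(scores, correct, tau=0.9)
p_lo, r_lo = precision_recall_at_tau(scores, correct, tau=0.5)
```

On this toy data, the higher threshold gives perfect precision at the cost of recall, mirroring the behavior described above for the 10% protocol.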

Figure 4: mAPs of STAC with different values of λ and τ. We test on a single fold of the 10% protocol. Different colors represent mAPs of models with different hyperparameter values. "Sup" represents the mAP of the supervised model with default augmentations and "Sup*" represents that with C+{G,B}+Cutout.
Figure 5: Visualization of predicted bounding boxes whose confidences are larger than τ for unlabeled data. A larger value of τ results in higher precision (e.g., the remaining boxes after thresholding detect objects accurately) but lower recall (e.g., the detected box on the sofa is removed as τ increases).

5.4 Quality of Pseudo Labels

One intriguing question is whether the semi-supervised performance of the model improves with higher-quality pseudo labels. To validate this hypothesis, we train two additional STAC models on the 10% protocol, providing pseudo labels predicted by two different supervised models trained with 5% and 100% labeled data, whose mAPs are 18.67 and 37.63, respectively. Note that STAC on the 10% protocol achieves 29.00 mAP. STAC trained with less accurate pseudo labels achieves only 24.25 mAP, while the one with more accurate pseudo labels achieves 30.30 mAP, confirming the importance of pseudo label quality.

Inspired by this observation, we increase the augmentation strength for training the teacher model in order to obtain better pseudo labels, expecting a further improvement for STAC. To this end, we train STAC using different sets of pseudo labels provided by supervised models trained with different data augmentation schemes. As in Table 5, the performance of the supervised models varies from 18.67 to 21.16 mAP with 5% labeled data and from 24.05 to 26.34 with 10% labeled data. We observe an improvement in mAP from more accurate pseudo labels on the 5% protocol, but the gain is not substantial. We also do not observe a clear correlation between the accuracy of pseudo labels and the performance of STAC on the 10% protocol. While STAC brings a significant gain in mAP using pseudo labels, our results suggest that incremental improvements in the quality of pseudo labels may not bring a significant extra benefit.

Protocol      Augmentation  Default  C      C+{G,B}  C+{G,B}+Cutout
5% MS-COCO    supervised    18.67    20.13  20.78    21.16
              STAC          24.49    25.01  24.70    25.12
10% MS-COCO   supervised    24.05    25.26  25.92    26.34
              STAC          29.00    28.97  28.41    28.81
Table 5: mAPs of supervised models and STAC tested on a single fold of the 5% and 10% protocols. We first train supervised models with different augmentation strategies (first row of each protocol), and pseudo labels generated from each supervised model are used to train STAC models (second row of each protocol) accordingly.

6 Discussion and Conclusion

While SSL for classification has made significant strides, not much effort has been put to date into detection, which demands a higher level of label-efficient training. We propose a simple (introducing only two hyperparameters that are easy to tune) and effective (roughly 2x label efficiency in the low-label regime) SSL framework for object detection by leveraging lessons from SSL methods for classification. The simplicity of our method provides flexibility for further development towards solving SSL for object detection.

The proposed framework is amenable to many variations, including using soft labels for the classification loss, detector frameworks other than Faster RCNN, and other data augmentation strategies. While STAC already demonstrates an impressive performance gain without taking the confirmation bias issue [66, 1] into account, it could be problematic when using a detection framework with a stronger form of hard negative mining [47, 29], because noisy pseudo labels can be overly used. Further investigation of learning with noisy labels, confidence calibration, and uncertainty estimation in the context of object detection are a few important topics to further enhance the performance of SSL object detection.


Acknowledgments. We thank Qizhe Xie, Ekin D. Cubuk, Sercan Arik, Minh-Thang Luong, David Berthelot, Tsung-Yi Lin, Quoc V. Le, and Samuel Schulter for their comments.


  • [1] E. Arazo, D. Ortego, P. Albert, N. E. O’Connor, and K. McGuinness (2019) Pseudo-labeling and confirmation bias in deep semi-supervised learning. arXiv preprint arXiv:1908.02983. Cited by: §6.
  • [2] P. Bachman, O. Alsharif, and D. Precup (2014) Learning with pseudo-ensembles. In NeurIPS, Cited by: §2.
  • [3] D. Berthelot, N. Carlini, E. D. Cubuk, A. Kurakin, K. Sohn, H. Zhang, and C. Raffel (2020) ReMixMatch: semi-supervised learning with distribution matching and augmentation anchoring. In ICLR, Cited by: §0.C.11, §0.C.11, §0.C.8, §1, §2, §3.1, §3.2.4.
  • [4] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. A. Raffel (2019) Mixmatch: a holistic approach to semi-supervised learning. In NeurIPS, Cited by: §0.C.10, §0.C.10, §1, §2, §2, §3.1.
  • [5] Z. Cai and N. Vasconcelos (2018) Cascade r-cnn: delving into high quality object detection. In CVPR, Cited by: §2, Table 2.
  • [6] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2019) Autoaugment: learning augmentation strategies from data. In CVPR, Cited by: §1, §1, §2, §3.2.4.
  • [7] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le (2019) RandAugment: practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719. Cited by: §0.C.7, §0.C.8, §1, §1, §2, item 1, item 2, §3.2.4, §4.2.
  • [8] J. Dai, Y. Li, K. He, and J. Sun (2016) R-fcn: object detection via region-based fully convolutional networks. In NIPS, Cited by: §1.
  • [9] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, Cited by: §5.2.
  • [10] T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552. Cited by: §0.C.7, §0.C.8, §1, §2, §3.2.4.
  • [11] X. Du, T. Lin, P. Jin, G. Ghiasi, M. Tan, Y. Cui, Q. V. Le, and X. Song (2019) SpineNet: learning scale-permuted backbone for recognition and localization. arXiv preprint arXiv:1912.05027. Cited by: §2.
  • [12] D. Dwibedi, I. Misra, and M. Hebert (2017) Cut, paste and learn: surprisingly easy synthesis for instance detection. In ICCV, Cited by: §2.
  • [13] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. IJCV 88 (2), pp. 303–338. Cited by: §1, §4.
  • [14] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, Cited by: Figure 2, §2.
  • [15] R. Girshick (2015) Fast r-cnn. In ICCV, Cited by: §1, §2.
  • [16] Y. Grandvalet and Y. Bengio (2005) Semi-supervised learning by entropy minimization. In Advances in neural information processing systems, pp. 529–536. Cited by: §0.C.2.
  • [17] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In ICCV, Cited by: §2.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §4.1.
  • [19] D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan (2019) AugMix: a simple data processing method to improve robustness and uncertainty. arXiv preprint arXiv:1912.02781. Cited by: §2.
  • [20] D. Ho, E. Liang, I. Stoica, P. Abbeel, and X. Chen (2019) Population based augmentation: efficient learning of augmentation policy schedules. arXiv preprint arXiv:1905.05393. Cited by: §2.
  • [21] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell (2017) Cycada: cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213. Cited by: §1.
  • [22] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger (2016) Deep networks with stochastic depth. In ECCV, Cited by: §3.1.
  • [23] J. Jeong, S. Lee, J. Kim, and N. Kwak (2019) Consistency-based semi-supervised learning for object detection. In NeurIPS, Cited by: §1, §2, §2, §2, §3.2.3, Table 2, §4, footnote 8.
  • [24] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In NeurIPS, Cited by: §1.
  • [25] S. Laine and T. Aila (2017) Temporal ensembling for semi-supervised learning. In ICLR, Cited by: §0.C.4, §0.C.4, §1, §2, §3.1.
  • [26] H. Law and J. Deng (2018) Cornernet: detecting objects as paired keypoints. In ECCV, Cited by: §2.
  • [27] D. Lee (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In ICML Workshops, Cited by: §0.C.3, §1, §2, §3.1.
  • [28] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In CVPR, Cited by: §1, §2.
  • [29] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In ICCV, Cited by: §1, §2, §6.
  • [30] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In ECCV, Cited by: Figure 1, §1, §1, §4, §5.2.
  • [31] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In ECCV, Cited by: §2.
  • [32] D. McClosky, E. Charniak, and M. Johnson (2006) Effective self-training for parsing. In Proceedings of the main conference on human language technology conference of the North American Chapter of the Association of Computational Linguistics, pp. 152–159. Cited by: §0.C.1.
  • [33] G. J. McLachlan (1975) Iterative reclassification procedure for constructing an asymptotically optimal rule of allocation in discriminant analysis. Journal of the American Statistical Association 70 (350), pp. 365–369. Cited by: §1.
  • [34] I. Misra, A. Shrivastava, and M. Hebert (2015) Watch and learn: semi-supervised learning for object detectors from video. In CVPR, Cited by: §1, §2.
  • [35] T. Miyato, S. Maeda, S. Ishii, and M. Koyama (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. T-PAMI. Cited by: §0.C.6, §1, §3.1.
  • [36] A. Oliver, A. Odena, C. A. Raffel, E. D. Cubuk, and I. Goodfellow (2018) Realistic evaluation of deep semi-supervised learning algorithms. In NeurIPS, Cited by: §5.2.
  • [37] I. Radosavovic, P. Dollár, R. Girshick, G. Gkioxari, and K. He (2018) Data distillation: towards omni-supervised learning. In CVPR, Cited by: §1, §2.
  • [38] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko (2015) Semi-supervised learning with ladder networks. In NeurIPS, Cited by: §1, §3.1.
  • [39] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In CVPR, Cited by: §2.
  • [40] J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In CVPR, Cited by: §2.
  • [41] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NeurIPS, Cited by: §1, §1, §2, §3.2.1.
  • [42] C. Rosenberg, M. Hebert, and H. Schneiderman (2005) Semi-supervised self-training of object detection models. In Proc IEEE Workshops on Application of Computer Vision, Cited by: §2.
  • [43] A. RoyChowdhury, P. Chakrabarty, A. Singh, S. Jin, H. Jiang, L. Cao, and E. Learned-Miller (2019) Automatic adaptation of object detectors to new domains using self-training. In CVPR, Cited by: §1.
  • [44] M. Sajjadi, M. Javanmardi, and T. Tasdizen (2016) Mutual exclusivity loss for semi-supervised deep learning. In ICIP, Cited by: §1, §3.1.
  • [45] M. Sajjadi, M. Javanmardi, and T. Tasdizen (2016) Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In NeurIPS, Cited by: §2.
  • [46] H. Scudder (1965) Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory 11 (3). Cited by: §1.
  • [47] A. Shrivastava, A. Gupta, and R. Girshick (2016) Training region-based object detectors with online hard example mining. In CVPR, Cited by: §1, §6.
  • [48] P. Y. Simard, D. Steinkraus, J. C. Platt, et al. (2003) Best practices for convolutional neural networks applied to visual document analysis.. In ICDAR, Cited by: §1.
  • [49] K. Sohn, D. Berthelot, C. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel (2020) Fixmatch: simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685. Cited by: §0.C.8, §1, §1, §2, §2, §3.1, §3.2.4, §3.2, §5.1.
  • [50] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. JMLR 15 (1), pp. 1929–1958. Cited by: §3.1.
  • [51] M. Tan, R. Pang, and Q. V. Le (2019) Efficientdet: scalable and efficient object detection. arXiv preprint arXiv:1911.09070. Cited by: §2.
  • [52] P. Tang, C. Ramaiah, R. Xu, and C. Xiong (2020) Proposal learning for semi-supervised object detection. arXiv preprint arXiv:2001.05086. Cited by: §1, §2, §4.
  • [53] Y. Tang, J. Wang, B. Gao, E. Dellandréa, R. Gaizauskas, and L. Chen (2016) Large scale semi-supervised object detection using visual and semantic knowledge transfer. In CVPR, Cited by: §1, §2.
  • [54] A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In NeurIPS, Cited by: §0.C.5, §0.C.5, §1, §2, §3.1.
  • [55] K. Wang, X. Yan, D. Zhang, L. Zhang, and L. Lin (2018) Towards human-machine cooperation: self-supervised sample mining for object detection. In CVPR, Cited by: §2.
  • [56] Y. Wu et al. (2016) Tensorpack. Note: https://github.com/tensorpack/ Cited by: §4.1.
  • [57] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba (2010) Sun database: large-scale scene recognition from abbey to zoo. In CVPR, Cited by: §5.2.
  • [58] Q. Xie, Z. Dai, E. Hovy, M. Luong, and Q. V. Le (2019) Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848. Cited by: §0.C.7, §1, §1, §2, §2, §3.1, §3.2.4.
  • [59] Q. Xie, E. Hovy, M. Luong, and Q. V. Le (2019) Self-training with noisy student improves imagenet classification. arXiv preprint arXiv:1911.04252. Cited by: §0.C.9, §1, §1, §2, §3.1, §3.2.
  • [60] D. Yarowsky (1995) Unsupervised word sense disambiguation rivaling supervised methods. In 33rd annual meeting of the association for computational linguistics, pp. 189–196. Cited by: §0.C.1.
  • [61] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo (2019) Cutmix: regularization strategy to train strong classifiers with localizable features. In ICCV, Cited by: §2.
  • [62] S. Zakharov, W. Kehl, and S. Ilic (2019) Deceptionnet: network-driven domain randomization. In ICCV, Cited by: §1.
  • [63] X. Zhai, A. Oliver, A. Kolesnikov, and L. Beyer (2019) S4L: self-supervised semi-supervised learning. In ICCV, Cited by: §2.
  • [64] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412. Cited by: §0.C.10, §0.C.11, §2.
  • [65] Z. Zhang, T. He, H. Zhang, Z. Zhang, J. Xie, and M. Li (2019) Bag of freebies for training object detection neural networks. arXiv preprint arXiv:1902.04103. Cited by: §2.
  • [66] Z. Zhang, F. Ringeval, B. Dong, E. Coutinho, E. Marchi, and B. Schüller (2016) Enhanced semi-supervised learning for multimodal emotion recognition. In ICASSP, Cited by: §6.
  • [67] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang (2017) Random erasing data augmentation. arXiv preprint arXiv:1708.04896. Cited by: §2.
  • [68] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, Cited by: §1.
  • [69] B. Zoph, E. D. Cubuk, G. Ghiasi, T. Lin, J. Shlens, and Q. V. Le (2019) Learning data augmentation strategies for object detection. arXiv preprint arXiv:1906.11172. Cited by: §1, §1, §2, item 3, §3.2.3, §3.2.4.

Appendix 0.A Learning Schedules

In this section, we provide complete descriptions of the different learning schedules used in our experiments. Note that the VOC schedule is only used for experiments on PASCAL VOC. Except where specified below, we adopt the default learning settings from https://github.com/tensorpack/tensorpack/blob/master/examples/FasterRCNN/config.py.

0.A.1 Quick

  • LR Decay:

  • Data processing: Short edge size is sampled between 500 and 800 if the long edge is less than 1024 after resizing.

  • Batch per image for training Faster RCNN head: 64

0.A.2 Standard

  • LR Decay:

  • LR Decay ():

  • LR Decay ():

  • Data processing: Short edge size is fixed to 800 if the long edge is less than 1333 after resizing.

  • Batch per image for training Faster RCNN head: 512

0.A.3 VOC

  • LR Decay:

  • Data processing: Short edge size is fixed to 600 if the long edge is less than 1000 after resizing. Image is resized to have its longer edge to be 1000 if long edge is longer than 1000.

  • Batch per image for training Faster RCNN head: 256

  • RPN Anchor Sizes:

Appendix 0.B Data Augmentation in STAC

This section provides more comprehensive results for Section 5.1 to validate the importance of data augmentation in STAC. In Table 6, we provide two rows of results, STAC (bottom) and STAC without strong data augmentation, i.e., "Self-Training" (top). We observe a significant gain in mAP in all cases, which validates the importance of data augmentation in STAC.

Methods        5% COCO       10% COCO      100%
Self-Training  21.80 ± 0.12  26.71 ± 0.27  37.57
STAC           24.38 ± 0.12  28.64 ± 0.21  39.21
Table 6: Comparison of mAPs for different SSL methods on MS-COCO. We report the mean and standard deviation over 5 data folds for the 5% and 10% protocols. "Self-Training" refers to STAC without strong data augmentation on unlabeled data; STAC is trained with strong augmentation for unlabeled data.

Appendix 0.C Extended Background: Unsupervised Loss in SSL

In this section, we extend Section 3.1 and provide unsupervised loss formulations for a comprehensive list of SSL algorithms whose loss can be represented in Equation (1). For presentation clarity, let us reiterate the definitions. Here, for generality, the artificial label is written as the prediction of a model whose parameters may differ from the current ones (e.g., fixed, past, or averaged parameters), rather than the fixed form used in Equation (1).
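As a sketch of the common structure (in our own notation, which may differ from the paper's exact symbols), each method below can be written as a weighted divergence between an artificial label and the model's prediction:

```latex
% x : an unlabeled example;  \mathcal{D}_u : the unlabeled set
% p(y \mid x; \theta) : the model's predictive distribution
% \tilde{\theta} : parameters used to produce the artificial label
% \mathrm{H}(q, p) = -\sum_y q(y) \log p(y) : cross-entropy
\ell_u \;=\; \frac{1}{|\mathcal{D}_u|} \sum_{x \in \mathcal{D}_u}
  w(x)\; d\big(q(x),\; p(y \mid x; \theta)\big)
```

where q(x) is the artificial label built from p(y | x; θ̃), w(x) is an optional confidence mask, and d is a divergence such as cross-entropy or squared error; the methods below instantiate q, w, and d differently.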

Note that the unsupervised loss formulation of STAC follows the form of Noisy Student (Section 0.C.9), which can be viewed as a combination of Self-Training (Section 0.C.1) and strong data augmentation. While we have shown that such a simple formulation of STAC brings a significant performance gain for object detection, more complicated formulations (e.g., Mean Teacher (Section 0.C.5) or MixMatch/ReMixMatch (Sections 0.C.10 and 0.C.11)) could be used in place of several design choices made for STAC. Further investigation of STAC variants is left as future work.

0.C.1 Bootstrapping (a.k.a. Self-Training) [60, 32]


where the pseudo label is produced by the existing model, which usually refers to a model trained on labeled data only until convergence.
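A standard self-training objective takes the following form (our sketch; H denotes cross-entropy and θ̃ the fixed parameters of the existing model):

```latex
\ell_u \;=\; \sum_{x \in \mathcal{D}_u}
  \mathrm{H}\!\big(\hat{q}(x),\; p(y \mid x; \theta)\big),
\qquad
\hat{q}(x) \;=\; \operatorname*{arg\,max}_{y}\; p(y \mid x; \tilde{\theta})
```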

0.C.2 Entropy Minimization [16]


Note that the gradient flows through both occurrences of the model's prediction in the loss. To our best knowledge, Entropy Minimization is the only method here that backpropagates the gradient through the artificial label (which, in this case, is the model's own prediction).
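Entropy Minimization penalizes the entropy of the model's own predictive distribution; as a sketch in our notation:

```latex
\ell_u \;=\; -\sum_{x \in \mathcal{D}_u} \sum_{y}
  p(y \mid x; \theta)\, \log p(y \mid x; \theta)
```

Both factors depend on θ, so the gradient flows through both.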


0.C.3 Pseudo Labeling [27]
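One common statement of the pseudo-labeling loss uses hard labels from the model itself with an optional confidence threshold τ (our sketch; θ̃ denotes a stop-gradient copy of the current parameters):

```latex
\ell_u \;=\; \sum_{x \in \mathcal{D}_u}
  \mathbb{1}\big(\max_{y}\, p(y \mid x; \tilde{\theta}) \ge \tau\big)\;
  \mathrm{H}\!\big(\operatorname*{arg\,max}_{y}\, p(y \mid x; \tilde{\theta}),\;
                   p(y \mid x; \theta)\big)
```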


0.C.4 Temporal Ensembling [25]


We omit the ramp-up and ramp-down of the unsupervised loss weight in our formulation since it depends on the optimization framework. See [25] for more details.
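As a sketch, Temporal Ensembling penalizes the squared distance between the current prediction and an exponential moving average of past predictions for the same example (m is the EMA decay; the original paper additionally bias-corrects the average):

```latex
\ell_u \;=\; \sum_{x \in \mathcal{D}_u}
  \big\lVert\, \bar{z}(x) - p(y \mid x; \theta) \,\big\rVert_2^2,
\qquad
\bar{z}(x) \;\leftarrow\; m\,\bar{z}(x) + (1-m)\, p(y \mid x; \theta)
```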

0.C.5 Mean Teacher [54]


We omit the ramp-up and ramp-down of the unsupervised loss weight in our formulation since it depends on the optimization framework. See [54] for more details.
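Mean Teacher replaces the ensembled prediction with the prediction of a weight-averaged teacher; a sketch in our notation, with EMA decay m:

```latex
\ell_u \;=\; \sum_{x \in \mathcal{D}_u}
  \big\lVert\, p(y \mid x; \theta_{\mathrm{EMA}}) - p(y \mid x; \theta) \,\big\rVert_2^2,
\qquad
\theta_{\mathrm{EMA}} \;\leftarrow\; m\,\theta_{\mathrm{EMA}} + (1-m)\,\theta
```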

0.C.6 Virtual Adversarial Training [35]
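Virtual Adversarial Training perturbs the input in the direction that changes the prediction most; as a sketch (θ̃ is a stop-gradient copy of the current parameters, ε the perturbation radius):

```latex
\ell_u \;=\; \sum_{x \in \mathcal{D}_u}
  \mathrm{KL}\big(p(\cdot \mid x; \tilde{\theta}) \,\big\|\, p(\cdot \mid x + r_{\mathrm{adv}}; \theta)\big),
\qquad
r_{\mathrm{adv}} \;=\; \operatorname*{arg\,max}_{\lVert r \rVert_2 \le \epsilon}
  \mathrm{KL}\big(p(\cdot \mid x; \tilde{\theta}) \,\big\|\, p(\cdot \mid x + r; \tilde{\theta})\big)
```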


0.C.7 Unsupervised Data Augmentation (UDA) [58]

UDA uses a weak augmentation, such as translation and horizontal flip, to generate a pseudo label, and a strong augmentation, such as RandAugment [7] followed by Cutout [10], for model training.
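As a sketch in our notation, with α weak and A strong augmentation and τ a confidence threshold, UDA's consistency term reads:

```latex
\ell_u \;=\; \sum_{x \in \mathcal{D}_u}
  \mathbb{1}\big(\max_{y}\, p(y \mid \alpha(x); \tilde{\theta}) \ge \tau\big)\;
  \mathrm{KL}\big(\mathrm{sharpen}\big(p(\cdot \mid \alpha(x); \tilde{\theta})\big)
            \,\big\|\, p(\cdot \mid \mathcal{A}(x); \theta)\big)
```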


0.C.8 FixMatch [49]

FixMatch also uses a weak augmentation, such as translation and horizontal flip, to generate a pseudo label, and a strong augmentation, such as RandAugment [7] or CTAugment [3] followed by Cutout [10], for model training.
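A sketch of FixMatch's unsupervised term in our notation (α weak and A strong augmentation, τ the confidence threshold, H cross-entropy):

```latex
\ell_u \;=\; \sum_{x \in \mathcal{D}_u}
  \mathbb{1}\big(\max_{y}\, q(y \mid x) \ge \tau\big)\;
  \mathrm{H}\!\big(\operatorname*{arg\,max}_{y}\, q(y \mid x),\;
                   p(y \mid \mathcal{A}(x); \theta)\big),
\qquad
q(\cdot \mid x) \;=\; p(\cdot \mid \alpha(x); \tilde{\theta})
```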


0.C.9 Noisy Student [59]


where the pseudo labels are produced by a teacher model trained on labeled data only until convergence. In addition, Noisy Student performs data balancing across classes, which is not reflected in this formulation.
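As a sketch in our notation, with q(x) the (soft or hard) prediction of the converged teacher θ̃ on the clean input and A injecting noise (strong augmentation, dropout, stochastic depth):

```latex
\ell_u \;=\; \sum_{x \in \mathcal{D}_u}
  \mathrm{H}\!\big(q(x),\; p(y \mid \mathcal{A}(x); \theta)\big),
\qquad
q(x) \text{ built from } p(\cdot \mid x; \tilde{\theta})
```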

0.C.10 MixMatch [4]

Note that MixMatch uses MixUp [64] for the unsupervised loss. It uses a weak augmentation, such as translation and horizontal flip.


where the MixUp pair consists of two unlabeled examples and the mixing coefficient is drawn from a Beta distribution. While we present MixUp only between unlabeled data for presentation clarity, one may apply MixUp between labeled data (with ground-truth labels) and unlabeled data as well [4].
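A sketch of the MixMatch unsupervised term between two unlabeled examples x1 and x2 in our notation (K weak augmentations α_k, sharpening with an implicit temperature, Beta parameter β):

```latex
\bar{q}(x) \;=\; \mathrm{sharpen}\Big(\tfrac{1}{K}\sum_{k=1}^{K}
  p\big(\cdot \mid \alpha_k(x); \tilde{\theta}\big)\Big),
\qquad
\lambda \sim \mathrm{Beta}(\beta, \beta),\quad \lambda' = \max(\lambda, 1-\lambda),

x' \;=\; \lambda' x_1 + (1-\lambda')\, x_2,\quad
q' \;=\; \lambda' \bar{q}(x_1) + (1-\lambda')\, \bar{q}(x_2),\quad
\ell_u \;=\; \big\lVert\, q' - p(y \mid x'; \theta) \,\big\rVert_2^2
```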

0.C.11 ReMixMatch [3]

Note that ReMixMatch uses MixUp [64] for the unsupervised loss. It also uses a weak augmentation, such as translation and horizontal flip, and a strong augmentation, such as CTAugment [3].


where the MixUp pair consists of two unlabeled examples and the mixing coefficient is drawn from a Beta distribution. While we present MixUp only between unlabeled data for presentation clarity, one may apply MixUp between labeled data (with ground-truth labels) and unlabeled data as well [3].
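A sketch of ReMixMatch's artificial label with distribution alignment in our notation (p_ref is the marginal class distribution of the labeled data and p̃ a running average of model predictions), which is then applied as a cross-entropy target on MixUp'd strongly augmented inputs:

```latex
q(x) \;=\; \mathrm{sharpen}\Big(\mathrm{Normalize}\big(
  p(\cdot \mid \alpha(x); \tilde{\theta}) \cdot p_{\mathrm{ref}}(y)\,/\,\tilde{p}(y)\big)\Big),
\qquad
\ell_u \;=\; \mathrm{H}\!\big(q',\; p(y \mid x'; \theta)\big)
```

where (x', q') is a MixUp of strongly augmented unlabeled examples and their guessed labels q.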