Unbiased Teacher for Semi-Supervised Object Detection

02/18/2021 ∙ by Yen-Cheng Liu, et al. ∙ Facebook, Georgia Institute of Technology

Semi-supervised learning, i.e., training networks with both labeled and unlabeled data, has made significant progress recently. However, existing works have primarily focused on image classification tasks and neglected object detection, which requires more annotation effort. In this work, we revisit Semi-Supervised Object Detection (SS-OD) and identify the pseudo-labeling bias issue in SS-OD. To address this, we introduce Unbiased Teacher, a simple yet effective approach that jointly trains a student and a gradually progressing teacher in a mutually-beneficial manner. Together with a class-balance loss to downweight overly confident pseudo-labels, Unbiased Teacher consistently improves on state-of-the-art methods by significant margins on the COCO-standard, COCO-additional, and VOC datasets. Specifically, Unbiased Teacher achieves 6.8 absolute mAP improvement against the state-of-the-art method when using 1% labeled data on MS-COCO, and achieves around 10 mAP improvement against the supervised baseline when using only 0.5%, 1%, or 2% labeled data.


Code Repositories

unbiased-teacher — PyTorch code for the ICLR 2021 paper Unbiased Teacher for Semi-Supervised Object Detection



1 Introduction

The availability of large-scale datasets and computational resources has allowed deep neural networks to achieve strong performance on a wide variety of tasks. However, training these networks requires a large number of labeled examples that are expensive to annotate and acquire. As an alternative, Semi-Supervised Learning (SSL) methods have received growing attention 

(Sohn et al., 2020a; Berthelot et al., 2020, 2019; Laine and Aila, 2017; Tarvainen and Valpola, 2017; Sajjadi et al., 2016; Lee, 2013; Grandvalet and Bengio, 2005). Yet, these advances have primarily focused on image classification, rather than object detection where bounding box annotations require more effort.

In this work, we revisit object detection under the SSL setting (Figure 1): an object detector is trained on a single dataset in which only a small amount of labeled bounding boxes and a large amount of unlabeled data are provided, or an object detector is jointly trained with a large labeled dataset as well as a large external unlabeled dataset. A straightforward way to address Semi-Supervised Object Detection (SS-OD) is to adapt existing advanced semi-supervised image classification methods (Sohn et al., 2020a). Unfortunately, object detection has unique characteristics that interact poorly with such methods. For example, the class-imbalanced nature of object detection tasks impedes the use of pseudo-labeling: object detection exhibits both foreground-background imbalance and foreground-class imbalance (see Section 3.3). These imbalances make models trained in SSL settings prone to generating biased predictions. Pseudo-labeling, one of the most successful SSL methods in image classification (Lee, 2013; Sohn et al., 2020a), may thus be biased towards dominant and overly confident classes (background) while ignoring minor and less confident classes (foreground). As a result, adding biased pseudo-labels to the semi-supervised training aggravates the class-imbalance issue and introduces severe overfitting. As shown in Figure 2, taking a two-stage object detector as an example, heavy overfitting occurs on the foreground/background classification in the RPN and the multi-class classification in the ROIhead (but not on bounding box regression).

(a)(b)
Figure 1: (a) Illustration of semi-supervised object detection, where the model observes a set of labeled data and a set of unlabeled data in the training stage. (b) Our proposed model can efficiently leverage the unlabeled data and perform favorably against the existing semi-supervised object detection works, including CSD (Jeong et al., 2019) and STAC (Sohn et al., 2020b).

To overcome these issues, we propose a general framework, Unbiased Teacher: an approach that jointly trains a Student and a slowly progressing Teacher in a mutually-beneficial manner, in which the Teacher generates pseudo-labels to train the Student, and the Student gradually updates the Teacher via Exponential Moving Average (EMA)², while the Teacher and Student are given differently augmented input images (see Figure 3). Inside this framework, (i) we utilize the pseudo-labels as explicit supervision for both the RPN and the ROIhead and thus alleviate the overfitting issues in both components. (ii) We also prevent detrimental effects of noisy pseudo-labels by exploiting the Teacher-Student dual models (see further discussion and analysis in Section 4.2). (iii) With the use of EMA training and the Focal loss (Lin et al., 2017b), we address the pseudo-labeling bias problem caused by class imbalance and thus improve the quality of pseudo-labels. As a result, our object detector achieves significant performance improvements.

²Note that many works have leveraged EMA, e.g., ADAM optimization (Kingma and Ba, 2015), Batch Normalization (Ioffe and Szegedy, 2015), self-supervised learning (He et al., 2020; Grill et al., 2020), and SSL image classification (Tarvainen and Valpola, 2017). We, for the first time, show its effectiveness in combating class-imbalance issues and the detrimental effect of pseudo-labels for the object detection task.

We benchmark Unbiased Teacher under the SSL setting using the MS-COCO and PASCAL VOC datasets, namely COCO-standard, COCO-additional, and VOC. When using only 1% labeled data from MS-COCO (COCO-standard), Unbiased Teacher achieves 6.8 absolute mAP improvement against the state-of-the-art method, STAC (Sohn et al., 2020b). Unbiased Teacher consistently achieves around 10 absolute mAP improvement against the supervised baseline when using only 0.5%, 1%, or 2% of the labeled data.

We highlight the contributions of this paper as follows:

  • By analyzing object detectors trained with limited supervision, we identify that the class-imbalanced nature of object detection tasks impedes the effectiveness of pseudo-labeling methods on the SS-OD task.

  • We thus propose a simple yet effective method, Unbiased Teacher, to address the pseudo-labeling bias issue caused by the class imbalance existing in ground-truth labels and the overfitting issue caused by the scarcity of labeled data.

  • Our Unbiased Teacher achieves state-of-the-art performance on SS-OD across COCO-standard, COCO-additional, and VOC datasets. We also provide an ablation study to verify the effectiveness of each proposed component.

2 Related Works

Figure 2: Validation losses of our model and the model trained with labeled data only. When the labeled data is insufficient (1% and 5%), the RPN and ROIhead classifiers suffer from overfitting, while the RPN and ROIhead regressors do not. Our model significantly alleviates the overfitting issue in the classifiers and also improves the validation box-regression loss.

Semi-Supervised Learning. The majority of recent SSL methods typically consist of (1) input augmentations and perturbations, and (2) consistency regularization. They regularize the model to be invariant and robust to certain augmentations of the input, requiring the outputs given the original and augmented inputs to be consistent. For example, existing approaches apply conventional data augmentations (Berthelot et al., 2019; Laine and Aila, 2017; Sajjadi et al., 2016; Tarvainen and Valpola, 2017) to generate different transformations of semantically identical images, perturb the input images along the adversarial direction (Miyato et al., 2018; Yu et al., 2019), utilize multiple networks to generate various views of the same input data (Qiao et al., 2018), mix input data to generate augmented training data and labels (Zhang et al., 2018; Yun et al., 2019; Guo et al., 2019; Hendrycks et al., 2020), or learn augmented prototypes in feature space instead of the image space (Kuo et al., 2020). However, the complexities in the architecture design of object detectors hinder the transfer of existing semi-supervised techniques from image classification to object detection.

Semi-Supervised Object Detection.

Object detection is one of the most important computer vision tasks and has gained enormous attention (Lin et al., 2017a; He et al., 2017; Redmon and Farhadi, 2017; Liu et al., 2016). While existing works have made significant progress over the years, they have primarily focused on training object detectors with fully-labeled datasets. On the other hand, there exist several semi-supervised object detection works that focus on training an object detector with a combination of labeled, weakly-labeled, or unlabeled data. This line of work began even before the resurgence of deep learning (Rosenberg et al., 2005). Later, along with the success of deep learning, Hoffman et al. (2014) and Gao et al. (2019) trained object detectors on data with bounding-box labels for some classes and image-level class labels for other classes, enabling detection for categories that lack bounding-box annotations. Tang et al. (2016) adapted the image-level classifier of a weakly labeled category (no bounding boxes) into a detector via similarity-based knowledge transfer. Misra et al. (2015) exploited a few sparsely labeled objects and bounding boxes in some video frames and localized unknown objects in the subsequent frames.

Unlike their settings, we follow the standard SSL setting and adapt it to the object detection task, in which training uses a small set of labeled data and another set of completely unlabeled data (i.e., only images). In this setting, Jeong et al. (2019) proposed a consistency-based method, which enforces the predictions of an input image and its flipped version to be consistent. Sohn et al. (2020b) pre-trained a detector using a small amount of labeled data and generated pseudo-labels on unlabeled data to fine-tune the pre-trained detector. Their pseudo-labels are generated only once and are fixed throughout the rest of training. While these methods improve performance over models trained on labeled data alone, the imbalance issue is not considered in existing SS-OD works. In contrast, our method not only improves the pseudo-label generation model via a teacher-student mutual learning regimen (Sec. 3.2) but also addresses the crucial imbalance issue in the generated pseudo-labels (Sec. 3.3).

3 Unbiased Teacher

Problem definition. Our goal is to address object detection in a semi-supervised setting, where a set of labeled images D_s = {x^s_i, y^s_i}^{N_s}_{i=1} and a set of unlabeled images D_u = {x^u_i}^{N_u}_{i=1} are available for training; N_s and N_u are the numbers of supervised and unsupervised images. For each labeled image x^s_i, the annotation y^s_i contains the locations, sizes, and object categories of all bounding boxes.

Figure 3: Overview of Unbiased Teacher, which consists of two stages. Burn-In: we first train the object detector using the available labeled data. Teacher-Student Mutual Learning then alternates two steps. Student Learning: the fixed Teacher generates pseudo-labels to train the Student, while the Teacher and the Student are given weakly and strongly augmented inputs, respectively. Teacher Refinement: the knowledge the Student has learned is transferred to the slowly progressing Teacher via an exponential moving average (EMA) over network weights. Once the detector has converged in the Burn-In stage, we switch to the Teacher-Student Mutual Learning stage.

Overview. As shown in Figure 3, our Unbiased Teacher consists of two training stages, the Burn-In stage and the Teacher-Student Mutual Learning stage. In the Burn-In stage (Sec. 3.1), we simply train the object detector on the available supervised data to initialize it. At the beginning of the Teacher-Student Mutual Learning stage (Sec. 3.2), we duplicate the initialized detector into two models (the Teacher and the Student). This stage aims at evolving both models via a mutual learning mechanism, where the Teacher generates pseudo-labels to train the Student, and the Student transfers the knowledge it learned back to the Teacher; hence, the pseudo-labels used to train the Student itself are improved. Lastly, the class-imbalance and foreground-background imbalance problems in object detection impede semi-supervised techniques from image classification (e.g., pseudo-labeling) from being used directly on SS-OD. Therefore, in Sec. 3.3, we also discuss how the Focal loss (Lin et al., 2017b) and EMA training alleviate the imbalanced pseudo-label issue.

3.1 Burn-In

It is important to have a good initialization for both the Student and Teacher models, as we will rely on the Teacher to generate pseudo-labels to train the Student in the later stage. To do so, we first use the available supervised data to optimize our model with the supervised loss L_sup. With the supervised data D_s, the supervised loss of object detection consists of four losses: the RPN classification loss L^rpn_cls, the RPN regression loss L^rpn_reg, the ROI classification loss L^roi_cls, and the ROI regression loss L^roi_reg (Ren et al., 2015):

L_sup = Σ_i [ L^rpn_cls(x^s_i, y^s_i) + L^rpn_reg(x^s_i, y^s_i) + L^roi_cls(x^s_i, y^s_i) + L^roi_reg(x^s_i, y^s_i) ]    (1)

After Burn-In, we duplicate the trained weights θ for both the Teacher and the Student models (θ^T ← θ, θ^S ← θ). Starting from this trained detector, we further utilize the unsupervised data D_u to improve the object detector via the following proposed training regimen.

3.2 Teacher-Student Mutual Learning

Overview. To leverage the unsupervised data, we introduce the Teacher-Student Mutual Learning regimen, where the Student is optimized using the pseudo-labels generated by the Teacher, and the Teacher is updated by gradually transferring the weights of the continually learned Student model. Through this interaction, both models evolve jointly and continuously to improve detection accuracy, which also means the Teacher generates increasingly accurate and stable pseudo-labels; we identify this as one of the keys to the large performance improvement over existing work (Sohn et al., 2020b). From another perspective, the Teacher can be regarded as a temporal ensemble of the Student models at different time steps. This aligns with our observation that the accuracy of the Teacher is consistently higher than that of the Student. As noted in prior works (Tarvainen and Valpola, 2017; Xie et al., 2020), one crucial factor in improving the Teacher model is the diversity of Student models; we thus use strongly augmented images as input to the Student, but weakly augmented images as input to the Teacher to obtain reliable pseudo-labels.
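As a minimal illustration of this regimen, the following toy sketch (ours, not the released implementation) alternates the two steps, Student learning on Teacher pseudo-labels and Teacher refinement via EMA. The scalar "model," the regression target, and all hyperparameter values are illustrative stand-ins; the small supervised term mixed into the pseudo-label mimics the joint supervised-plus-unsupervised objective so that pseudo-labels can improve over time:

```python
# Toy mutual-learning loop: the "model" is a single scalar weight and the
# task is regressing toward an unknown target. All values are stand-ins.
TARGET, LR, ALPHA = 3.0, 0.1, 0.9

def pseudo_label(teacher_w, x):
    """Teacher predicts on the (weakly augmented) input to form a pseudo-label."""
    return teacher_w * x

def student_step(student_w, x, y_hat):
    """One SGD step on the Student, supervised by the Teacher's pseudo-label."""
    pred = student_w * x                 # Student sees the strongly augmented input
    grad = 2.0 * (pred - y_hat) * x      # gradient of (pred - y_hat)^2
    return student_w - LR * grad

teacher_w = student_w = 1.0              # duplicated weights after Burn-In
for _ in range(1000):
    x = 1.0                              # one "unlabeled image"
    y_hat = pseudo_label(teacher_w, x)   # Student Learning: pseudo-label step
    # Mix in a small supervised signal, mirroring the joint objective,
    # so that the pseudo-labels themselves can improve over iterations.
    y_hat = 0.9 * y_hat + 0.1 * (TARGET * x)
    student_w = student_step(student_w, x, y_hat)
    # Teacher Refinement: EMA transfer of the Student's knowledge.
    teacher_w = ALPHA * teacher_w + (1.0 - ALPHA) * student_w
```

Both weights drift jointly toward the target, with the Teacher trailing the Student smoothly; this lag is the stabilizing behavior the mutual-learning regimen relies on.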

Student Learning with Pseudo-Labeling. To address the lack of ground-truth labels for the unsupervised data, we adapt the pseudo-labeling method to generate labels for training the Student, following the principle of successful examples in the semi-supervised image classification task (Lee, 2013; Sohn et al., 2020a). As in classification-based methods, to prevent the accumulating detrimental effect of noisy pseudo-labels (i.e., confirmation bias or error accumulation), we first set a confidence threshold on predicted bounding boxes to filter out low-confidence predictions, which are more likely to be false positives.

While confidence thresholding has achieved tremendous success in image classification, it is not sufficient for object detection, because duplicated box predictions and imbalanced predictions also arise in SS-OD (we defer discussion of the imbalanced prediction issue to Sec. 3.3). To address the duplicated-box issue, we remove repetitive predictions by applying class-wise non-maximum suppression (NMS) before confidence thresholding, as performed in STAC (Sohn et al., 2020b).
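To make the two filtering steps concrete, here is a minimal dependency-free sketch (our illustration, not the released code; the IoU and confidence thresholds are placeholder defaults, not the paper's hyperparameters) of class-wise NMS followed by confidence thresholding:

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def pseudo_labels(boxes, scores, classes, iou_thr=0.5, conf_thr=0.7):
    """Class-wise NMS, then confidence thresholding.
    Returns the indices of the boxes kept as pseudo-labels."""
    keep = []
    for c in set(classes):
        # NMS within each class, highest score first.
        idxs = sorted((i for i in range(len(boxes)) if classes[i] == c),
                      key=lambda i: scores[i], reverse=True)
        while idxs:
            best, idxs = idxs[0], idxs[1:]
            keep.append(best)
            idxs = [i for i in idxs if iou(boxes[best], boxes[i]) < iou_thr]
    # Confidence thresholding on the NMS survivors.
    return sorted(i for i in keep if scores[i] >= conf_thr)

boxes   = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores  = [0.9, 0.8, 0.5]
classes = [1, 1, 1]
kept = pseudo_labels(boxes, scores, classes)
# The near-duplicate of box 0 is suppressed by NMS; box 2 falls below conf_thr.
```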

In addition, noisy pseudo-labels can affect the pseudo-label generation model (the Teacher). We therefore detach the Student from the Teacher: after obtaining the pseudo-labels ŷ^u from the Teacher, only the learnable weights of the Student model are updated via back-propagation,

L_unsup = Σ_i [ L^rpn_cls(x^u_i, ŷ^u_i) + L^roi_cls(x^u_i, ŷ^u_i) ]    (2)

Note that we do not apply unsupervised losses for bounding box regression, since naive confidence thresholding cannot filter pseudo-labels that are potentially incorrect for box regression (the confidence of a predicted bounding box indicates only the confidence of the predicted object category, not the quality of the box location (Jiang et al., 2018)).

Teacher Refinement via Exponential Moving Average. To obtain more stable pseudo-labels, we apply EMA to gradually update the Teacher model. The slowly progressing Teacher model can be regarded as the ensemble of the Student models in different training iterations.

θ^T ← α θ^T + (1 − α) θ^S    (3)

This approach has been shown to be effective in many existing works, e.g., ADAM optimization (Kingma and Ba, 2015), Batch Normalization (Ioffe and Szegedy, 2015), self-supervised learning (He et al., 2020; Grill et al., 2020), and SSL image classification (Tarvainen and Valpola, 2017), while we, for the first time, demonstrate its effectiveness also in alleviating the pseudo-labeling bias issue for SS-OD (see the next section).
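The EMA refinement of Eq. (3) takes only a few lines; in this sketch (ours) the parameter dictionaries stand in for the two detectors' weight tensors, and the coefficient value is illustrative:

```python
def ema_update(teacher, student, alpha):
    """In-place EMA: teacher <- alpha * teacher + (1 - alpha) * student.
    `teacher` and `student` are dicts of named scalar weights, standing in
    for the Teacher's and Student's parameter tensors."""
    for name in teacher:
        teacher[name] = alpha * teacher[name] + (1.0 - alpha) * student[name]
    return teacher

# After Burn-In, Teacher and Student start from identical weights.
teacher = {"w": 1.0}
student = {"w": 2.0}          # pretend gradient steps moved the Student
ema_update(teacher, student, alpha=0.9)
# Teacher moves only 10% of the way toward the Student: 0.9*1.0 + 0.1*2.0 = 1.1
```

With α close to 1, the Teacher integrates the Student's progress slowly, which is what makes its pseudo-labels more stable than the Student's raw predictions.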

3.3 Bias in Pseudo-label

Ideally, pseudo-label-based methods can address problems caused by the scarcity of labels, yet the inherent imbalance of object detection tasks and datasets impedes their effectiveness. As mentioned in (Oksuz et al., 2020), object detection exhibits foreground-background imbalance (e.g., background instances account for 70% of all training instances) and foreground-class imbalance (e.g., human instances account for 30% of all foreground training instances in MS-COCO (Lin et al., 2014)). If standard cross-entropy is applied with insufficient training data, the model is prone to predict the dominant classes. This biases predictions toward prevailing classes and leads to class imbalance in the generated pseudo-labels. Relying on these biased pseudo-labels during training makes the imbalanced-prediction issue even more severe. To address the imbalance issue in object detection, existing works have proposed several methods (Shrivastava et al., 2016; Lin et al., 2017b; Li et al., 2020).

In this work, we consider a simple yet effective method; we replace the standard cross-entropy with the multi-class Focal loss (Lin et al., 2017b) for the multi-class classification of the ROIhead classifier (i.e., L^roi_cls). Focal loss puts more loss weight on lower-confidence samples, making the model focus on hard examples instead of the easier ones that are likely from dominant classes. Although the Focal loss is not widely used in vanilla supervised object detection settings (the accuracy of YOLOv3 (Redmon and Farhadi, 2018) even drops when the Focal loss is applied), we argue that it is crucial for SS-OD due to the issue of biased pseudo-labels.
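A compact NumPy sketch of the multi-class Focal loss (ours; the γ value and the toy probabilities are illustrative, not the paper's settings) shows how the easy, confident sample's contribution is suppressed relative to cross-entropy:

```python
import numpy as np

def focal_loss(probs, target, gamma):
    """Multi-class focal loss: FL(p_t) = -(1 - p_t)^gamma * log(p_t).
    probs:  (N, C) softmax probabilities; target: (N,) class indices.
    With gamma = 0 this reduces to standard cross-entropy."""
    p_t = probs[np.arange(len(target)), target]
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))

probs = np.array([[0.95, 0.05],   # easy, confidently classified sample
                  [0.55, 0.45]])  # hard, uncertain sample
target = np.array([0, 0])
ce = focal_loss(probs, target, gamma=0.0)  # plain cross-entropy
fl = focal_loss(probs, target, gamma=2.0)  # down-weights the easy sample
```

Because (1 − p_t)^γ ≤ 1, the focal value is always at most the cross-entropy value, and the gap is largest on confident (easy, often dominant-class) samples.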

On the other hand, we also observe that EMA training can alleviate the imbalanced pseudo-labeling issue due to its conservative nature. To be more specific, with the EMA mechanism, the new Teacher model is regularized by the previous Teacher model, and this prevents the decision boundary from drastically moving toward the minority classes. In detail, the weights of the Teacher model can be represented as follows:

θ^T_t = α^t θ_0 + (1 − α) Σ_{i=1}^{t} α^{t−i} θ^S_i,  with θ^S_i = θ^S_{i−1} − γ ∇L(θ^S_{i−1})    (4)

where θ_0 is the model weight after the Burn-In stage, θ^T_t is the Teacher model weight at the t-th iteration, θ^S_i is the Student model weight at the i-th iteration, γ is the learning rate, and α is the EMA coefficient.

The regularization toward the previous Teacher model is equivalent to putting an additional small coefficient on the gradients of the Student models from previous steps. With a slowly altered decision boundary (i.e., higher stability), the pseudo-labels of the unlabeled instances are less likely to change dramatically, and this prevents the decision boundary from moving toward the minority classes (i.e., majority-class bias). Thus, the EMA-trained Teacher model is beneficial for producing more stable pseudo-labels and addressing the class-imbalance issue in SS-OD.
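The recursive EMA update of Eq. (3) unrolls into a geometric weighting over past Student weights; the following toy check (ours, with arbitrary scalar weights) confirms numerically that the two forms agree:

```python
import random

random.seed(0)
alpha, T, theta0 = 0.9, 20, 0.5
students = [random.random() for _ in range(T)]   # Student weights at steps 1..T

# Recursive EMA update, applied T times.
teacher = theta0
for s in students:
    teacher = alpha * teacher + (1 - alpha) * s

# Unrolled geometric form:
# theta_T = alpha^T * theta_0 + (1 - alpha) * sum_i alpha^(T - i) * theta_S_i
closed = alpha ** T * theta0 + (1 - alpha) * sum(
    alpha ** (T - i) * s for i, s in enumerate(students, start=1))
```

The geometric coefficients α^{T−i}(1 − α) are exactly the "additional small coefficient on the gradients of Student models in previous steps" described above: older Students contribute exponentially less.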

We note that the class-imbalance issue is crucial when using pseudo-labeling methods to address semi-supervised or other low-label object detection tasks. Other class-imbalance methods could potentially improve the performance further, but we leave this for future research.

4 Experiments

Datasets. We benchmark our proposed method using MS-COCO (Lin et al., 2014) and PASCAL VOC (Everingham et al., 2010), following existing works (Jeong et al., 2019; Sohn et al., 2020b). Specifically, there are three experimental settings: (1) COCO-standard: we randomly sample 0.5%, 1%, 2%, 5%, and 10% of the labeled training data as the labeled set and use the rest as the unlabeled training set. (2) COCO-additional: we use the standard labeled training set as the labeled set and the additional COCO2017-unlabeled data as the unlabeled set. (3) VOC: we use the VOC07 trainval set as the labeled training set and the VOC12 trainval set as the unlabeled training set. Model performance is evaluated on the VOC07 test set.

Implementation Details. For a fair comparison, we follow STAC (Sohn et al., 2020b) in using Faster-RCNN with FPN (Lin et al., 2017a) and a ResNet-50 backbone (He et al., 2016) as our object detector, where the feature weights are initialized from an ImageNet-pretrained model, as in existing works (Jeong et al., 2019; Sohn et al., 2020b). We use a fixed confidence threshold to filter pseudo-labels. For data augmentation, we apply random horizontal flip as the weak augmentation, and randomly add color jittering, grayscale, Gaussian blur, and cutout patches as the strong augmentations. Note that we do not apply any geometric augmentations, which are used in STAC. We use AP50:95 (denoted as mAP) as the evaluation metric, and performance is evaluated on the Teacher model. More training and implementation details can be found in the Appendix.
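The weak/strong augmentation split described above can be sketched as follows (our NumPy illustration, not the released pipeline; the jitter range and cutout patch size are invented, and the grayscale and Gaussian-blur steps are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def weak_aug(img):
    """Weak augmentation: random horizontal flip with probability 0.5."""
    return img[:, ::-1].copy() if rng.random() < 0.5 else img.copy()

def strong_aug(img):
    """Strong augmentation sketch: brightness jitter plus a random cutout
    patch. Magnitudes here are illustrative placeholder choices."""
    out = img.astype(np.float32) * rng.uniform(0.6, 1.4)  # color/brightness jitter
    out = np.clip(out, 0, 255)
    h, w = out.shape[:2]
    ch, cw = h // 4, w // 4                               # cutout patch size
    y, x = rng.integers(0, h - ch), rng.integers(0, w - cw)
    out[y:y + ch, x:x + cw] = 0                           # cutout
    return out.astype(img.dtype)

img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
weak, strong = weak_aug(img), strong_aug(img)
```

Note both transforms are photometric or occlusion-based; consistent with the paper, no geometric augmentation is applied, so the Teacher's pseudo-boxes remain valid for the Student's view without coordinate transformation.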

4.1 Results

COCO-standard
Method             0.5%                   1%                     2%                     5%                     10%
Supervised         6.83 ± 0.15            9.05 ± 0.16            12.70 ± 0.15           18.47 ± 0.22           23.86 ± 0.81
CSD*               7.41 ± 0.21 (+0.58)    10.51 ± 0.06 (+1.46)   13.93 ± 0.12 (+1.23)   18.63 ± 0.07 (+0.16)   22.46 ± 0.08 (−1.40)
STAC               9.78 ± 0.53 (+2.95)    13.97 ± 0.35 (+4.92)   18.25 ± 0.25 (+5.55)   24.38 ± 0.12 (+5.86)   28.64 ± 0.21 (+4.78)
Unbiased Teacher   16.94 ± 0.23 (+10.11)  20.75 ± 0.12 (+11.72)  24.30 ± 0.07 (+11.60)  28.27 ± 0.11 (+9.80)   31.50 ± 0.10 (+7.64)
Table 1: Experimental results on COCO-standard compared with CSD (Jeong et al., 2019) and STAC (Sohn et al., 2020b). Gains in parentheses are relative to the supervised baseline. *: we implement the CSD method and adapt it to the MS-COCO dataset. The results of STAC are from their released code.
COCO-additional
Method   Supervised (1×)   Supervised (3×)   CSD* (3×)   STAC (6×)   Ours (3×)
mAP      37.63             40.20             38.82       39.21       41.30
Table 2: Experimental results on COCO-additional compared with CSD (Jeong et al., 2019) and STAC (Sohn et al., 2020b). *: we implement the CSD method and adapt it to the MS-COCO dataset. Note that 1× represents 90K training iterations, and N× represents N × 90K training iterations.

COCO-standard. We first evaluate the efficacy of our Unbiased Teacher on COCO-standard (Table 1). When only 0.5% to 10% of the data are labeled, our model consistently performs favorably against the state-of-the-art methods, CSD (Jeong et al., 2019) and STAC (Sohn et al., 2020b). It is worth noting that our model trained on 1% labeled data achieves 20.75 mAP, which is even higher than STAC trained on 2% labeled data (18.25 mAP), CSD trained on 5% labeled data (18.63 mAP), and the supervised baseline trained on 5% labeled data (18.47 mAP). We also observe that, as less labeled data is available, the gap between our method and the existing approaches becomes larger. Unbiased Teacher consistently shows around 10 absolute mAP improvement over the supervised baseline when using 5% or less of the labeled data. We attribute the improvements to several crucial factors:

1) More accurate pseudo-labels. When leveraging pseudo-labeling and consistency regularization between two networks (Teacher and Student in our case), it is critical to ensure that the pseudo-labels are accurate and reliable. The existing method attempts to do this by training the pseudo-label generation model on all available labeled data and freezing it completely afterwards. In contrast, in our framework, the pseudo-label generation model (the Teacher) continues to evolve gradually and smoothly via Teacher-Student Mutual Learning. This enables the Teacher to generate more accurate pseudo-labels, as presented in Figure 4, which are properly exploited in training the Student.

Figure 4: Pseudo-label improvement in (a) accuracy, (b) mIoU, and (c) number of bounding boxes in the case of COCO-standard 1% labeled data. We measure the (a) accuracy and (b) mIoU by comparing the ground-truth boxes and the pseudo-boxes. The Burn-In limit curves indicate the pseudo-boxes obtained from the model right after the Burn-In stage without further refinement (i.e., the model trained on labeled data only). The GT curve in the number-of-boxes figure indicates the average number of bounding boxes in the ground-truth labels of MS-COCO. This result indicates that our model generates increasingly accurate pseudo-labels after the Burn-In stage (i.e., 2k iterations).

2) Class-imbalance in pseudo-labels. Our improvement also comes from the combined use of EMA and the Focal loss (Lin et al., 2017b), which address the class-imbalanced pseudo-labeling issue. As mentioned in Sec. 3.3, using more balanced pseudo-labels not only avoids the compounding biased-prediction problem but also benefits the predictions on the minority classes. Later in Sec. 4.2, we present the details of the ablation study on EMA and the Focal loss.

COCO-additional and VOC. In the previous section, we showed that Unbiased Teacher can successfully leverage very small amounts of labeled data. We now aim to verify whether a model trained on a fully supervised dataset can be further improved by using additional unlabeled data. We thus consider COCO-additional and VOC and present the results in Tables 2 and 3.

In the case of COCO-additional (Table 2), compared with the supervised-only model (3×), our model attains a 1.10 absolute AP improvement. We found a similar trend in the VOC experiment (Table 3). With VOC07 as the labeled set and VOC12 as an additional unlabeled set, STAC shows a 2.51 absolute mAP improvement with respect to the supervised model, whereas our model demonstrates a 6.56 absolute mAP improvement. To further examine whether increasing the size of unlabeled data can further improve the performance, we follow CSD and STAC in using the COCO20cls dataset³ as an additional unlabeled set. STAC shows a 3.88 absolute mAP improvement, while our model achieves an 8.21 absolute mAP improvement. These results demonstrate that our model can further improve an object detector trained on an existing labeled dataset by using more unlabeled data. Note that, following STAC, we use a more challenging metric, AP50:95, which averages the ten AP values at IoU thresholds from 0.5 to 0.95, since AP50 has been indicated as a saturated metric by prior work (Cai and Vasconcelos, 2018; Sohn et al., 2020b).

³COCO20cls is generated by keeping only the COCO images whose object categories overlap with the object categories used in PASCAL VOC07.

 

Method              Backbone          Labeled   Unlabeled           AP50             AP50:95
Supervised (Ours)   ResNet50-FPN      VOC07     None                72.63            42.13
CSD                 ResNet101-R-FCN   VOC07     VOC12               74.70 (+2.07)    -
STAC                ResNet50-FPN      VOC07     VOC12               77.45 (+4.82)    44.64 (+2.51)
Unbiased Teacher    ResNet50-FPN      VOC07     VOC12               77.37 (+4.74)    48.69 (+6.56)
CSD                 ResNet101-R-FCN   VOC07     VOC12 + COCO20cls   75.10 (+2.47)    -
STAC                ResNet50-FPN      VOC07     VOC12 + COCO20cls   79.08 (+6.45)    46.01 (+3.88)
Unbiased Teacher    ResNet50-FPN      VOC07     VOC12 + COCO20cls   78.82 (+6.19)    50.34 (+8.21)

 

Table 3: Results on VOC compared with CSD (Jeong et al., 2019) and STAC (Sohn et al., 2020b). Gains in parentheses are relative to the supervised baseline.

4.2 Ablation Study

(a)(b)
Figure 5: Ablation study on EMA and the Focal loss on COCO-standard. (a) mAP of the models using the Focal loss or cross-entropy, with EMA or standard training. (b) Class empirical distributions (i.e., histograms) of the pseudo-labels generated by each model, together with the KL divergence between the ground-truth label distribution and each pseudo-label distribution. Among these models, the model using the Focal loss and EMA training (i.e., the green curve) achieves the best mAP with the most balanced pseudo-labels.

Effect of the EMA training. We first examine the effect of EMA training by comparing our model with and without EMA. In the model without EMA, the weights of the Teacher and the Student are shared during training, which implies the Teacher model is also updated whenever the Student is optimized on unlabeled data and pseudo-labels. Note that the state-of-the-art semi-supervised classification model, FixMatch (Sohn et al., 2020a), similarly shares the model weights of the Teacher and the Student.

From Figure 5, we observe that our model with EMA is superior to the one without EMA, and this trend holds both with the Focal loss and with cross-entropy. To further analyze the diverging results, we visualize the class distribution of the pseudo-labels generated by each model and measure the KL divergence between the ground-truth label distribution and the pseudo-label distribution. With cross-entropy and standard training (i.e., without EMA), the model generates imbalanced pseudo-labels: instances of most object categories disappear from the pseudo-labels, while only instances of a few specific categories remain. We observe that EMA training alleviates this imbalanced pseudo-label issue and reduces the KL divergence. On the other hand, we also observe that the model with EMA has a smoother learning curve than the model without EMA. This is because the weights of the pseudo-label generation model (the Teacher) are detached from the optimized model (the Student); the pseudo-label generation model is thus shielded from the detrimental effect of noisy pseudo-labels (e.g., false-positive boxes), as described in Sec. 3.2.
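The divergence measurement used in this ablation can be sketched as follows (our illustration; the 4-class histograms are invented toy values, whereas the paper measures divergence over the dataset's full class distribution, and the direction of the divergence here is an assumption):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two discrete class distributions."""
    p = np.asarray(p, dtype=np.float64) + eps   # smooth to avoid log(0)
    q = np.asarray(q, dtype=np.float64) + eps
    p, q = p / p.sum(), q / q.sum()             # renormalize
    return float(np.sum(p * np.log(p / q)))

# Toy 4-class example: ground-truth label distribution vs. a balanced
# and a collapsed pseudo-label histogram.
gt        = [0.40, 0.30, 0.20, 0.10]
balanced  = [0.38, 0.32, 0.19, 0.11]   # pseudo-labels close to GT
collapsed = [0.90, 0.08, 0.01, 0.01]   # imbalanced pseudo-labels
d_bal = kl_divergence(gt, balanced)
d_col = kl_divergence(gt, collapsed)
```

A collapsed pseudo-label histogram, where most classes vanish, yields a much larger divergence from the ground-truth distribution than a balanced one, which is the quantity Figure 5(b) tracks.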

In sum, EMA training has several advantages: it 1) prevents the imbalanced pseudo-label issue caused by the imbalanced nature of low-label object detection tasks, 2) prevents the detrimental effect of noisy pseudo-labels, and 3) yields a Teacher model that can be regarded as a temporal ensemble of Student models at different time steps.

Effect of the Focal loss. In addition to EMA training, we also verify the effectiveness of the Focal loss. As presented in Figure 5, the model using the Focal loss performs favorably against the model using cross-entropy. The model trained with the Focal loss generates pseudo-labels whose distribution is more similar to the ground-truth label distribution, reducing the KL divergence and improving mAP relative to cross-entropy when EMA is not applied. When EMA training is applied, the Focal loss further reduces the KL divergence and improves mAP over cross-entropy with EMA. This confirms the effectiveness of the Focal loss in handling the class-imbalance issues in semi-supervised object detection: the reduction of the KL divergence (i.e., pseudo-label distributions that better fit the ground-truth label distributions) results in the mAP improvement.

Other ablation studies. We also ablate the effects of the Burn-In stage, pseudo-labeling thresholding, EMA rates, and unsupervised loss weights in the Appendix.

5 Conclusion

In this paper, we revisit the semi-supervised object detection task. By analyzing object detectors in low-label scenarios, we identify and address two major issues: overfitting and class imbalance. We propose Unbiased Teacher, a unified framework in which a Teacher and a Student jointly learn to improve each other. Our experiments show that the model mitigates the pseudo-labeling bias caused by class imbalance and the overfitting caused by labeled-data scarcity. Unbiased Teacher achieves strong performance across multiple semi-supervised object detection benchmarks.

6 Acknowledgments

Yen-Cheng Liu and Zsolt Kira were partly supported by DARPA’s Learning with Less Labels (LwLL) program under agreement HR0011-18-S-0044, as part of their affiliation with Georgia Tech.

References

  • D. Berthelot, N. Carlini, E. D. Cubuk, A. Kurakin, K. Sohn, H. Zhang, and C. Raffel (2020) ReMixMatch: semi-supervised learning with distribution alignment and augmentation anchoring. In Advances in Neural Information Processing Systems (NeurIPS).
  • D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. A. Raffel (2019) MixMatch: a holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5049–5059.
  • Z. Cai and N. Vasconcelos (2018) Cascade R-CNN: delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.
  • M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision (IJCV) 88 (2), pp. 303–338.
  • J. Gao, J. Wang, S. Dai, L. Li, and R. Nevatia (2019) NOTE-RCNN: noise tolerant ensemble RCNN for semi-supervised object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 9508–9517.
  • Y. Grandvalet and Y. Bengio (2005) Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems (NeurIPS), pp. 529–536.
  • J. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, et al. (2020) Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733.
  • H. Guo, Y. Mao, and R. Zhang (2019) MixUp as locally linear out-of-manifold regularization. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vol. 33, pp. 3714–3722.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9729–9738.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2961–2969.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan (2020) AugMix: a simple data processing method to improve robustness and uncertainty. In Proceedings of the International Conference on Learning Representations (ICLR).
  • J. Hoffman, S. Guadarrama, E. S. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell, and K. Saenko (2014) LSDA: large scale detection through adaptation. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3536–3544.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning (ICML).
  • J. Jeong, S. Lee, J. Kim, and N. Kwak (2019) Consistency-based semi-supervised learning for object detection. In Advances in Neural Information Processing Systems (NeurIPS).
  • B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang (2018) Acquisition of localization confidence for accurate object detection. In Proceedings of the European Conference on Computer Vision (ECCV).
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).
  • C. Kuo, C. Ma, J. Huang, and Z. Kira (2020) FeatMatch: feature-based augmentation for semi-supervised learning. In Proceedings of the European Conference on Computer Vision (ECCV).
  • S. Laine and T. Aila (2017) Temporal ensembling for semi-supervised learning. In Proceedings of the International Conference on Learning Representations (ICLR).
  • H. Law and J. Deng (2018) CornerNet: detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV).
  • D. Lee (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, Vol. 3, pp. 2.
  • Y. Li, T. Wang, B. Kang, S. Tang, C. Wang, J. Li, and J. Feng (2020) Overcoming classifier imbalance for long-tail object detection with balanced group softmax. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017a) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017b) Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV).
  • W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 21–37.
  • I. Misra, A. Shrivastava, and M. Hebert (2015) Watch and learn: semi-supervised learning for object detectors from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3593–3602.
  • T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 41 (8), pp. 1979–1993.
  • K. Oksuz, B. C. Cam, S. Kalkan, and E. Akbas (2020) Imbalance problems in object detection: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI).
  • S. Qiao, W. Shen, Z. Zhang, B. Wang, and A. Yuille (2018) Deep co-training for semi-supervised image recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 135–152.
  • J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7263–7271.
  • J. Redmon and A. Farhadi (2018) YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 91–99.
  • C. Rosenberg, M. Hebert, and H. Schneiderman (2005) Semi-supervised self-training of object detection models. In 2005 Seventh IEEE Workshops on Applications of Computer Vision, Vol. 1, pp. 29–36.
  • M. Sajjadi, M. Javanmardi, and T. Tasdizen (2016) Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1163–1171.
  • A. Shrivastava, A. Gupta, and R. Girshick (2016) Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • K. Sohn, D. Berthelot, C. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel (2020a) FixMatch: simplifying semi-supervised learning with consistency and confidence. In Advances in Neural Information Processing Systems (NeurIPS).
  • K. Sohn, Z. Zhang, C. Li, H. Zhang, C. Lee, and T. Pfister (2020b) A simple semi-supervised learning framework for object detection. arXiv preprint arXiv:2005.04757.
  • P. Tang, C. Ramaiah, R. Xu, and C. Xiong (2020) Proposal learning for semi-supervised object detection. arXiv preprint arXiv:2001.05086.
  • Y. Tang, J. Wang, B. Gao, E. Dellandréa, R. Gaizauskas, and L. Chen (2016) Large scale semi-supervised object detection using visual and semantic knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2119–2128.
  • A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1195–1204.
  • Y. Wu, A. Kirillov, F. Massa, W. Lo, and R. Girshick (2019) Detectron2. https://github.com/facebookresearch/detectron2.
  • Q. Xie, M. Luong, E. Hovy, and Q. V. Le (2020) Self-training with noisy student improves ImageNet classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • B. Yu, J. Wu, J. Ma, and Z. Zhu (2019) Tangent-normal adversarial regularization for semi-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10676–10684.
  • S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo (2019) CutMix: regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 6023–6032.
  • H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2018) MixUp: beyond empirical risk minimization. In Proceedings of the International Conference on Learning Representations (ICLR).
  • Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang (2017) Random erasing data augmentation. arXiv preprint arXiv:1708.04896.

Appendix A Appendix


a.1 EMA on imbalanced pseudo-labeling issue

To empirically examine the effectiveness of EMA against imbalance, we present the pseudo-label distribution at different training iterations in Figure 6. At the beginning of training (i.e., 30k iterations), the Teacher models both with and without EMA generate balanced pseudo-labels (the KL divergence between the ground-truth label distribution and the pseudo-label distribution is small in both cases). However, since the Student model is trained on the pseudo-labels generated by the Teacher, the model without EMA starts biasing towards specific classes. In contrast, with EMA training, the model generates less imbalanced pseudo-labels. Note that even with EMA the imbalance issue still exists; we thus apply the Focal loss to mitigate it further.

(left) 30K iterations, (right) 140K iterations
Figure 6: Ablation study on EMA at different training iterations. Both the model with EMA and the model without EMA produce pseudo-label distributions similar to the ground-truth distribution in the early stage of training. However, the model without EMA tends to generate a more biased pseudo-label distribution later in training.
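The KL-divergence measurement used above can be sketched as follows; the helper below is an illustrative, minimal version that compares the class-frequency histograms of ground-truth and pseudo-labeled boxes (the function name and smoothing constant are ours):

```python
import math
from collections import Counter

def class_kl(gt_labels, pseudo_labels, num_classes, eps=1e-8):
    """KL(gt || pseudo) between the class-frequency distributions of
    ground-truth boxes and pseudo-labeled boxes; labels are class indices.

    A small eps smooths the histograms so that a class vanishing entirely
    from the pseudo-labels (the failure mode described above) yields a
    large but finite divergence instead of log(0).
    """
    def dist(labels):
        counts = Counter(labels)
        total = len(labels)
        return [(counts.get(c, 0) + eps) / (total + eps * num_classes)
                for c in range(num_classes)]
    p, q = dist(gt_labels), dist(pseudo_labels)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

When the pseudo-labels collapse onto a few classes, this divergence grows sharply, which is the quantity plotted against the ground-truth distribution in the ablation.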

a.2 Additional Ablation Study

In addition to the ablation studies provided in the main paper, we further ablate Unbiased Teacher in the following sections.

a.2.1 Effect of Burn-In Stage

As mentioned in Section 3.1, it is crucial to have a good initialization for both the Student and Teacher models. We thus compare the model with and without the Burn-In stage in Figure 7. We observe that, with the Burn-In stage, the model derives more accurate pseudo-boxes early in training; as a result, it achieves higher accuracy in the early stage and also converges to a better final result.

(a) mAP, (b) Box Accuracy, (c) mIoU, (d) Number of Pseudo-Boxes
Figure 7: On COCO-standard labeled data, (a) Unbiased Teacher with the Burn-In stage achieves higher mAP than Unbiased Teacher without it. The Burn-In stage also yields early improvements in (b) box accuracy and (c) mIoU, and (d) allows the model to derive more pseudo-boxes.
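The two-stage schedule (supervised Burn-In, then Teacher-Student Mutual Learning) can be sketched as a training skeleton. The step functions below are caller-supplied placeholders, not our actual training code:

```python
import copy

def train(student, labeled, unlabeled, burn_in_steps, total_steps,
          sup_step, mutual_step, ema_update):
    """Two-stage schedule: supervised Burn-In on labeled data only,
    then Teacher-Student Mutual Learning on both labeled and unlabeled
    data. `sup_step`, `mutual_step`, and `ema_update` are placeholders
    for the actual optimization logic.
    """
    for _ in range(burn_in_steps):                 # Stage 1: Burn-In
        sup_step(student, labeled)
    teacher = copy.deepcopy(student)               # Teacher initialized from Student
    for _ in range(burn_in_steps, total_steps):    # Stage 2: Mutual Learning
        mutual_step(student, teacher, labeled, unlabeled)  # Teacher supplies pseudo-labels
        ema_update(teacher, student)               # Teacher follows Student via EMA
    return teacher
```

The key point the ablation verifies is the Teacher's initialization: it is copied from a Student that has already been burned in on labeled data, so the first pseudo-labels are reasonably accurate.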

a.2.2 Effect of Pseudo-Labeling Threshold

As mentioned in Section 3.3, we apply confidence thresholding to filter out low-confidence predicted bounding boxes, which are more likely to be false-positive instances. To show the effectiveness of thresholding, we first report the accuracy of predicted bounding boxes before and after pseudo-label thresholding in Figure 8.

Figure 8: Pseudo-label accuracy improvement with the use of confidence thresholding. We measure the accuracy by comparing the ground-truth labels and predicted labels before and after confidence thresholding. This result indicates that confidence thresholding can significantly improve the quality of pseudo-labels.

As expected, when varying the threshold value, the number of generated pseudo-boxes increases as the threshold decreases (Figure 9). A model using an excessively high threshold does not achieve satisfactory results, as very few pseudo-labels are generated. On the other hand, a model using a low threshold also fails to achieve favorable results, since it generates too many bounding boxes that are likely to be false-positive instances. We also observe that the model cannot even converge if the threshold is too low.
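The filtering step itself is a one-line operation; a minimal sketch follows (the box representation and function name are illustrative):

```python
def filter_pseudo_boxes(boxes, delta=0.7):
    """Keep only predicted boxes whose confidence exceeds the
    pseudo-labeling threshold delta (0.7 in our experiments).

    Each box is an illustrative (class_id, confidence, bbox) tuple.
    """
    return [b for b in boxes if b[1] > delta]
```

The ablation above amounts to sweeping `delta` and measuring both validation AP and the number of boxes this filter lets through per image.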

Figure 9: (a) Validation AP and (b) number of pseudo-labeled bounding boxes per image under various pseudo-labeling thresholds. With an excessively low threshold, the model has lower AP, as it predicts more pseudo-labeled bounding boxes than there are boxes in the ground-truth labels. Conversely, the performance of a model using an excessively high threshold drops, as it cannot generate a sufficient number of pseudo-labeled bounding boxes.

a.2.3 Effect of EMA Rates

We also evaluate the model under various EMA rates and report the mAP of the Teacher model in Figure 10. We observe that, with a smaller EMA rate, the model has lower mAP and higher variance, since the Student contributes more to the Teacher at each iteration; the Teacher is then more likely to suffer from the detrimental effect of noisy pseudo-labels. This unstable learning curve stabilizes and improves as the EMA rate increases, and the best mAP is reached at an intermediate rate. However, if the EMA rate keeps increasing, the Teacher model evolves too slowly, as it derives its next weights almost entirely from its previous weights.

Figure 10: Validation AP of the Teacher model under various EMA rates. (a) With a small EMA rate, the Teacher model has lower AP and larger variance. As the EMA rate grows, the Teacher model gradually improves over the training iterations; with a very high EMA rate, it evolves slowly but has the lowest variance. (b) A breakdown of the AP metric.

a.2.4 Effect of Unsupervised Loss Weights

To examine the effect of the unsupervised loss weight, we vary it from 1.0 to 8.0 on COCO-standard. As shown in Table 4, a low weight of 1.0 yields 29.30 AP, while the model performs best with a weight of 5.0 (32.00 AP). However, when the weight increases to 8.0, training fails to converge.

Unsupervised loss weight 1.0 2.0 4.0 5.0 6.0 8.0
AP 29.30 30.64 31.82 32.00 31.80 Cannot converge
Table 4: Ablation study of varying the unsupervised loss weight on the model trained with labeled and unlabeled data.
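The weight being ablated enters the objective as a simple weighted sum of the supervised and unsupervised losses; a trivial sketch with an illustrative function name:

```python
def total_loss(l_sup, l_unsup, lambda_u=4.0):
    """Overall objective: supervised loss plus the unsupervised
    (pseudo-label) loss scaled by lambda_u, the weight swept in Table 4.
    """
    return l_sup + lambda_u * l_unsup
```

With a small `lambda_u` the unlabeled data barely influences training, while an overly large value lets noisy pseudo-labels dominate the gradient, consistent with the divergence observed at 8.0.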

a.3 AP breakdown for COCO-standard

We present an AP breakdown for COCO-standard labeled data. As mentioned in Section 4, our model performs favorably against both STAC (Sohn et al., 2020b) and CSD (Jeong et al., 2019). This trend appears across all evaluation metrics, as shown in Figure 11, confirming that our model handles extremely low-label scenarios better than the state of the art.

Figure 11: Evaluation metric breakdown of all methods on labeled data.

a.4 Implementation and Training Details

Network and framework. Our implementation builds upon the Detectron2 framework (Wu et al., 2019). For a fair comparison, we follow the prior work (Sohn et al., 2020b) and use Faster R-CNN with FPN (Lin et al., 2017a) and a ResNet-50 backbone (He et al., 2016) as our object detection network.

Training. At the beginning of the Burn-In stage, the feature backbone is initialized with ImageNet-pretrained weights, the same as in existing works (Jeong et al., 2019; Tang et al., 2020; Sohn et al., 2020b). We use the SGD optimizer with momentum and a constant learning rate of 0.01. The batch sizes of supervised and unsupervised data are equal (see Table 5). For both COCO-standard and COCO-additional, we first run the Burn-In stage for an initial portion of the training iterations and then the Teacher-Student Mutual Learning stage for the remaining iterations.

Hyper-parameters. We use a confidence threshold of 0.7 to generate pseudo-labels in all our experiments; the unsupervised loss weight is 4 for COCO-standard and VOC and 2 for COCO-additional; and the EMA rate is 0.9996 throughout. The hyper-parameters are summarized in Table 5.

Hyper-parameter COCO-standard and VOC COCO-additional
Confidence threshold 0.7 0.7
Unsupervised loss weight 4 2
EMA rate 0.9996 0.9996
Batch size for labeled data 32 16
Batch size for unlabeled data 32 16
Learning rate 0.01 0.01
Table 5: Meanings and values of the hyper-parameters used in experiments.

Data augmentation. As shown in Table 6, we apply random horizontal flip as the weak augmentation, and randomly apply color jittering, grayscale, Gaussian blur, and cutout patches (DeVries and Taylor, 2017) as the strong augmentation. Note that we do not apply any image-level or box-level geometric augmentations, which are used in STAC (Sohn et al., 2020b). In addition, we did not aggressively search for the best augmentation hyper-parameters, so better settings may exist.

Weak Augmentation
Process Probability Parameters Description
Horizontal Flip 0.5 - None
Strong Augmentation
Process Probability Parameters Description
Color Jittering 0.8 (brightness, contrast, saturation, hue) = (0.4, 0.4, 0.4, 0.1) Brightness, contrast, and saturation factors are each chosen uniformly from [0.6, 1.4]; the hue value is chosen uniformly from [-0.1, 0.1].
Grayscale 0.2 None None
GaussianBlur 0.5 (sigma_x, sigma_y) = (0.1, 2.0) A Gaussian filter with (sigma_x, sigma_y) = (0.1, 2.0) is applied.
CutoutPattern1 0.7 scale=(0.05, 0.2), ratio=(0.3, 3.3) Randomly selects a rectangular region in the image and erases its pixels; see Zhong et al. (2017) for details.
CutoutPattern2 0.5 scale=(0.02, 0.2), ratio=(0.1, 6) Same as above.
CutoutPattern3 0.3 scale=(0.02, 0.2), ratio=(0.05, 8) Same as above.
Table 6: Details of the data augmentations. Probability indicates the probability of applying the corresponding image operation.
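As a rough illustration of the cutout patterns in Table 6, a minimal pure-Python random-erasing routine under the same scale/ratio parameterization might look like this (a sketch for exposition, not our actual augmentation code):

```python
import math
import random

def cutout(image, scale=(0.05, 0.2), ratio=(0.3, 3.3), fill=0.0, rng=random):
    """Apply one cutout patch (DeVries and Taylor, 2017) in the style of
    random erasing (Zhong et al., 2017): sample an area fraction from
    `scale` and an aspect ratio from `ratio`, then overwrite that
    rectangle with `fill`.

    `image` is a 2D list of floats; the default scale/ratio match
    CutoutPattern1 in Table 6.
    """
    h, w = len(image), len(image[0])
    area = h * w * rng.uniform(*scale)            # target patch area in pixels
    aspect = rng.uniform(*ratio)                  # patch height/width ratio
    ph = min(h, max(1, int(round(math.sqrt(area * aspect)))))
    pw = min(w, max(1, int(round(math.sqrt(area / aspect)))))
    top = rng.randint(0, h - ph)                  # random patch position
    left = rng.randint(0, w - pw)
    for r in range(top, top + ph):
        for c in range(left, left + pw):
            image[r][c] = fill
    return image
```

Applying the three patterns with their respective probabilities (0.7, 0.5, 0.3) reproduces the strong-augmentation cutout stack described in the table.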

Evaluation metrics. We follow the prior works (Law and Deng, 2018; Sohn et al., 2020b) in the evaluation metric used for all methods.