Temporal Self-Ensembling Teacher for Semi-Supervised Object Detection

07/13/2020
by   Cong Chen, et al.

This paper focuses on the problem of Semi-Supervised Object Detection (SSOD). In the field of Semi-Supervised Learning (SSL), the Knowledge Distillation (KD) framework, which consists of a teacher model and a student model, is widely used to make good use of unlabeled images. Given unlabeled images, the teacher is supposed to yield meaningful targets (e.g., well-posed logits) to regularize the training of the student. However, directly applying the KD framework to SSOD faces the following obstacles. (1) Teacher and student predictions may be very close, which limits the upper bound of the student, and (2) the data imbalance dilemma caused by the dense predictions of object detection hinders an efficient consistency regularization between the teacher and student. To solve these problems, we propose the Temporal Self-Ensembling Teacher (TSE-T) model on top of the KD framework. Differently from conventional KD methods, we devise a temporally updated teacher model. First, our teacher model ensembles its temporal predictions for unlabeled images under varying perturbations. Second, our teacher model ensembles its temporal model weights by Exponential Moving Average (EMA), which allows it to gradually learn from the student. The above self-ensembling strategies collaboratively yield better teacher predictions for unlabeled images. Finally, we use focal loss to formulate the consistency regularization to handle the data imbalance problem. Evaluated on the widely used VOC and COCO benchmarks, our method achieves 80.73 and 40.52 mAP on the VOC2007 test set and the COCO2012 test-dev set respectively, which outperforms the fully-supervised detector by 2.37 and 1.49 mAP. Our method sets the new state of the art in SSOD on the VOC benchmark, outperforming the baseline SSOD method by 1.44 mAP. The source code is publicly available at http://github.com/SYangDong/tse-t


I Introduction

Object detection is a cornerstone of computer vision, as many high-level vision tasks fundamentally rely on the ability to recognize and localize visual objects. Object detection thus touches many areas of artificial intelligence and information retrieval, such as image search, data mining, question answering, autonomous driving, medical diagnosis, robotics and many others [30, 32, 51, 5, 33]. The recent resurgence of interest in artificial neural networks, in particular deep learning, has tremendously advanced the field of generic object detection, and in the past few years a large number of detectors [42, 28, 16, 38, 31, 25, 3, 30] have sprung up to improve detection performance along aspects such as accuracy, efficiency and robustness.

Current state-of-the-art detectors [42, 28, 16, 31, 3] are learned in a fully supervised fashion, which requires large-scale labeled data with many high-quality object bounding box annotations or even segmentation masks. Gathering bounding box annotations or segmentation masks for every object instance is time-consuming and expensive, especially when the training dataset contains a huge number of images or even videos, as it requires intensive efforts of experienced human annotators or experts (e.g., medical image annotation) [9, 29, 43, 23, 30]. Furthermore, manual bounding box/segmentation mask labeling may introduce a certain amount of subjective bias. In addition, the generalizability of fully supervised detectors is limited. By contrast, there are massive amounts of unlabeled images that are acknowledged to be valuable, and the key is how to make good use of them [47, 11, 21, 36, 15, 22, 14, 7].

The time-consuming and expensive annotation of accurate bounding boxes for object instances is sidestepped in Weakly Supervised Object Detection (WSOD), which only utilizes image-level annotations that indicate the presence of instances of an object category [34, 49, 50]. WSOD methods may achieve relatively good performance when provided with a large number of image-level annotations; however, the performance is hardly competitive with their fully supervised counterparts. Considering a generic situation in object detection, we have a limited number of labeled images [9, 29] but a huge number of unlabeled images (e.g., the massive amounts of unlabeled images available from the Internet). Thus, Semi-Supervised Learning (SSL), which falls between supervised and unsupervised learning, has shown promising results in reducing the gap between the two. SSL has been extensively studied for the image classification problem [55, 54, 4, 26], while it has received significantly less attention in object detection. In this work, our main focus is SSL for object detection.

Classical deep learning based SSL methods use the maximum predictions for unlabeled images as pseudo labels to improve the classification performance of the neural networks [26]. The recently developed Knowledge Distillation (KD) [18, 35] aims at training a light-weight student model regularized by a cumbersome teacher model; it was originally used for deep model compression but was later widely used to solve SSL problems. Quite a few KD based SSL methods have been proposed [37, 45, 24, 47], and the key to these methods is to construct a well-performing teacher that obtains stable and reliable predictions given unlabeled images during training. The teacher predictions for unlabeled images can be used as targets (well-posed logits or soft labels) to regularize the training of the student toward similar predictions on the unlabeled images, yielding a well-trained student that approaches the performance of the teacher. This can be implemented by using the consistency regularization between the teacher and student predictions, which routinely takes the form of a Mean Squared Error (MSE) loss.

So far, however, only a limited number of works have applied similar ideas to the more challenging task of SSOD [21, 46]. The main challenges are as follows. (1) Conventional SSL methods are devised to handle a single prediction per image. Since object detection is a more complicated task that identifies object categories and localizes them simultaneously, a teacher model from SSL may produce predictions very close to those of the student model [24, 47]. The risk is that the performance improvement of the student from unlabeled images may be limited. (2) The predictions in object detection are rather dense during training because an object may be present at any location in an image. This issue is easy to handle in supervised object detection because a unique ground truth is provided. However, it is difficult to tackle in SSOD because the teacher predictions take the role of providing “annotations” for the student model, and these “annotations” may lead to a severe data imbalance problem. Therefore, a direct adaptation of the commonly used consistency regularization term from SSL to SSOD is hampered. A recent method named Consistency-based Semi-supervised Learning for Object Detection (CSD) [21] tackles this problem by simply thresholding out low-confidence predictions. However, this work has several limitations. (1) The teacher and student are identical, which may result in similar predictions for the unlabeled images. (2) The simple thresholding-out strategy may ignore useful information from the unlabeled images.

In this work, we aim at a simple but generic solution to alleviate the above issues and further improve SSOD. To this end, we propose the Temporal Self-Ensembling Teacher model, coined TSE-T. We show the framework of our method in Fig. 1. The TSE-T model is devised on top of the KD framework, which consists of a teacher and a student model. Both the teacher and student are initiated from a detection network pre-trained in a fully supervised manner. At training time, given unlabeled images, the teacher first obtains predictions for both the category and the localization of all possible objects present in the images. The student also obtains its detections for the unlabeled images. In the KD framework, the teacher predictions are used to regularize the training of the student, implemented as a consistency regularization between the teacher and student predictions. At inference time, the trained student model is deployed for object detection in unseen images.

Our TSE-T model is an efficient and effective training approach using unlabeled images in SSOD, which includes the following novelties.

(1) Differently from the original KD based methods [18, 36], which keep the teacher constant, our TSE-T model first ensembles its temporal predictions for the unlabeled images under various perturbations (random transformations such as horizontal flip). This type of data augmentation strategy has been shown to effectively improve prediction accuracy in SSL [24]. Second, our TSE-T model uses Exponential Moving Average (EMA) to gradually update the teacher model weights, which allows the teacher to learn from the student to enhance itself. This leads to a temporally diverse teacher model during training. The self-ensembling of temporal teacher predictions and the self-ensembling of temporal teacher model weights together increase the data and model diversity, thus yielding stable and reliable teacher predictions for unlabeled images, which consequently serve as better targets to train the student.

(2) In order to avoid hard thresholding and to preserve as much useful information from unlabeled images as possible, we employ a customized detection loss, i.e., the focal loss [28], to formulate the consistency regularization between teacher and student predictions. The focal loss up-weights hard examples and down-weights easy negatives, which dramatically simplifies and improves the usage of unlabeled images in SSOD.

We have evaluated the performance of our TSE-T model on two standard large-scale benchmarks, VOC and COCO. Both evaluations show that the TSE-T model obtains remarkable improvements over its lower bound, the fully-supervised detection model using only labeled images. Specifically, the mAP of our method achieves 80.73 and 40.52 on the VOC2007 test set and the COCO2012 test-dev set respectively, outperforming the baseline by 2.37 and 1.49 mAP. It should be noted that our method sets the state-of-the-art performance in SSOD on the VOC2007 benchmark.

We summarize our contributions as follows:

  • We formally employ the KD framework in the SSOD task, constructing a well-trained teacher to regularize the training of a student using unlabeled images.

  • We propose the TSE-T model, which simultaneously ensembles the temporal predictions and the temporal model weights. Our method produces better targets to train the student without significantly increasing computational complexity.

  • We use focal loss to solve the data imbalance problem, which simplifies the usage of unlabeled images in SSOD.

The rest of the paper is organized as follows. We review related works in Section II. We elaborate the proposed method in Section III. We describe experimental results in Section IV. Finally, in Section V we conclude our work and present several potential directions for future work.

Fig. 2: A Detailed Graphical Illustration of the Proposed TSE-T Model. Our method is built on top of the KD framework, which consists of a teacher and a student model. The TSE-T model is devised to simultaneously ensemble the temporal predictions and the model weights from the student, yielding better predictions which are then used as targets to guide the training of the student. We use orange bounding boxes to indicate our main contributions.

II Related Works

In this section, we review the topics related to our work, including object detection (Section II-A), semi-supervised learning (Section II-B), semi-supervised object detection (Section II-C) and model ensembling (Section II-D).

II-A Object detection

Object detection is one of the most active research topics in the computer vision community [48], and hundreds of well-performing detectors have been developed. In this work, we focus on generic object detection models using deep learning [30]. The pioneering work R-CNN used deep learning to extract features within the conventional object detection pipeline [12]. Fast R-CNN [13] and Faster R-CNN [42] initiated the study of typical two-stage detectors, which successfully implemented object detection with an end-to-end deep learning architecture. FPN [27] and RetinaNet [28] improved the feature representation for object detection by using a decoder-like feature pyramid. Subsequently, one-stage detectors, including DenseBox [20] and SSD [31], were developed, which generate dense predictions using fully convolutional neural networks. This type of method is much faster, one notable line of which is the YOLO family [38, 39, 40]. Mask R-CNN [16] proposed a multi-task network integrating object detection and semantic segmentation, which reformulated instance segmentation. All the above methods use the popular anchor boxes to encode object bounding boxes, leading to translation-invariant detection and relieving the difficulty of regression. Recently developed anchor-free detectors [25, 52, 53, 8] reformulate object detection as a keypoint detection and grouping task. This line of methods reduces the number of outputs while still achieving comparable performance.

II-B Semi-supervised learning

Semi-supervised learning (SSL) is an important category of machine learning techniques [55, 54, 4], which aims to train a model using a limited number of labeled data and a large amount of unlabeled data. The key to semi-supervised learning models is to obtain better predictions on the unlabeled data. Since the emergence of the knowledge distillation network [18], semi-supervised learning has been reshaped around the teacher-student architecture: a well-posed prediction for the unlabeled data becomes possible using a cumbersome teacher model, and the result is used to guide the training of a light-weight student model. The ladder network devised one clean branch and one noisy branch and learned an auxiliary mapping between the two branches for denoising [37]. Another line of work tried to stabilize the predictions obtained under stochastic data transformations and network perturbations; the objective was to minimize the difference between predictions of the same data under various stochastic transformations and perturbed networks. The temporal ensembling model improved the predictions for the unlabeled data by accumulating the predictions during training [24]. Ensembling multiple networks has been proven an effective strategy to produce more accurate predictions [44]. In the field of SSL, temporal self-ensembling during training may provide better predictions for the unlabeled images, which can be used as better targets to train the student. Instead of ensembling the predictions, the mean teacher model [47] ensembled the temporal teacher model weights with the student model weights to yield a dynamic teacher model that can learn from the student. This resulted in a temporally evolving teacher model, so the predictions for unlabeled images from the teacher and student became diverse, which benefited the training of the student.

II-C Semi-supervised object detection

A successful attempt at semi-supervised object detection using deep learning was the CSD model, which constructed a consistency regularization between the detections of an unlabeled image and its augmented counterpart. CSD was evaluated on the VOC dataset and achieved state-of-the-art performance. A very recent semi-supervised method developed a proposal-based learning scheme for two-stage object detectors [46]. For the original data and its noisy counterpart, this method used a self-supervised proposal learning module to learn consistent perceptual semantics in feature space as well as consistent predictions. The method was only evaluated on the COCO dataset and achieved results similar to omni-supervised object detection [36]. Our work bears a certain resemblance to omni-supervised object detection, which used two-stage detectors as the detection model and proposed a bounding box voting strategy to generate a hard-label teacher prediction. Compared to this work, our method is advantageous in the following aspects. (1) Our method keeps the teacher model dynamic, learning from the student by ensembling its temporal model weights with the student model weights. Such model weights ensembling ensures the diversity of the teacher model and, together with the predictions ensembling, improves the predictions for unlabeled images. (2) Instead of using hard labels as targets to train the student, our method uses soft labels to retain as much information from the unlabeled images as possible, which is more informative and efficient for training the student [35]. (3) We use focal loss to solve the data imbalance problem caused by dense predictions in SSOD.

II-D Model ensembling

Model ensembling is an efficient method to improve the performance of a machine learning system: different models hold distinct generalizability for the same data, and an ensemble of multiple models jointly enhances the generalization ability of the whole system. Such methods are widely used in various computer vision applications, for example, in large-scale image recognition [44, 17, 19, 2]. Model ensembling often employs multiple models that are either trained from different initialization states or configured with different architectures. For SSL methods that use a teacher-student framework, a drawback of applying multi-model ensembling is that the computational complexity increases dramatically for both training and inference. To address this issue, temporal self-ensembling models have been studied [41, 24, 11]. This type of method takes advantage of self-ensembling, which aggregates the model weights or sequential predictions from the latest training epochs. Involving only a single model during training naturally reduces the computational complexity and model size compared to previous ensemble methods.

III Methodology

In this section, (1) we formalize the definition of SSOD in Section III-A; (2) we motivate the choice of the detection network in Section III-B; (3) we elaborate on the proposed TSE-T model in Section III-C, specifically including the temporal predictions self-ensembling (Section III-C1), the temporal model weights self-ensembling (Section III-C2), and the loss functions (Section III-C3).

III-A Problem definition

We first summarize the workflow of our SSOD system, as detailed in Fig. 1. Based on the KD framework, our system consists of a teacher model and a student model, both of which are initiated from a typical one-stage detection model, such as RetinaNet [28], pretrained in a fully supervised fashion on a certain amount of labeled images. Given unlabeled images, the teacher model obtains its predictions of category and localization for all possible objects in the images. These predictions are used as targets to guide the training of the student model. In other words, the teacher distills knowledge from the unlabeled images to the student. It should be noted that the labeled images are also used to drive the training of the student model in a fully supervised manner; we omit this supervised part from Fig. 1 for readability. At inference time, the trained student model, which has approached the performance of the well-performing teacher model, is deployed to detect objects in unseen images.

Next, we formally define the SSOD. For a dataset $\mathcal{D}$, we use $x$ to represent an image and use a vector $y = (c_x, c_y, h, w, k)$ to denote the location of a labeled object and its associated category, where $(c_x, c_y)$, $h$ and $w$ represent the object center, height and width of the object bounding box, and $k$ is the class of the object. For an unlabeled image, we use the teacher prediction as the target (“pseudo” label), represented as $\hat{y}^t = (\hat{c}_x, \hat{c}_y, \hat{h}, \hat{w}, \hat{p})$, where $(\hat{c}_x, \hat{c}_y, \hat{h}, \hat{w})$ is the estimated location of an object and $\hat{p}$ is the classification probability.

In SSOD, we aim to promote the performance of the student model regularized by the teacher model using the unlabeled images from $\mathcal{D}$. Here, we define such regularization between the teacher and student predictions as the consistency regularization $\mathcal{L}_{cons}(\hat{y}^t, \hat{y}^s)$, where $\hat{y}^s$ is the student prediction. For the labeled images, we use $\mathcal{L}_{sup}$ to denote the standard supervised loss. The objective of SSOD is to optimize the student model to minimize both the consistency loss on the unlabeled images and the supervised loss on the labeled images.

III-B Baseline Detector

In object detection, there are two main classes of models: one-stage detectors [28, 31, 38] and two-stage detectors [42, 6, 16]. The main distinction is that the two-stage detector employs a region proposal network (RPN) to extract a large number of generic object candidates regardless of their fine-grained categories, after which non-maximum suppression (NMS) merges spatially duplicated candidates with a certain amount of overlap. However, applying NMS before fine-grained detection is harmful in our setting: SSOD attempts to preserve as many predictions as possible for each default location. The intuition is that the objective of our TSE-T model is to synchronize the learning of the student and teacher models so that the student can approach the performance of a well-educated teacher. If NMS were applied before fine-grained object detection in both the teacher and the student, many confident predictions might be suppressed, and the knowledge distilled from the teacher might deteriorate the generalizability of the student for those predictions. Therefore, in this work, we choose a one-stage detector as our detection network, namely RetinaNet [28], considering its superior performance as a challenging baseline. It should be noted that our TSE-T model retains its generality in SSOD, because it is easy to embed the TSE-T model in other typical one-stage detectors.

III-C TSE-T Model

In this section, we elaborate on the TSE-T model; a detailed graphical scheme is shown in Fig. 2.

III-C1 Temporal teacher predictions self-ensembling

The key to our TSE-T model is to obtain better teacher predictions for unlabeled images that provide a sound target, i.e., well-posed logits or soft labels, to regularize the training of the student model. In this way, the student model will be able to generalize to the unlabeled images and approximate the performance of the well-performing teacher model. We achieve this first through the temporal teacher predictions self-ensembling, which aggregates the consecutive teacher predictions from the latest training epochs. To enhance data diversity, we apply random image transformations, such as horizontal flip, to each mini-batch during training. This produces a large number of data combinations for the unlabeled images, improving the teacher predictions when the temporal teacher predictions self-ensembling is employed.

Specifically, at training time, given an unlabeled image $x$, we retrieve its previous teacher predictions from the last $N$ epochs, $\{\hat{y}^t_{e-N}, \ldots, \hat{y}^t_{e-1}\}$. The TSE-T model obtains the current teacher prediction for $x$ by averaging these predictions:

$$\hat{y}^t_e = \frac{1}{N} \sum_{i=1}^{N} \hat{y}^t_{e-i} \qquad (1)$$

which can be separately denoted as the self-ensembling of the localization and of the classification:

$$(\hat{c}_x, \hat{c}_y, \hat{h}, \hat{w})_e = \frac{1}{N} \sum_{i=1}^{N} (\hat{c}_x, \hat{c}_y, \hat{h}, \hat{w})_{e-i}, \qquad \hat{p}_e = \frac{1}{N} \sum_{i=1}^{N} \hat{p}_{e-i} \qquad (2)$$

Due to the employment of data augmentation, we need to align the predictions before ensembling. We implement this by tracing the image orientation during augmentation and flipping the predictions back to the original image reference. The hyper-parameter $N$ determines how far back we retrieve the historical predictions; we keep $N$ small owing to memory constraints.
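To make the align-and-average step concrete, the following is a minimal Python sketch of the temporal prediction self-ensembling in Eq. 1. The class name, the buffer layout, and the (cx, cy, w, h, class probabilities) tensor format are illustrative assumptions rather than the paper's implementation; it also assumes a fixed set of anchor locations per image so that stored predictions stack, and `num_epochs` plays the role of $N$ (the default of 3 is only a placeholder).

```python
from collections import deque
import torch

class PredictionEnsembler:
    """Keeps the teacher predictions of the last N epochs for each image."""

    def __init__(self, num_epochs: int = 3):
        self.num_epochs = num_epochs
        self.buffers = {}  # image_id -> deque of (pred, was_flipped)

    def update(self, image_id, pred: torch.Tensor, was_flipped: bool):
        buf = self.buffers.setdefault(image_id, deque(maxlen=self.num_epochs))
        buf.append((pred.detach(), was_flipped))

    def ensemble(self, image_id, image_width: int) -> torch.Tensor:
        """Average the stored predictions (Eq. 1) after flipping them back
        to the original image reference, as described above."""
        aligned = []
        for pred, was_flipped in self.buffers[image_id]:
            p = pred.clone()  # rows: (cx, cy, w, h, class probabilities...)
            if was_flipped:
                p[:, 0] = image_width - p[:, 0]  # undo horizontal flip on cx
            aligned.append(p)
        return torch.stack(aligned).mean(dim=0)
```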

III-C2 Temporal model weights self-ensembling

In the original temporal ensembling model [11], the teacher and student predictions for unlabeled images come from the same model, which may cause the teacher and student predictions to be very close in SSOD. Therefore, our TSE-T model proposes a temporally updated teacher model, devised as the temporal teacher model weights self-ensembling. We implement this by aggregating the historical teacher model weights and the current student model weights using a momentum term, formulated as follows:

$$\theta'_s = \alpha\, \theta'_{s-1} + (1-\alpha)\, \theta_s \qquad (3)$$

where $\theta'_s$ and $\theta'_{s-1}$ denote the teacher model weights at the current and the last training step, and $\theta_s$ denotes the student model weights updated at the current training step. $\alpha$ is a momentum parameter that balances the contributions of the previous teacher model weights and the current student model weights in updating the teacher model. We set $\alpha$ close to 1 for a smoother evolution of the teacher model [15]. Such model weights ensembling is also referred to as Exponential Moving Average (EMA), so in Fig. 2 we represent the temporal teacher model weights self-ensembling as the EMA model for simplicity. The TSE-T model is more advantageous than the teacher models in previous methods because it gradually learns from the student model to enhance itself.
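A minimal sketch of the EMA update in Eq. 3 is shown below. The helper name and the default momentum of 0.999 are our assumptions (the text only states that $\alpha$ should be close to 1); batch-norm statistics, if present, would need a similar treatment.

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               alpha: float = 0.999):
    """theta'_s <- alpha * theta'_{s-1} + (1 - alpha) * theta_s (Eq. 3)."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)
```

In practice this is called once after every SGD step on the student, so the teacher evolves smoothly alongside it.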

III-C3 Loss functions

Total loss: As described in Section III-A, we use $\mathcal{L}_{sup}$ and $\mathcal{L}_{cons}$ to represent the training objectives for labeled and unlabeled images. We integrate the two terms to formulate the total loss:

$$\mathcal{L} = \mathcal{L}_{sup} + \lambda\, \mathcal{L}_{cons} \qquad (4)$$

We use the hyper-parameter $\lambda$ to balance the contributions of the supervised and unsupervised losses, which will be detailed in Section IV.

Detection loss: The training objective of object detection is to minimize the prediction errors for both classification and localization. We therefore specify the above two objectives as:

$$\mathcal{L}_{sup} = \frac{1}{N_l}\sum \big(\mathcal{L}^{sup}_{cls} + \beta\, \mathcal{L}^{sup}_{loc}\big), \qquad \mathcal{L}_{cons} = \frac{1}{N_u}\sum \big(\mathcal{L}^{cons}_{cls} + \beta\, \mathcal{L}^{cons}_{loc}\big) \qquad (5)$$

where $\mathcal{L}_{sup}$ and $\mathcal{L}_{cons}$ denote the supervised loss for the labeled images and the consistency loss (unsupervised loss) for the unlabeled images, and $N_l$ and $N_u$ are the total numbers of predictions in the labeled and unlabeled images. The hyper-parameter $\beta$ balances the contributions of the classification and localization losses to the total loss, and will be specified in Section IV.
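The following sketch shows how Eqs. 4 and 5 could be assembled from the per-branch losses; `beta` and `lambda_u` mirror the hyper-parameters $\beta$ and $\lambda$ above, and the per-term losses are assumed to be already averaged over the $N_l$ and $N_u$ predictions.

```python
def detection_loss(cls_loss, loc_loss, beta):
    """Eq. 5: classification loss plus beta-weighted localization loss."""
    return cls_loss + beta * loc_loss

def total_loss(sup_cls, sup_loc, cons_cls, cons_loc, beta, lambda_u):
    """Eq. 4: supervised term plus lambda-weighted consistency term."""
    l_sup = detection_loss(sup_cls, sup_loc, beta)
    l_cons = detection_loss(cons_cls, cons_loc, beta)
    return l_sup + lambda_u * l_cons
```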

Classification as focal loss: In our method, we use focal loss to solve the data imbalance problem during training in SSOD. The employment of focal loss has the additional advantage of aligning the definitions of the supervised and unsupervised losses, separately formulated as:

$$\mathcal{L}^{sup}_{cls} = -(1-p^s)^{\gamma} \log(p^s), \qquad \mathcal{L}^{cons}_{cls} = -(1-p^s)^{\gamma}\, p^t \log(p^s) \qquad (6)$$

These loss functions retain the form of the standard cross-entropy loss, where $p^t$ and $p^s$ are the teacher prediction and the student prediction, respectively, for the object class.
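Below is a hedged sketch of a focal-style consistency loss between teacher and student class probabilities in the spirit of Eq. 6, assuming per-anchor sigmoid probabilities as in RetinaNet. The positive term mirrors the reconstructed Eq. 6; the symmetric negative term handles the background direction under the sigmoid formulation and is our addition, so this is one plausible instantiation rather than the paper's exact formulation.

```python
import torch

def focal_consistency(p_student: torch.Tensor, p_teacher: torch.Tensor,
                      gamma: float = 2.0, eps: float = 1e-6) -> torch.Tensor:
    """Focal-weighted cross entropy against the teacher's soft labels."""
    p_s = p_student.clamp(eps, 1.0 - eps)
    # -(1 - p_s)^gamma * p_t * log(p_s): teacher's foreground belief
    pos = -(1.0 - p_s).pow(gamma) * p_teacher * torch.log(p_s)
    # symmetric term for the teacher's background belief
    neg = -p_s.pow(gamma) * (1.0 - p_teacher) * torch.log(1.0 - p_s)
    return (pos + neg).mean()
```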

Localization as Smooth L1 loss: For the localization, we introduce the Smooth L1 loss for both the supervised localization loss and the consistency localization loss:

$$\mathcal{L}^{sup}_{loc} = \mathrm{smooth}_{L1}\big(t^s - t^{*}\big), \qquad \mathcal{L}^{cons}_{loc} = \mathrm{smooth}_{L1}\big(t^s - t^{t}\big) \qquad (7)$$

where $t^{*}$, $t^{s}$ and $t^{t}$ are the offsets from the ground truth, the student prediction and the teacher prediction to the anchor boxes, respectively. We provide an example of the computation of the offsets using the teacher prediction $\hat{y}^t$:

$$t^t_x = \frac{\hat{c}_x - a_x}{a_w}, \quad t^t_y = \frac{\hat{c}_y - a_y}{a_h}, \quad t^t_w = \log\frac{\hat{w}}{a_w}, \quad t^t_h = \log\frac{\hat{h}}{a_h} \qquad (8)$$

where $(a_x, a_y, a_w, a_h)$ is the localization of one anchor box and $(\hat{c}_x, \hat{c}_y, \hat{w}, \hat{h})$ is the normalized teacher prediction of one object. We omit the summation over all possible detections here for readability.

In our SSOD setting, we use a one-stage detection network and avoid employing NMS before model ensembling. For the unlabeled images, this encourages the emergence of a large number of pairwise teacher-student predictions that remain consistent with high confidence. It should be noted that the accumulation of these well-fitted predictions is likely to suppress the inconsistent prediction pairs, which form a minority of the training examples but should be the main contributors to the loss. We thus revise the Smooth L1 function to alleviate this effect, formulated as follows; we set the threshold $\sigma$ by validation.

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\, x^2 / \sigma, & |x| < \sigma \\ |x| - 0.5\,\sigma, & \text{otherwise} \end{cases} \qquad (9)$$
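A sketch of the offset encoding in Eq. 8 and the revised Smooth L1 in Eq. 9 might look as follows; `sigma` stands in for the validation-chosen threshold, and the exact piecewise form is our reconstruction of the revision described above.

```python
import torch

def encode_offsets(boxes: torch.Tensor, anchors: torch.Tensor) -> torch.Tensor:
    """Eq. 8: boxes and anchors are (N, 4) tensors in (cx, cy, w, h) format."""
    t_xy = (boxes[:, :2] - anchors[:, :2]) / anchors[:, 2:]
    t_wh = torch.log(boxes[:, 2:] / anchors[:, 2:])
    return torch.cat([t_xy, t_wh], dim=1)

def smooth_l1(x: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Eq. 9: quadratic below sigma, linear above; a small sigma keeps the
    near-consistent pairs from dominating the localization loss."""
    abs_x = x.abs()
    return torch.where(abs_x < sigma,
                       0.5 * x * x / sigma,
                       abs_x - 0.5 * sigma).mean()
```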

In Algorithm 1, we summarize the whole training procedure of our method in pseudocode. We omit the fully-supervised pre-training step using labeled images; we directly start from a converged detection model and train the student model using the proposed TSE-T model.

   # Inputs: student model with weights θ; teacher (EMA) model with weights θ';
   # training dataset D; ground-truth labels for the labeled images in D
  for epoch e in epochs do
      # apply random data augmentation (e.g., horizontal flip)
     for mini-batch x in D do
        if x is unlabeled then
           # retrieve and align the teacher predictions for x from the last N epochs
           # compute the ensembled teacher prediction via the temporal teacher
           # predictions self-ensembling in Eq. 1
        end if
        # compute the total loss using Eq. 4
        # update the student model weights by standard SGD
        Update: θ
        # update the teacher model weights by the temporal teacher model
        # weights self-ensembling (EMA) in Eq. 3
        Update: θ'
        # store the current teacher prediction of x for the ensembling in the next epoch
     end for
  end for
Algorithm 1 Pseudocode of the TSE-T model

IV Experiments

In this section, we conduct experiments to evaluate the performance of the proposed TSE-T model for SSOD. We use two standard benchmarks for object bounding box localization: VOC [9] and COCO [29]. As competing methods, we use the fully-supervised RetinaNet as a strong baseline, and the state-of-the-art SSOD method CSD [21] as a challenging comparison under the semi-supervised setup. We implement our method based on maskrcnn-benchmark [10]. For a fair comparison, we train, validate and test RetinaNet using the implementation from the same maskrcnn-benchmark as well.

IV-A Configurations

Datasets. For the VOC benchmark, we use VOC2007 and VOC2012, both of which contain 20 annotated semantic object classes. For the COCO challenge, which has 80 semantic classes, we follow the standard experimental protocol [1, 27, 28], which uses the COCO trainval35k split. In the VOC and COCO datasets, there are two subsets that are not provided with ground-truth annotations, i.e., the VOC2012 test set and the COCO extra set, so we use these two subsets as the unlabeled images. To enable the evaluation, we use the VOC2007 test set and the COCO2012 test-dev set as our test data. Table I shows detailed information about the datasets.

Experimental setup. We conduct all experiments on 4 NVIDIA 1080 Ti GPU cards. We use the standard SGD optimizer and set the batch size to 8. For the backbone of RetinaNet, we choose ResNet-50 [17] for the experiments on the VOC dataset, and we validate the performance of both ResNet-50 and ResNet-101 for the experiments on the COCO dataset.

When pretraining the detection model using labeled images, we train for 15 epochs with an initial learning rate of 0.005, divided by 10 at epoch 5 and epoch 8. When training the student model using unlabeled images, we train for 13 epochs with an initial learning rate of 0.0005, divided by 10 at epoch 10. It has been found that once an SSOD model converges to a local minimum, it is difficult to reach a better solution in the following training steps, so we carefully design the update strategy for $\lambda$. In this work, we aim at a stable transition from fully-supervised training to semi-supervised training by slowly increasing the weight of the unlabeled data: we gradually increase $\lambda$ from 0.02 to 1.6 for the ResNet-50 backbone and from 0.01 to 0.08 for the ResNet-101 backbone. As for $\beta$, we choose the values 0.07 and 0.1 for the ResNet-50 and ResNet-101 backbones respectively, which modulates the classification and localization losses to a similar scale.
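The ramp-up of $\lambda$ described above can be implemented with a simple linear schedule, sketched below; the step-based granularity is our assumption, and the 0.02 to 1.6 range is the one reported for the ResNet-50 backbone.

```python
def lambda_schedule(step: int, ramp_steps: int,
                    start: float = 0.02, end: float = 1.6) -> float:
    """Linearly increase the unsupervised weight lambda from `start` to
    `end` over `ramp_steps` training steps, then hold it constant."""
    frac = min(step / float(ramp_steps), 1.0)
    return start + frac * (end - start)
```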

Dataset    Train    Val      Train/Val   Test       Extra
VOC2007    2,501    2,510    5,011       4,952*     –
VOC2012    5,717    5,823    10,540      10,991**   –
COCO       80,000   35,000   115,000     5,000*     123,403**
  * Test set in our experiments

  ** Unlabeled images in our experiments

TABLE I: Dataset Statistics

IV-B Experiments on VOC dataset

In this experiment, we first fix the VOC2007 test set as the test data, because this subset contains ground-truth annotations that can be used to quantify the performance of a method. For the training set, we devise several different strategies; the results are shown in Table II. In the table, a checkmark indicates that the dataset is used as labeled images in a supervised manner, and the abbreviation SS (Semi-Supervised) indicates that the dataset is used as unlabeled images in a semi-supervised manner. We use the standard mean average precision (mAP) as the evaluation metric on the VOC dataset. We show the absolute increase of the evaluation results for CSD and TSE-T compared to the corresponding baselines.

Model          07train/val   12train/val   12test   mAP
RetinaNet      ✓             –             –        71.56
CSD-SSD-300    ✓             SS            SS       72.30
CSD-SSD-512    ✓             SS            SS       75.80
CSD-RFCN       ✓             SS            SS       74.70
TSE-T          ✓             SS            –        76.68
TSE-T          ✓             SS            SS       77.24
TSE-T          ✓             ✓             SS       80.73
TABLE II: Performance Evaluation on the VOC dataset

From the experimental results, we can draw the following conclusions. (1) The employment of unlabeled images in SSOD indeed improves the performance of a detection model, and the more unlabeled images are used, the better an SSOD model performs, resulting in a dramatic absolute performance improvement compared to the fully-supervised detection model trained using the same amount of labeled images. (2) Comparing the results of the TSE-T and CSD models, our TSE-T model trained with fewer unlabeled images (76.68 mAP) already outperforms the best-performing CSD model by 0.88 mAP. When using exactly the same training data, our TSE-T model achieves 77.24 mAP, which exceeds the best-performing CSD model by 1.44 mAP. (3) By employing more labeled images, which provides a better pretrained model, the mAP of our TSE-T model reaches 80.73, exceeding the CSD model by 4.93 mAP. This is a remarkable improvement and sets the state-of-the-art performance on the VOC benchmark under the semi-supervised setup.

Model 07train/val 12train/val 12test Ensemble EMA Detection Loss mAP
RetinaNet 71.56
TSE-T SS 72.45
TSE-T SS 75.11
TSE-T SS 76.24
TSE-T SS 76.68
RetinaNet 71.56
TSE-T SS SS 73.14
TSE-T SS SS 76.05
TSE-T SS SS 76.98
TSE-T SS SS 77.24
RetinaNet 78.36
TSE-T SS 78.87
TSE-T SS 79.37
TSE-T SS 80.35
TSE-T SS 80.73
TABLE III: Ablation Study on VOC dataset

IV-C Ablation study on VOC dataset

In this experiment, we validate the effectiveness of the basic modules in our TSE-T model, i.e., (a) the temporal teacher predictions self-ensembling, denoted “Ensemble”; (b) the temporal teacher model weights ensembling, denoted “EMA”; and (c) the customized detection-loss-based consistency regularization, denoted “Detection loss”. For the comparison, (a) when “Ensemble” is omitted, we use the teacher predictions from the latest training epoch as targets to train the student model; (b) when “EMA” is omitted, we use the temporal student predictions self-ensembling as the teacher prediction; and (c) when “Detection loss” is omitted, we use the standard Euclidean distance as the consistency regularization loss. We show the object detection results in Table III.

From the results, (1) we can see that each of the basic modules in our TSE-T model independently improves the performance of SSOD under various training conditions, and the combination of these basic modules results in the best performance of the detection model. This means that the performance of our TSE-T model is not limited by the upper-bound performance of each basic module; instead, the intrinsic integration of the proposed modules cooperatively leads to an additive improvement of our method. (2) When using more unlabeled images, the performance of our TSE-T model is further improved, the mAP increasing from 76.68 to 77.24. When using more labeled images, the TSE-T model gains a large-margin improvement from 77.24 to 80.73. (3) These results suggest that solely increasing the quantity of unlabeled images for an SSOD system may cause the performance improvement to reach a local maximum. In this situation, employing a certain amount of labeled images may guide the detection model out of this dilemma. The key factor is that the extra labeled images lift the lower bound of our TSE-T model, namely, the performance of the pretrained RetinaNet.

IV-D Experiments on COCO dataset

Model      Backbone    train  val  extra  AP     AP50   AP75   APS    APM    APL
RetinaNet  ResNet-50   ✓      –    –      34.51  53.26  36.54  17.96  37.29  46.56
TSE-T      ResNet-50   ✓      –    SS     35.42  53.88  37.40  18.87  40.16  48.70
RetinaNet  ResNet-50   ✓      ✓    –      36.34  55.22  38.90  19.66  39.94  48.95
TSE-T      ResNet-50   ✓      ✓    SS     36.96  55.70  39.42  19.59  40.76  50.12
RetinaNet  ResNet-101  ✓      ✓    –      39.03  58.31  41.66  22.01  42.83  51.87
TSE-T      ResNet-101  ✓      ✓    SS     40.14  59.58  42.78  23.93  44.70  50.99
TSE-T*     ResNet-101  ✓      ✓    SS     40.52  59.93  43.48  24.13  45.47  52.97
  * Uses extra training image augmentation, i.e., random image resizing.

TABLE IV: Performance Evaluation of Varying Backbone Networks on the COCO dataset

Considering that COCO is a more challenging object detection benchmark, we conduct experiments to identify an efficient backbone network for the detection model. Here, we choose ResNet-50 and ResNet-101 for comparison. In Table IV, we show the experimental results of the fully-supervised detector and our TSE-T model under various training data setups, using the standard COCO evaluation metrics.

From the results, we can draw the following conclusions. (1) Comparing TSE-T with RetinaNet, our TSE-T model outperforms its fully-supervised counterpart on the COCO benchmark when using the same backbone network, suggesting that our TSE-T model is generic for SSOD regardless of the specific backbone. (2) When using more labeled images to train RetinaNet on the COCO dataset, its performance obtains a remarkable improvement, which accordingly raises the lower-bound performance of our TSE-T model. (3) A deeper backbone network, such as ResNet-101, further improves the performance of our TSE-T model. To push the performance further, we use random image resizing as augmentation when employing unlabeled images during training; the results are shown in the last row of Table IV, indicated by an asterisk. In the end, the AP of our TSE-T model reaches 40.52, exceeding the fully-supervised baseline by 1.49 AP.

In Fig. 5, we visualize and compare the detection results from both VOC and COCO datasets.

V Conclusions

We propose the Temporal Self-Ensembling Teacher (TSE-T) model based on the KD framework, a generic and novel training strategy to improve the performance of SSOD. We aim to formulate better teacher predictions for unlabeled images, which regularize the student model toward the performance of the well-trained teacher model. Specifically, the TSE-T model includes two types of temporal self-ensembling strategies, i.e., the temporal teacher predictions self-ensembling and the temporal teacher model weights self-ensembling. Both jointly increase the data and model diversity, yielding better teacher predictions for unlabeled images, which are used as stable targets to train the student model. Moreover, for the consistency regularization term in SSOD, we used the focal loss to mitigate the data imbalance problem and proposed an improved version of the Smooth L1 loss as the localization loss, which made the model easier to train using unlabeled images and aligned the definitions of the loss functions for labeled and unlabeled images. Experimental results have shown that our method sets the state-of-the-art performance on the VOC2007 test set and obtains a dramatic improvement on the COCO2012 test-dev set, achieving 80.73 and 40.52 mAP respectively. A possible direction to further improve our work is to balance ensembling multiple heterogeneous models against training efficiency. In addition, we could take the category balance of unlabeled images into account and apply other augmentations to improve the detection of objects at large scales.

(a) Detection results from the VOC2007 test set
(b) Detection results from the COCO2012 test-dev set
Fig. 5: Detection results comparison of the TSE-T model and its fully supervised counterpart, RetinaNet, on the VOC and COCO datasets. The green bounding boxes indicate detections from RetinaNet, and the blue bounding boxes denote detections from our TSE-T model. We arrange the detection results of the same image side by side for convenient reading. For each dataset, we show examples of the following cases: the TSE-T model recalls difficult objects (top row); the TSE-T model alleviates false positives (middle row); the TSE-T model may fail to detect objects in extreme cases (bottom row).

References

  • [1] S. Bell, C. Lawrence Zitnick, K. Bala, and R. Girshick (2016) Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks. In CVPR, pp. 2874–2883. Cited by: §IV-A.
  • [2] W. H. Beluch, T. Genewein, A. Nürnberger, and J. M. Köhler (2018) The power of ensembles for active learning in image classification. In CVPR, pp. 9368–9377. Cited by: §II-D.
  • [3] Z. Cai and N. Vasconcelos (2018) Cascade r-cnn: delving into high quality object detection. In CVPR, pp. 6154–6162. Cited by: §I, §I.
  • [4] O. Chapelle, B. Scholkopf, and A. Zien (2009) Semi-supervised learning (chapelle, o. et al., eds.; 2006)[book reviews]. IEEE Transactions on Neural Networks 20 (3), pp. 542–542. Cited by: §I, §II-B.
  • [5] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia (2017) Multi-view 3d object detection network for autonomous driving. In CVPR, pp. 1907–1915. Cited by: §I.
  • [6] J. Dai, Y. Li, K. He, and J. Sun (2016) R-fcn: object detection via region-based fully convolutional networks. In NeurIPS, pp. 379–387. Cited by: §III-B.
  • [7] C. Doersch and A. Zisserman (2017) Multi-task self-supervised visual learning. In ICCV, pp. 2051–2060. Cited by: §I.
  • [8] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian (2019) Centernet: keypoint triplets for object detection. In ICCV, pp. 6569–6578. Cited by: §II-A.
  • [9] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. IJCV 88 (2), pp. 303–338. Cited by: §I, §I, §IV.
  • [10] FAIR (2018) maskrcnn-benchmark. https://github.com/facebookresearch/maskrcnn-benchmark. Cited by: §IV.
  • [11] G. French, M. Mackiewicz, and M. Fisher (2017) Self-ensembling for visual domain adaptation. arXiv preprint arXiv:1706.05208. Cited by: §I, §II-D, §III-C2.
  • [12] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pp. 580–587. Cited by: §II-A.
  • [13] R. Girshick (2015) Fast r-cnn. In ICCV, pp. 1440–1448. Cited by: §II-A.
  • [14] P. Goyal, D. Mahajan, A. Gupta, and I. Misra (2019) Scaling and benchmarking self-supervised visual representation learning. In ICCV, pp. 6391–6400. Cited by: §I.
  • [15] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2019) Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722. Cited by: §I, §III-C2.
  • [16] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In ICCV, pp. 2961–2969. Cited by: §I, §I, §II-A, §III-B.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §II-D, §IV-A.
  • [18] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §I, §I, §II-B.
  • [19] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In CVPR, pp. 7132–7141. Cited by: §II-D.
  • [20] L. Huang, Y. Yang, Y. Deng, and Y. Yu (2015) Densebox: unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874. Cited by: §II-A.
  • [21] J. Jeong, S. Lee, J. Kim, and N. Kwak (2019) Consistency-based semi-supervised learning for object detection. In NeurIPS, pp. 10758–10767. Cited by: §I, §I, §IV.
  • [22] A. Kolesnikov, X. Zhai, and L. Beyer (2019) Revisiting self-supervised visual representation learning. In CVPR, pp. 1920–1929. Cited by: §I.
  • [23] A. Kuznetsova, H. Rom, N. Alldrin, J. R. R. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, and V. Ferrari (2018) The open images dataset V4: unified image classification, object detection, and visual relationship detection at scale. CoRR abs/1811.00982. External Links: Link, 1811.00982 Cited by: §I.
  • [24] S. Laine and T. Aila (2016) Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242. Cited by: §I, §I, §I, §II-B, §II-D.
  • [25] H. Law and J. Deng (2018) Cornernet: detecting objects as paired keypoints. In ECCV, pp. 734–750. Cited by: §I, §II-A.
  • [26] D. Lee (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, Vol. 3, pp. 2. Cited by: §I, §I.
  • [27] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In CVPR, pp. 2117–2125. Cited by: §II-A, §IV-A.
  • [28] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In ICCV, pp. 2980–2988. Cited by: §I, §I, §I, §II-A, §III-A, §III-B, §IV-A.
  • [29] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In ECCV, pp. 740–755. Cited by: §I, §I, §IV.
  • [30] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen (2020) Deep learning for generic object detection: a survey. IJCV 128 (2), pp. 261–318. Cited by: §I, §I, §II-A.
  • [31] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In ECCV, pp. 21–37. Cited by: §I, §I, §II-A, §III-B.
  • [32] I. Masi, Y. Wu, T. Hassner, and P. Natarajan (2018) Deep face recognition: a survey. In SIBGRAPI, pp. 471–478. Cited by: §I.
  • [33] S. M. McKinney, M. Sieniek, V. Godbole, J. Godwin, N. Antropova, H. Ashrafian, T. Back, M. Chesus, G. C. Corrado, A. Darzi, et al. (2020) International evaluation of an ai system for breast cancer screening. Nature 577 (7788), pp. 89–94. Cited by: §I.
  • [34] M. Oquab, L. Bottou, I. Laptev, and J. Sivic (2015) Is object localization for free?-weakly-supervised learning with convolutional neural networks. In CVPR, pp. 685–694. Cited by: §I.
  • [35] M. Phuong and C. Lampert (2019) Towards understanding knowledge distillation. In ICML, pp. 5142–5151. Cited by: §I, §II-C.
  • [36] I. Radosavovic, P. Dollár, R. Girshick, G. Gkioxari, and K. He (2018) Data distillation: towards omni-supervised learning. In CVPR, pp. 4119–4128. Cited by: §I, §I, §II-C.
  • [37] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko (2015) Semi-supervised learning with ladder networks. In NeurIPS, pp. 3546–3554. Cited by: §I, §II-B.
  • [38] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In CVPR, pp. 779–788. Cited by: §I, §II-A, §III-B.
  • [39] J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In CVPR, pp. 7263–7271. Cited by: §II-A.
  • [40] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §II-A.
  • [41] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich (2014) Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596. Cited by: §II-D.
  • [42] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NeurIPS, pp. 91–99. Cited by: §I, §I, §II-A, §III-B.
  • [43] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Document Cited by: §I.
  • [44] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. IJCV 115 (3), pp. 211–252. Cited by: §II-B, §II-D.
  • [45] M. Sajjadi, M. Javanmardi, and T. Tasdizen (2016) Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In NeurIPS, pp. 1163–1171. Cited by: §I.
  • [46] P. Tang, C. Ramaiah, R. Xu, and C. Xiong (2020) Proposal learning for semi-supervised object detection. arXiv preprint arXiv:2001.05086. Cited by: §I, §II-C.
  • [47] A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In NeurIPS, pp. 1195–1204. Cited by: §I, §I, §I, §II-B.
  • [48] P. Viola and M. Jones (2001) Rapid object detection using a boosted cascade of simple features. In CVPR, Vol. 1, pp. I–I. Cited by: §II-A.
  • [49] F. Wan, P. Wei, J. Jiao, Z. Han, and Q. Ye (2018) Min-entropy latent model for weakly supervised object detection. In CVPR, pp. 1297–1306. Cited by: §I.
  • [50] D. Zhang, J. Han, L. Zhao, and D. Meng (2019) Leveraging prior-knowledge for weakly supervised object detection under a collaborative self-paced curriculum learning framework. IJCV 127 (4), pp. 363–380. Cited by: §I.
  • [51] L. Zheng, Y. Yang, and A. G. Hauptmann (2016) Person re-identification: past, present and future. arXiv preprint arXiv:1610.02984. Cited by: §I.
  • [52] X. Zhou, J. Zhuo, and P. Krahenbuhl (2019) Bottom-up object detection by grouping extreme and center points. In CVPR, pp. 850–859. Cited by: §II-A.
  • [53] C. Zhu, Y. He, and M. Savvides (2019) Feature selective anchor-free module for single-shot object detection. In CVPR, pp. 840–849. Cited by: §II-A.
  • [54] X. Zhu and A. B. Goldberg (2009) Introduction to semi-supervised learning. Synthesis lectures on artificial intelligence and machine learning 3 (1), pp. 1–130. Cited by: §I, §II-B.
  • [55] X. J. Zhu (2005) Semi-supervised learning literature survey. Technical report University of Wisconsin-Madison Department of Computer Sciences. Cited by: §I, §II-B.