Interpolation-based semi-supervised learning for object detection

by Jisoo Jeong, et al.
Seoul National University

Although labeling data for object detection costs substantially more than labeling data for classification, semi-supervised learning methods for object detection have received little attention. In this paper, we propose an Interpolation-based Semi-supervised learning method for object Detection (ISD), which identifies and addresses the problems caused by applying conventional Interpolation Regularization (IR) directly to object detection. We divide the output of the model into two types according to the objectness scores of the two original patches that are mixed in IR, and then apply a semi-supervised learning method suited to each type. This approach considerably improves the performance of semi-supervised learning as well as supervised learning. In the semi-supervised learning setting, our algorithm improves on the current state-of-the-art performance on the benchmark dataset (PASCAL VOC07 as labeled data and PASCAL VOC12 as unlabeled data) with the benchmark architectures (SSD300 and SSD512). In the supervised learning setting, our method, trained with VOC07 as labeled data, improves on the baseline by a significant margin and also outperforms the model trained with the previous state-of-the-art semi-supervised learning method using VOC07 as labeled data and VOC12 + MSCOCO as unlabeled data. Code is available at: .



1 Introduction

A dataset for object detection is much harder to create than the one for classification. While there is only one class in a single image for the classification task, there are multiple objects with different class labels in a single image for the object detection task. Therefore, the dataset for supervised object detection requires a pair of a class label and bounding box information for each object. Labeling each object takes more than a few seconds, and creating these datasets can take hundreds of hours russakovsky2015best ; bearman2016s ; dollar2012pedestrian .

Because object detection datasets take more time and resources to create, recent work has studied learning with weakly labeled data or unlabeled data rather than with fully labeled data alone. Here, fully labeled data pair each image with a class label and a bounding box for every object in it, weakly labeled data pair each image with class labels only, and unlabeled data consist of images alone. There are mainly three types of such object detection methods: weakly-supervised, semi-supervised, and weakly-semi-supervised learning. Weakly-supervised learning trains a model with a dataset that has only class information but no location information zhu2017soft ; shi2017weakly ; jie2017deep ; wang2018collaborative ; kim2019tell . Weakly-semi-supervised learning, on the other hand, uses weakly labeled data as well as fully labeled data tang2016large ; yan2017weakly . A weakly-semi-supervised detector performs better than a weakly-supervised one, but it still requires class labels for the weakly labeled data. Semi-supervised learning instead uses unlabeled data together with the labeled data wang2018towards ; Nguyen2019em ; jeong2019consistency .

In this paper, we address the semi-supervised object detection problem and propose a new method called Interpolation-based Semi-supervised learning for object Detection (ISD), which can also be applied in the supervised learning framework. Interpolation Regularization (IR), which mixes different images and learns to predict the combined label rather than a one-hot vector, performs outstandingly in supervised as well as semi-supervised learning for classification problems zhang2017mixup ; verma2018manifold ; verma2019interpolation ; berthelot2019mixmatch ; Verma2019GraphMixRT . However, it is challenging to apply IR directly to object detection because of the background class, which has very diverse and irregular texture. Fig. 1 shows an example. In Fig. 1(a), we mix images A and B using the mixing parameter $\lambda = 0.5$, as shown in the middle. The mixed green box is clearly 0.5 dog and 0.5 bird, as can be seen in Fig. 1(b). However, when an object is mixed with background, as in Fig. 1(c), the mixed image appears to be a 100% object corrupted by noise. If the detector is trained with conventional IR, such blurred or noisy mixtures push up the confidence of the background class, which degrades performance.

Figure 1: (a) mixed image created by random interpolation between images A and B (b) Type-I : both patches are from object classes. (c) Type-II : one of the patches is from the background class.

To tackle this problem, we divide the mixed patches into two types (Type-I and Type-II) depending on whether one of the original patches is background. Then, we apply a different IR algorithm suited to each type. The proposed ISD method, detailed in Sec. 3, can be combined with conventional semi-supervised learning methods such as CSD (consistency-based semi-supervised learning) jeong2019consistency to produce state-of-the-art semi-supervised object detection performance. The proposed scheme can also be used to enhance detection performance in the supervised learning framework. Our main contributions can be summarized as follows:

We show the problem in applying interpolation regularization directly to the object detection task and propose a novel interpolation-based semi-supervised learning algorithm for object detection.

In doing so, we define two types of interpolation cases in the object detection task and propose efficient semi-supervised learning methods suitable for each type.

We experimentally show the effectiveness of the proposed method for each type by demonstrating a significant performance improvement over the conventional algorithms achieving SOTA semi-supervised object detection performance.

The proposed method can also be applied to the framework of supervised learning, improving the detection performance significantly.

2 Related Work

Interpolation-based regularization is a promising approach due to its state-of-the-art performance and virtually no additional computational cost. These methods construct additional training samples by combining two or more training samples. Mixup zhang2017mixup and Between-class learning tokozume2017betweenclass are the earliest works that took steps in this direction. These methods are based on the principle that the output of a supervised network for an affine combination of two training samples should change linearly. This kind of inductive bias can be induced in a network by training it on synthetic samples constructed by mixing two samples and their corresponding targets. Manifold Mixup verma2018manifold mixes features in the deeper layers instead of input images. Other works such as CutMix cutmix construct the synthetic samples by mixing the CutOut devries2017cutout versions of two samples. Overall, these approaches can be interpreted as a form of data augmentation that seeks to construct additional training samples by combining two or more samples. In the unsupervised learning setting, interpolation-based regularizers have been explored in ACAI acai and AMR beckham2019adversarial . These methods learn better unsupervised representations by enforcing a constraint that the representation obtained by mixing the representations of two samples should correspond to a data point on the data manifold.
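The Mixup-style construction described above can be sketched in a few lines. This is a minimal illustration that uses only the Python standard library; the function name `mixup`, the flattened-list representation of a sample, and the default `alpha=1.0` are assumptions made for the example, not details taken from the cited works.

```python
import random


def mixup(x_a, x_b, y_a, y_b, alpha=1.0, rng=random):
    """Mix two flattened samples and their one-hot targets with one coefficient."""
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix, lam


# Two toy "images" with two pixels each and two classes.
x_mix, y_mix, lam = mixup([0.0, 1.0], [1.0, 0.0],
                          [1.0, 0.0], [0.0, 1.0],
                          rng=random.Random(0))
```

The mixed target stays a valid probability vector because the same coefficient weights both targets.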

Semi-supervised learning (SSL) is a dominant approach for machine learning when annotated data is scarce. There has been a recent surge of interest in deep-learning-based SSL for object classification verma2019interpolation ; berthelot2019mixmatch ; Verma2019GraphMixRT . These methods can be broadly categorized into (1) consistency regularization methods and (2) generative adversarial network (GAN) based methods. Consistency regularization methods are more appealing due to their simplicity, training stability, and state-of-the-art performance.

The central idea of consistency regularization methods is to enforce that the model predictions should not change under reasonable perturbations of the input. For object classification, such perturbations include random translation, random cropping, and horizontal flipping. Let $x$ and $x'$ be the original and the perturbed inputs, $d$ a distance function, $w(t)$ a weighting function over iterations, and $f$ the function on which the consistency loss is measured. The consistency loss is then computed in an unsupervised manner, and the total loss is given by a linear combination of the consistency loss and the supervised loss $\mathcal{L}_{s}$ as follows:

$$\mathcal{L} = \mathcal{L}_{s} + w(t)\, d\big(f(x), f(x')\big) \qquad (1)$$
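As a small illustration of this linear combination, the sketch below adds a weighted consistency term to a supervised loss. The squared L2 distance and the name `consistency_objective` are assumptions chosen for the example; any distance function between two predictions could be passed in.

```python
def squared_l2(p, q):
    """Squared Euclidean distance between two prediction vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def consistency_objective(sup_loss, pred_orig, pred_pert, weight,
                          distance=squared_l2):
    """Supervised loss plus a weighted consistency term.

    pred_orig and pred_pert are the model outputs for an input and for its
    perturbed version; weight plays the role of the iteration-dependent weight.
    """
    return sup_loss + weight * distance(pred_orig, pred_pert)
```

When the two predictions agree, the consistency term vanishes and only the supervised loss remains.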


Some notable examples of consistency training include the Π-model laine2016temporal , virtual adversarial training miyato2018virtual , and Mean Teacher tarvainen2017mean . Recent advances in this direction include interpolation consistency training (ICT) verma2019interpolation , its variants MixMatch berthelot2019mixmatch and ReMixMatch Berthelot2020ReMixMatch , and FixMatch sohn2020fixmatch .

ICT is a specific type of consistency regularization that encourages the prediction at an interpolation of unlabeled samples to be consistent with the interpolation of the predictions at those samples. FixMatch uses another form of consistency regularization, in which the model's prediction on a "weak augmentation" of an input is encouraged to be consistent with its prediction on a "strong augmentation" of the same input. For weak augmentation, FixMatch uses horizontal flipping, random translation, and cropping, and for strong augmentation it uses Cutout devries2017cutout , RandAugment cubuk2019randaugment , and CTAugment Berthelot2020ReMixMatch .

Semi-supervised learning for object detection has recently been studied in jeong2019consistency , where CSD, the first consistency-regularization-based semi-supervised object detection method, was proposed. They explored the consistency between the box predictions in the original image and its horizontally flipped version. To prevent the 'background' class from dominating the consistency loss in Eq. (1), they proposed the Background Elimination (BE) method, which excludes boxes with high background probability from the computation of the consistency loss. In this paper, we also utilize BE using the class probability of each candidate box.

3 Method

Figure 2: The proposed ISD loss for each type.

We denote a horizontally flipped version of an image $I$ as $\hat{I}$, and the image created by random mixing between two images $I_A$ and $I_B$ as $I_{AB} = \lambda I_A + (1 - \lambda) I_B$. Similar to Mixup, the mixing coefficient $\lambda$ is drawn from the $\mathrm{Beta}(\alpha, \alpha)$ distribution. In our method, we use SSD liu2016ssd , one of the most popular single-stage object detectors, as the detector. When training SSD, we add the newly proposed interpolation-based consistency regularization loss in combination with the flip-based consistency regularization loss of jeong2019consistency to enhance the performance. The network output of SSD, $f^{p,r,c,d}(I)$, denotes the output of the $p$-th layer of the pyramid, $r$-th row, $c$-th column, and $d$-th default box, and $(p, r, c, d)$ is written as $k$ for brevity. Each $f^{k}(I)$ is composed of $f^{k}_{cls}(I)$ and $f^{k}_{loc}(I)$, which are the softmax output vector and the localization offsets of the center and the size of the box, $[\Delta cx, \Delta cy, \Delta w, \Delta h]$, at position $k$, respectively. The mask $m^{k}(I)$, which is computed from $f^{k}_{cls}(I)$, is used in background elimination and interpolation type categorization for image $I$ and has a binary objectness value at each location $k$:

$$m^{k}(I) = \begin{cases} 1, & \text{if } \arg\max_{c} f^{k}_{cls}(I) \neq \text{background} \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$
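A binary objectness mask of this kind can be sketched as follows. The rule used here (a position counts as an object when its most probable class is not background) and the choice of index 0 for the background class are assumptions made for illustration; the name `objectness_mask` is likewise hypothetical.

```python
BACKGROUND = 0  # assumed index of the background class in the softmax vector


def objectness_mask(cls_probs):
    """Return 1 for each default-box position predicted as an object, else 0.

    cls_probs holds one softmax vector per position k.
    """
    mask = []
    for probs in cls_probs:
        top_class = max(range(len(probs)), key=probs.__getitem__)
        mask.append(1 if top_class != BACKGROUND else 0)
    return mask
```

For example, a position with probabilities `[0.9, 0.05, 0.05]` is masked out as background, while `[0.1, 0.7, 0.2]` is kept as an object.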


3.1 Interpolation-based Semi-supervised learning for Object Detection (ISD)

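Type categorization. We determine the type of a pair of patches with the background elimination method jeong2019consistency , which extracts only patches with a high objectness probability, and then apply the method appropriate for each type. Eq. (3) shows how each type of mask is calculated. The Type-I mask, $m^{k}_{I}$, is the element-wise product of the objectness masks $m^{k}(I_A)$ and $m^{k}(I_B)$ of images A and B; that is, it is 1 when the patches of both A and B are 1, and 0 otherwise. The Type-II mask for A ($m^{k}_{II\text{-}A}$), on the other hand, is the element-wise product of $m^{k}(I_A)$ and $1 - m^{k}(I_B)$, which means it is 1 when the patch in image A has a high objectness score while the corresponding patch at the same location in image B has a high background score. The mask $m^{k}_{II\text{-}B}$ is defined likewise with the roles of A and B exchanged.

$$m^{k}_{I} = m^{k}(I_A) \otimes m^{k}(I_B), \qquad m^{k}_{II\text{-}A} = m^{k}(I_A) \otimes \big(1 - m^{k}(I_B)\big), \qquad m^{k}_{II\text{-}B} = \big(1 - m^{k}(I_A)\big) \otimes m^{k}(I_B) \qquad (3)$$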
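The mask arithmetic can be checked with a short sketch. It assumes the two objectness masks are already available as lists of 0/1 values, one entry per position; the function name `type_masks` is hypothetical.

```python
def type_masks(mask_a, mask_b):
    """Split positions into Type-I and Type-II from two binary objectness masks.

    Type-I:    both patches are objects.
    Type-II-A: the patch in A is an object, the patch in B is background.
    Type-II-B: the patch in B is an object, the patch in A is background.
    """
    type1 = [a * b for a, b in zip(mask_a, mask_b)]
    type2_a = [a * (1 - b) for a, b in zip(mask_a, mask_b)]
    type2_b = [(1 - a) * b for a, b in zip(mask_a, mask_b)]
    return type1, type2_a, type2_b
```

Positions where both patches are background belong to neither type and therefore contribute to none of the interpolation losses.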


Type I loss: When the patches in images A and B are both likely to be objects (Type-I), we define a Type-I loss inspired by the ICT loss verma2019interpolation . Note that there are two differences compared to conventional ICT. First, we use the Jensen-Shannon divergence (JSD) as the consistency regularization loss (the distance function in Eq. (1)). Second, we use the same network to feed-forward all inputs, as in CSD, whereas ICT uses different networks for the mixed and original inputs via Mean Teacher tarvainen2017mean . Eq. (4) shows the loss function of Type-I, which is the distance between the mixed output of A and B and the output of the mixed image of A and B, $I_{AB}$:

$$\ell^{k}_{I} = \mathrm{JSD}\big(\lambda f^{k}_{cls}(I_A) + (1 - \lambda) f^{k}_{cls}(I_B),\; f^{k}_{cls}(I_{AB})\big) \qquad (4)$$


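The overall Type-I loss is the average over the patches whose Type-I mask is 1, i.e. $\mathcal{L}_{I} = \mathbb{E}_{k}\big[\ell^{k}_{I} \,\big|\, \mathbb{1}[m^{k}_{I} = 1]\big]$, where $\ell^{k}_{I}$ is the per-patch Type-I loss of Eq. (4). Here, $\mathbb{E}$ and $\mathbb{1}$ are the expectation and the indicator function, respectively.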
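The Type-I computation can be illustrated as below. This is a sketch under stated assumptions: class predictions are plain lists of probabilities, the mixing coefficient weights patch A and its complement weights patch B, and a small `eps` is added inside the logarithm for numerical safety. The names `kl_divergence`, `jsd`, and `type1_loss` are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def type1_loss(cls_a, cls_b, cls_mix, type1_mask, lam):
    """Average JSD between the interpolated predictions of A and B and the
    prediction on the mixed image, over positions whose Type-I mask is 1."""
    losses = []
    for p_a, p_b, p_mix, m in zip(cls_a, cls_b, cls_mix, type1_mask):
        if m == 1:
            target = [lam * a + (1.0 - lam) * b for a, b in zip(p_a, p_b)]
            losses.append(jsd(target, p_mix))
    return sum(losses) / len(losses) if losses else 0.0
```

The loss is zero when the prediction on the mixed patch equals the interpolation of the two original predictions, and positive otherwise.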

Type II loss: As shown in Fig. 2, in Type-II, one patch has a high probability of being foreground, while the other has a high probability of being background. In this case, instead of using the Type-I loss described above, we train the network to make similar predictions on the mixed patch and on the patch with high foreground probability. This loss can be interpreted as a form of the FixMatch loss sohn2020fixmatch , which encourages consistency between the predictions on a strong augmentation and a weak augmentation of an input. The parallel becomes clear when the mixed patch is regarded as a strong augmentation and the patch with high foreground probability as no augmentation. Note that, for classification, FixMatch is trained with targets created by pseudo-labeling the samples that exceed a threshold, whereas we do not need to set a specific threshold and the target is set to the output distribution of the non-augmented patch.

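We set $f^{k}(I_A)$ or $f^{k}(I_B)$ as the target, and train the mixed output ($f^{k}(I_{AB})$) to be close to $f^{k}(I_A)$ or $f^{k}(I_B)$. In doing so, the Kullback-Leibler (KL) divergence and the L2 loss are used as the classification and localization losses, respectively, as follows:

$$\ell^{k}_{II\text{-}A,cls} = \mathrm{KL}\big(f^{k}_{cls}(I_A) \,\|\, f^{k}_{cls}(I_{AB})\big), \qquad \ell^{k}_{II\text{-}A,loc} = \big\| f^{k}_{loc}(I_A) - f^{k}_{loc}(I_{AB}) \big\|^{2} \qquad (5)$$

The overall Type-II loss when patch A is foreground, $\mathcal{L}_{II\text{-}A}$, is calculated as the average of the sum of the two individual losses over the patches whose Type-II mask for A is 1, $\mathcal{L}_{II\text{-}A} = \mathbb{E}_{k}\big[\ell^{k}_{II\text{-}A,cls} + \ell^{k}_{II\text{-}A,loc} \,\big|\, \mathbb{1}[m^{k}_{II\text{-}A} = 1]\big]$. Likewise, $\mathcal{L}_{II\text{-}B}$ is calculated by applying the above loss with B as the foreground patch, and the total Type-II loss $\mathcal{L}_{II}$ is calculated by combining $\mathcal{L}_{II\text{-}A}$ and $\mathcal{L}_{II\text{-}B}$.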
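The one-sided Type-II loss can be sketched as follows. The sketch makes several assumptions that should be read as illustrative: the KL divergence is taken from the foreground target to the mixed prediction, the L2 loss is the squared difference summed over the four box offsets, and inputs are plain lists. The names `kl_divergence` and `type2_loss` are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def type2_loss(cls_fg, loc_fg, cls_mix, loc_mix, type2_mask):
    """Average of (KL classification + squared L2 localization) between the
    foreground patch outputs (target) and the mixed-image outputs, over the
    positions whose Type-II mask is 1."""
    losses = []
    for p_fg, l_fg, p_mix, l_mix, m in zip(cls_fg, loc_fg,
                                           cls_mix, loc_mix, type2_mask):
        if m == 1:
            cls_term = kl_divergence(p_fg, p_mix)
            loc_term = sum((a - b) ** 2 for a, b in zip(l_fg, l_mix))
            losses.append(cls_term + loc_term)
    return sum(losses) / len(losses) if losses else 0.0
```

Calling the function once with A as the foreground image and once with B, each with its own mask, gives the two one-sided terms.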

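Finally, the overall ISD loss is computed as a weighted combination of the Type-I loss ($\mathcal{L}_{I}$) and the Type-II loss ($\mathcal{L}_{II}$) as follows:

$$\mathcal{L}_{isd} = \gamma_{1} \mathcal{L}_{I} + \gamma_{2} \mathcal{L}_{II} \qquad (6)$$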


Figure 3: Combination of ISD with CSD. The original images ($I$) are flipped ($\hat{I}$), and the mixed images ($I_{mix}$) are obtained by combining the two. First, the order of the flipped images is changed by shuffling ($\mathrm{shuffle}(\hat{I})$); then $I$ and $\mathrm{shuffle}(\hat{I})$ are mixed ($I_{mix}$). The CSD loss is calculated between $I$ and $\hat{I}$, and the ISD loss is computed between $I_{mix}$ and $I$ and/or $\mathrm{shuffle}(\hat{I})$. In the original set ($I$), the blue box marks the labeled images, to which the supervised loss is applied.
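The batch construction in Figure 3 can be sketched with toy images stored as nested lists. How exactly the flipped batch is shuffled is not specified here, so the sketch assumes the two halves of the batch are swapped, which keeps every image away from its own flipped copy for an even batch size. The helper names `hflip`, `swap_halves`, `mix_images`, and `build_batches` are hypothetical.

```python
def hflip(image):
    """Horizontally flip an image stored as a list of pixel rows."""
    return [row[::-1] for row in image]


def swap_halves(batch):
    """Reorder a batch by swapping its first and second halves (assumed shuffle)."""
    half = len(batch) // 2
    return batch[half:] + batch[:half]


def mix_images(img_a, img_b, lam):
    """Pixel-wise interpolation lam * A + (1 - lam) * B."""
    return [[lam * a + (1.0 - lam) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(img_a, img_b)]


def build_batches(images, lam):
    """Return the flipped, shuffled-flipped, and mixed batches."""
    flipped = [hflip(img) for img in images]
    shuffled = swap_halves(flipped)
    mixed = [mix_images(a, b, lam) for a, b in zip(images, shuffled)]
    return flipped, shuffled, mixed
```

The original and flipped batches would feed the flip-consistency loss, while the original, shuffled-flipped, and mixed batches would feed the interpolation loss.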

3.2 Combination of ISD with CSD

For ISD training, three sets of image batches, $I_A$, $I_B$, and $I_{AB}$, should be inferred by the network. For efficient training, we set the batches $I_A$ and $I_B$ as the original images and their horizontally flipped versions, as shown in Fig. 3, between which the CSD loss is applied. However, if an image and its horizontally flipped version are mixed, the mixed image would have similar backgrounds and the same class in the center of the image. Therefore, as shown in Fig. 3, we make the mixed images by combining the original batch ($I$) with the half-shuffled flipped batch ($\mathrm{shuffle}(\hat{I})$). The total loss consists of the supervised loss ($\mathcal{L}_{s}$), the CSD loss ($\mathcal{L}_{csd}$), and the ISD loss ($\mathcal{L}_{isd}$) as follows:

$$\mathcal{L} = \mathcal{L}_{s} + w(t)\,\big(\mathcal{L}_{csd} + \mathcal{L}_{isd}\big) \qquad (7)$$

where $w(t)$ is a weight scheduling function.
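How the individual terms might be combined is sketched below. The weights 0.1 and 1 for the Type-I and Type-II losses follow the values reported in the experiments; the sigmoid-shaped ramp-up used as the weight schedule and the way it scales both unsupervised terms are assumptions for illustration, since the exact schedule is not given here. All function names are hypothetical.

```python
import math


def ramp_up(step, ramp_steps):
    """Assumed weight schedule: rises smoothly from exp(-5) to 1."""
    if ramp_steps <= 0:
        return 1.0
    phase = 1.0 - min(max(step, 0), ramp_steps) / ramp_steps
    return math.exp(-5.0 * phase * phase)


def isd_loss(type1, type2, gamma1=0.1, gamma2=1.0):
    """Weighted combination of the Type-I and Type-II losses."""
    return gamma1 * type1 + gamma2 * type2


def total_loss(sup, csd, isd, step, ramp_steps):
    """Supervised loss plus scheduled consistency terms (assumed form)."""
    return sup + ramp_up(step, ramp_steps) * (csd + isd)
```

Early in training the schedule keeps the unsupervised terms small, so the supervised loss dominates until the predictions become reliable.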

4 Experiments

We set up the experiments in the same environment as conventional semi-supervised learning methods for object detection. Similar to wang2018towards ; jeong2019consistency , we trained our model with the PASCAL VOC07 trainval (5k images) dataset everingham2010pascal as labeled data and the PASCAL VOC12 trainval (12k images) and MSCOCO trainval (123k images) datasets lin2014microsoft as unlabeled data, and tested on the PASCAL VOC07 test dataset. The PASCAL VOC and MSCOCO data consist of 20 and 80 classes, respectively. For the unlabeled MSCOCO dataset, we experiment with MSCOCO (full) and MSCOCO (VOC), in which each image contains only VOC classes, as in jeong2019consistency . We sample the mixing parameter $\lambda$ from Beta($\alpha$, $\alpha$) at every iteration. The weights of the Type-I and Type-II losses in Eq. (6) are set to {$\gamma_{1}$, $\gamma_{2}$} = {0.1, 1}, and $\alpha$ = 5 in the beta distribution. Since the number of Type-I samples is smaller than the number of Type-II samples, we set $\gamma_{1}$ to 0.1 to reduce the weight of each Type-I sample. Other learning parameters such as the learning rate and batch size are the same as in jeong2019consistency . In our experiments, we report the mean and standard deviation of the results of three runs.

| Method | Labeled data | Backbone network | mAP (%) | Speed (relative to SSD) |
| SSD300 liu2016ssd ; jeong2019consistency | VOC07 | VGG16 | 70.2 | 1 |
| SSD300 (CSD) jeong2019consistency | VOC07 | VGG16 | 69.3 | - |
| SSD300 (ISD) | VOC07 | VGG16 | 72.63 ± 0.67 | - |
| SSD300 (ISD + CSD) | VOC07 | VGG16 | 72.73 ± 0.12 | - |
| SSD300 liu2016ssd ; jeong2019consistency | VOC07 + VOC12 | VGG16 | 77.2 | 1 |
| SSD300 (ISD + CSD) | VOC07 + VOC12 | VGG16 | 78.60 ± 0.10 | - |
| RSSD300 jeong2017enhancement | VOC07 + VOC12 | VGG16 | 78.5 | 0.57 |
| DSSD321 fu2017dssd | VOC07 + VOC12 | ResNet-101 | 78.6 | 0.21 |
| SSD512 liu2016ssd ; jeong2019consistency | VOC07 + VOC12 | VGG16 | 79.6 | 1 |
| SSD512 (ISD + CSD) | VOC07 + VOC12 | VGG16 | 81.17 ± 0.15 | - |
| RSSD512 jeong2017enhancement | VOC07 + VOC12 | VGG16 | 80.8 | 0.66 |
| DSSD513 fu2017dssd | VOC07 + VOC12 | ResNet-101 | 81.5 | 0.29 |

Table 1: Detection results for PASCAL VOC2007 test set under the supervised training setting.

| SSL algorithm | Labeled data | Unlabeled data | mAP (%), SSD300 | mAP (%), SSD512 |
| Supervised learning liu2016ssd ; jeong2019consistency | VOC07 | - | 70.2 | 73.3 |
| Supervised learning liu2016ssd ; jeong2019consistency | VOC07 + VOC12 | - | 77.2 | 79.6 |
| CSD jeong2019consistency | VOC07 | VOC12 | 72.3 | 75.8 |
| Ours (ISD only) | VOC07 | VOC12 | 73.27 ± 0.50 | 76.37 ± 0.45 |
| Ours (ISD + CSD) | VOC07 | VOC12 | 74.20 ± 0.10 | 76.77 ± 0.06 |

Table 2: Detection results for PASCAL VOC2007 test set under the semi-supervised training setting. The following experiments use VOC07 (labeled) and VOC12 (unlabeled) data. Blue and red denote the baseline score and the best score, respectively.

4.1 Supervised Learning

We start by examining the effect of ISD on SSD300 in the supervised training setting. The results are presented in Table 1. In the first row block, SSD300 (base) trained with VOC07 (trainval) data achieves 70.2 mAP, while SSD300 (CSD) drops to 69.3 mAP, a clear side effect of over-regularization. On the other hand, SSD300 (ISD) and SSD300 (ISD + CSD) show 2.43% and 2.53% improvements in accuracy over SSD300 (base), respectively. Note also that the standard deviation of ISD + CSD is considerably lower than that of ISD only. This shows that combining ISD with a strong CSD regularizer stabilizes the training, making the network more robust to random batches and to the random choice of the mixing parameter $\lambda$.

In the second and third row blocks of Table 1, SSD300 (base) and SSD512 (base) trained with VOC07 + VOC12 (trainval) data achieve 77.2% and 79.6%, respectively. SSD300 (ISD + CSD) improves on this by 1.40% and SSD512 (ISD + CSD) by 1.57%. We compared our algorithm to other SSD variants with similar losses and slightly different structures. RSSD jeong2017enhancement and DSSD fu2017dssd are models that change the feature pyramid to improve the performance of SSD, but at the cost of slower training and inference. (Since the GPU type, implementation, and criterion of speed measurement differ from paper to paper, we report the speed given in each paper as a ratio relative to that of SSD.) The SSD300 (ISD + CSD) model performs close to RSSD300 and DSSD321 without changing the network structure, and the SSD512 (ISD + CSD) model outperforms RSSD512. SSD512 trained with our ISD + CSD achieves performance comparable to that of DSSD (less than 0.5% difference). This is particularly appealing because at inference time, DSSD513 is about 3 to 5 times slower than SSD. Interestingly, ISD training of SSD300 in a fully supervised manner using just the VOC07 dataset outperforms the previous state-of-the-art semi-supervised learning method (CSD), which uses VOC07 as labeled data as well as VOC12 as unlabeled data (compare row 3 of Table 1 and row 1 of Table 3).

4.2 Semi-Supervised Learning

We evaluate the performance of ISD in the SSL setting. As shown in Table 2, the SSD300 model trained only with VOC07 labeled data achieves 70.2%. With VOC07 labeled data and VOC12 unlabeled data, the SSD300 model achieves 72.3%, 73.27%, and 74.20% for CSD (the previous state-of-the-art method), ISD, and ISD + CSD, respectively. This demonstrates the effectiveness of our approach in the SSL setting. Moreover, ISD + CSD with VOC07 labeled data and VOC12 unlabeled data on SSD300 (Table 2, last row) shows a 1.47% improvement over the fully supervised setting with VOC07 labeled data on SSD300 (Table 1, row 4). This demonstrates that the performance gain is not due to ISD + CSD applied only to the labeled data; the unsupervised versions of the ISD + CSD loss are crucial for better performance. As with SSD300, for SSD512, ISD shows a significant improvement over CSD, with ISD + CSD achieving the best performance. We further extend the experimental analysis by using the MSCOCO dataset in addition to the VOC12 dataset as unlabeled samples. The results in Table 3 demonstrate that across all the unlabeled datasets, our ISD + CSD approach outperforms the baseline CSD-only approach by a significant margin.

| Detector | Labeled data | Unlabeled data | mAP (%), CSD jeong2019consistency | mAP (%), Ours (ISD + CSD) |
| SSD300 | VOC07 | VOC12 | 72.3 | 74.20 ± 0.10 |
| SSD300 | VOC07 | VOC12 + MSCOCO (full) | 71.7 | 73.57 ± 0.15 |
| SSD300 | VOC07 | VOC12 + MSCOCO (VOC) | 72.6 | 74.40 ± 0.10 |

Table 3: Detection results for PASCAL VOC2007 test set. The following experiments use VOC07 (labeled) and VOC12 & MSCOCO (unlabeled) data.

5 Discussion

Ablation studies for Type-I and Type-II : We experiment to verify the contribution of each type; each column in Table 4 represents a different type of ISD method. We added each type of ISD loss to the CSD baseline, which reduces the standard deviation of the experimental results. When we apply the Type-I loss, there is a small improvement over CSD (72.3%) when $\alpha$ is 2. When the Type-II loss is applied, on the other hand, the performance increases significantly compared to Type-I, and it increases most when $\alpha$ is 5. There are three reasons for these results. First, the numbers of Type-I and Type-II samples differ. With a trained model, there were 5 times as many Type-II samples as Type-I samples, which indicates that the influence of the Type-I loss is relatively small. Second, Type-I considers only the classification loss. The boundaries of the two objects that form the mixed patch are in different locations, and the boundary of the mixed patch is not an interpolation of them, so the localization loss cannot be applied in Type-I. Third, the two mixed objects may not be well aligned. More research is needed on alignment in Interpolation Regularization, which remains future work. Finally, combining Type-I and Type-II improves the performance for all values of $\alpha$.

In Table 5, we analyze the effect of the classification and localization losses in Type-II when $\alpha$ is 5. The classification loss on Type-II samples yields a larger performance improvement than the localization loss, and combining the two gives the best performance.

Beta distribution : In ISD, the mixing coefficient $\lambda$ is sampled from the Beta($\alpha$, $\alpha$) distribution. Table 4 shows the performance of ISD for various values of $\alpha$ across the different types of ISD losses. We observe that a wide range of $\alpha$ gives improved performance over the baseline (CSD with 72.3% mAP), showing that ISD is not very sensitive to the value of $\alpha$. In general, we recommend setting $\alpha$ to a sufficiently large value, such as in the range [2, 5]. The reason for choosing a relatively large $\alpha$ is as follows. With smaller values of $\alpha$, $\lambda$ will be close to either 0 or 1 with high probability, so most mixed images will be close to one of the two original images. In this case, the mixed image is an extremely weak augmentation of one image and an extremely strong augmentation of the other, resulting in lower performance with high variance. In contrast, increasing $\alpha$ increases the probability of $\lambda$ being close to 0.5, which provides an appropriate level of regularization. Note that if $\alpha$ is too large, $\lambda$ will be concentrated too much around 0.5, and all the augmented samples will be too different from the original images, resulting in degraded performance with high variance at test time.
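The concentration effect can be quantified with the closed-form standard deviation of a symmetric Beta distribution, whose variance is $1 / (4(2\alpha + 1))$ with mean 0.5. The short sketch below evaluates it for the values of the Beta parameter used in the ablation; the function name `beta_std` is hypothetical.

```python
def beta_std(alpha):
    """Standard deviation of Beta(alpha, alpha), whose mean is 0.5."""
    return (1.0 / (4.0 * (2.0 * alpha + 1.0))) ** 0.5


for alpha in (1, 2, 5, 10):
    # A smaller spread means the mixing coefficient stays closer to 0.5.
    print(alpha, round(beta_std(alpha), 3))
```

The spread shrinks monotonically as the parameter grows, which matches the qualitative argument: small values push the coefficient toward 0 or 1, and large values pin it near 0.5.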

| Beta($\alpha$, $\alpha$) | Type-I mAP (%) | Type-II mAP (%) | Type-I + Type-II mAP (%) |
| $\alpha$ = 1 | 72.07 ± 0.32 | 72.97 ± 0.06 | 73.40 ± 0.36 |
| $\alpha$ = 2 | 72.57 ± 0.15 | 73.80 ± 0.26 | 73.87 ± 0.15 |
| $\alpha$ = 5 | 72.47 ± 0.21 | 74.07 ± 0.15 | 74.20 ± 0.10 |
| $\alpha$ = 10 | 72.00 ± 0.40 | 73.37 ± 1.01 | 73.87 ± 0.55 |

Table 4: Ablation study for $\alpha$ and each type (SSD300 + ISD) with the VOC07(L) + VOC12(U) training dataset and the VOC07 testing dataset. Each row represents a value of $\alpha$ of the beta distribution, and each column represents a type. All the experiments in this table are performed by adding each loss to CSD. (CSD-only achieved 72.3%.)

| VOC07(L) + VOC12(U) | mAP (%) |
| Type-II (cls) | 73.70 ± 0.20 |
| Type-II (loc) | 73.17 ± 0.15 |
| Type-II (cls + loc) | 74.07 ± 0.15 |

Table 5: Ablation study of Type-II losses on PASCAL VOC2007 test set. All the experiments in this table are performed by adding each loss to CSD. ($\alpha$ is 5 and CSD-only achieved 72.3%.)

Training model size : For ISD training, the network must infer three image batches rather than the single batch of conventional SSD. Together with the calculation of the additional losses, this requires more than three times the memory of conventional SSD. We used Nvidia 1080Ti GPUs and assigned 4 and 8 GPUs to the SSD300 and SSD512 models with ISD training, respectively. With fewer GPUs, our implementation could not be trained because of the limited memory budget. At test time, however, the model has the same network size and inference time as the base network while achieving better performance.

Object detector : In this paper, we have used the SSD model among single-stage detectors. For other detectors, algorithm-specific modifications are needed to apply interpolation regularization successfully. However, the basic idea of separating Type-I and Type-II samples and applying a different loss to each case remains valid. In a two-stage detector, for example, Regions of Interest (RoIs) are obtained by a Region Proposal Network (RPN), and classification is performed at those locations. Since the RoIs of images A and B and of their mixed image are all different, one set of RoIs must be applied to the other images to obtain the one-to-one correspondence our algorithm needs. If the RoIs of A are applied to the other images, the Type-II loss between B and the mixed image cannot be obtained, and if the RoIs of A, B, and the mixed image are each applied to the other images, a lot of computation will be required. How to apply an interpolation-based regularizer to two-stage detectors is thus an interesting avenue for research.

6 Conclusion

In this paper, we have proposed ISD, a simple and efficient Interpolation-based semi-supervised learning algorithm for object detection using single-stage detectors. We started by investigating the challenges that occur when the Interpolation Regularization methods for the classification task are applied directly to an object detection task, and have addressed these challenges by proposing different types of interpolation-based loss functions. Our method shows significant improvement in both semi-supervised and supervised object detection tasks over the previous state-of-the-art methods, compared over the same dataset and the same architecture settings. We further demonstrate that combining ISD with the previous method of CSD can further improve the performance and advance the current state-of-the-art. We leave the exploration of Interpolation Regularization for Two-stage detectors as a future work.

7 Statement of the potential broader impact of this work

ISD is a fundamental algorithm for object detection in images, which could conceivably be used for any application that requires object detection. It allows for training with unlabeled data, which may make it more useful for those with only a small amount of labeled data, make it more widely usable, and facilitate new applications of object detection. For example, extending ISD to medical vision applications could be particularly useful, since labeled data is usually scarce in these domains. Furthermore, it might make smaller organizations or non-profits, which have less budget for collecting labeled data, more competitive with larger organizations. In general, improving performance on the object detection task could have a variety of applications, which could be positive, negative, or more complicated, depending on the nature of the organization using them and the task they are used for.