In recent years, object detection has greatly benefited from the rapid development of deep convolutional neural networks (CNNs) [42, 25, 44, 27], whose performance is heavily dependent on a mass of labeled training images. However, building a large-scale, well-annotated dataset for object detection is far more laborious and costly than annotating whole images for classification. One promising way to acquire more training images efficiently is to augment existing datasets or to generate synthetic images in an effective way.
In terms of data augmentation for object detection, traditional methods perform geometrical transformations (e.g., horizontal flipping, multi-scale strategy, patch crop and random erasing) on the original images to vary their spatial structures. However, these methods hardly change the visual content and context of objects, bringing little increase in the diversity of the training set. Recently, several studies [21, 15, 18, 12, 32] have shown that generating new synthetic images in a Cut-Paste manner is a promising way to augment detection datasets. Specifically, given cut-out foreground objects, these methods paste them onto background images, where the pasted positions are estimated by considering 3D scene geometry, in a random manner, by predicting support surfaces, or by modeling visual context [12, 32]. Meanwhile, the above Cut-Paste methods [12, 32] show that appropriate visual context plays a key role in data augmentation for object detection. However, it is very difficult for existing Cut-Paste methods to model visual context precisely at all times, even though some prediction mechanisms are carefully designed. Furthermore, they require external datasets for collecting foreground images, background images, or both.
To overcome the above issues, this paper proposes a simple yet efficient instance-switching (IS) strategy. Specifically, given a pair of training images containing instances of the same class, our IS method generates a new pair of training images by switching instances of the same class while taking their shape and scale information into account. As illustrated in Figure 1 (a), our IS strategy always preserves contextual coherence in the original images. Meanwhile, a detector (i.e., Faster R-CNN  with ResNet-101 ) trained on the original dataset can correctly detect objects in the original images, but misses the switched instances (i.e., plant, denoted by red dashed boxes) in the synthetic images, indicating that IS can increase the diversity of samples. In addition, our IS clearly requires no external datasets, and naturally keeps more coherence in visual context compared with methods that paste cut-out foreground objects onto a new background [21, 15, 12].
A natural question is: what criteria should we follow when performing IS? We first conduct an analysis using Faster R-CNN with ResNet-101 on the most widely used MS COCO . Figure 1 (b) and (c) show the distribution of instances in each class and the average precision (AP) of each class, respectively. We can clearly see that the distribution of instances is long-tailed and that detection performance varies considerably across classes. Many works demonstrate that data imbalance has adverse effects on image classification [7, 11, 26, 29, 8, 48]. Meanwhile, detection difficulties of different classes vary greatly, and the distribution of instances and the detection performance of each class do not always behave in the same way. The above issues of instance imbalance and class importance frequently occur in real-world applications and inevitably harm detection performance. However, few object detection methods are concerned with these issues, and the simple IS also lacks the ability to handle instance imbalance and to consider class importance. To this end, we explore a progressive and selective scheme for our IS strategy to generate synthetic images for object detection while training detectors with a class-balanced loss. The resulting method, called Progressive and Selective Instance-Switching (PSIS), enhances instance balance by selectively performing IS (i.e., inversely proportional to the original instance frequency) to adjust the distribution of instances [4, 22, 3], combined with a class-balanced loss . To consider class importance, PSIS augments the training dataset in a progressive manner guided by detection performance (i.e., increasing the number of IS operations for the classes with the lowest APs). Experiments are conducted on the MS COCO dataset  to evaluate our PSIS.
The contributions of this paper are summarized as follows: (1) We propose a simple yet efficient instance-switching (IS) strategy to augment training data for object detection. The proposed IS increases the diversity of samples while preserving contextual coherence in the original images and requiring no external datasets. (2) We propose a novel Progressive and Selective Instance-Switching (PSIS) method to guide our IS strategy so that it enhances instance balance and accounts for class importance, issues that are ubiquitous yet rarely addressed by detection methods. Our PSIS further improves detection performance. (3) We thoroughly evaluate the proposed PSIS on the challenging MS COCO benchmark; experimental results show that our PSIS is superior and complementary to existing data augmentation methods and brings clear improvements over various state-of-the-art detectors (e.g., Faster R-CNN , Mask R-CNN  and SNIPER ), while also having the potential to improve the performance of instance segmentation .
2 Related Work
Our work focuses on data augmentation for object detection. Although many data augmentation methods have been proposed for deep CNNs in the image classification task [9, 28, 45, 49, 10], it is not clear whether they are suitable for object detection. Traditional data augmentation for object detection performs geometrical transformations on images [39, 36, 17, 23], and some works use random occlusion masks to drop out parts of images  or learn occlusion masks on convolutional feature maps  to generate hard positive samples. An alternative solution is to generate new synthetic images in a Cut-Paste manner [21, 15, 18, 12, 32]. Compared with the above methods, our method increases the diversity of samples more than the traditional ones, and models more precise visual context than Cut-Paste methods in an efficient way. Generative Adversarial Networks (GANs) [20, 6, 46, 1] have shown promising ability for image synthesis. However, these methods are unable to automatically annotate bounding boxes of objects, so they cannot be directly used for object detection. The recently proposed compositional GAN  introduces a self-consistent composition-decomposition network to generate images by combining two instances while explicitly learning possible interactions, but it is incapable of generating images with complex context. Remez et al.  use a copy-and-paste GAN to generate images for weakly supervised segmentation tasks. However, synthetic images generated by GANs usually suffer from a clear domain gap with real images, bringing side effects for subsequent applications .
The way our PSIS method enhances instance balance is related to works that handle the issue of class imbalance in image classification [8, 11, 48, 26, 4, 22]. Among them, the methods of [4, 22, 3] balance the distribution of classes by re-sampling those containing fewer training samples. Another line of methods performs sample re-weighting by introducing cost-sensitive losses [26, 29, 40, 8], which can alleviate the over-fitting caused by duplicated samples in the re-sampling process. Wang et al.  propose an on-line method to adaptively fuse the sampling strategy with loss learning using two-level curriculum schedulers. Differently, our PSIS focuses on the balance of instances rather than images, as one image may contain multiple instances. Moreover, we combine re-sampling with sample re-weighting to better enhance instance balance. Additionally, the focal loss  is proposed to handle class imbalance in one-stage detectors, but it only focuses on two classes (i.e., foreground and background). Our PSIS employs a progressive augmentation process, which is related to . The latter proposes a Drop and Pick training policy to select samples for reducing training computation in the image classification task. In contrast, our PSIS progressively increases the number of instances for the classes with the lowest APs.
3 Proposed Method
In this section, we introduce the proposed Progressive and Selective Instance-Switching (PSIS) method. We first describe our Instance-Switching (IS) strategy for synthetic image generation. Then, selective re-sampling and a class-balanced loss are introduced to enhance instance balance. Additionally, we perform IS in a progressive manner to account for class importance. Finally, we give an overview of our PSIS for object detection.
3.1 Instance-Switching for Image Generation
Previous works [12, 32] show that appropriate visual context plays a key role in synthetic images for object detection. To precisely model the visual context of synthetic images while increasing the diversity of training samples, this paper proposes an instance-switching (IS) strategy for synthetic image generation. Figure 2 illustrates the core idea of our IS for image generation. Let be the training set of class (e.g., teddy bear) involving images. To keep the visual context of synthetic images as coherent as possible, we define a candidate set (indicated by ) based on segmentation masks of instances. (In this paper, our experiments are conducted on the MS COCO dataset , where the segmentation mask of each instance can be reliably obtained using the COCO API. Note that all existing Cut-Paste based data augmentation methods for object detection employ segmented objects or masks [21, 15, 18, 12].)
Specifically, our consists of a set of quadruples ,, where , , and satisfy the following conditions:
where and are two images in ; and are instances of label in images and , respectively. and are two functions matching the shapes and scales of instances and . Let and be the masks (binary images) of and ; we align and rescale them to the same size, and denote the normalized masks by and . Accordingly, the function can be computed based on the sum of squared differences (SSD) [51, 31], i.e.,
where indicates the area of , counted by the number of ones in , and is a maximum function. Then, the function is defined as
According to Eqns. (1), (2) and (3), consists of a set of quadruples , , where instances and have similar shapes (i.e., shape differences are less than threshold ) and controlled scaling ratios (i.e., scaling ratios range from to ). Given the candidate set , we can generate new training images , by switching instances and through rescaling and cut-paste. Finally, we follow  in employing Gaussian blurring to smooth boundary artifacts. Once a quadruple , completes IS, we remove it from to avoid duplicated sampling. Clearly, our IS method preserves contextual coherence in the original images and requires no external datasets.
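The candidate-set conditions in Eqns. (1)-(3) can be sketched in code as follows, assuming binary instance masks as input; the resize resolution, the shape threshold, and the scaling-ratio range are illustrative placeholders for the parameters elided above:

```python
import numpy as np

def _crop_and_resize(mask, size=64):
    """Crop a binary mask to its bounding box and rescale it to
    size x size with nearest-neighbor sampling (a stand-in for the
    alignment/rescaling step before computing the SSD)."""
    ys, xs = np.nonzero(mask)
    crop = mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    ry = np.arange(size) * crop.shape[0] // size
    rx = np.arange(size) * crop.shape[1] // size
    return crop[np.ix_(ry, rx)].astype(np.float32)

def normalized_ssd(mask_a, mask_b, size=64):
    """Shape difference: SSD between the normalized masks, divided
    by the larger mask area (this normalization is an assumption)."""
    na, nb = _crop_and_resize(mask_a, size), _crop_and_resize(mask_b, size)
    return float(np.sum((na - nb) ** 2) / max(na.sum(), nb.sum()))

def scale_ratio(mask_a, mask_b):
    """Instance scale ratio, measured by mask areas."""
    return mask_a.sum() / mask_b.sum()

def is_switchable(mask_a, mask_b, shape_thr=0.5, ratio_range=(0.5, 2.0)):
    """Candidate-set test: similar shapes and a controlled scaling
    ratio (both thresholds are illustrative)."""
    r = scale_ratio(mask_a, mask_b)
    return bool(normalized_ssd(mask_a, mask_b) < shape_thr
                and ratio_range[0] <= r <= ratio_range[1])
```

Instance pairs passing `is_switchable` would form the candidate set; the switched instances can then be rescaled, pasted, and blended with Gaussian blurring as described above.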
3.2 Enhancement of Instance Balance
The data in many real-world applications follow long-tailed distributions. As shown in Figure 1, the most widely used object detection dataset  also suffers from an extremely imbalanced distribution of instances, posing a challenge to detection methods. To alleviate this issue, we propose to exploit a selective re-sampling strategy for balancing the distribution of instances and a class-balanced loss  for avoiding the over-fitting caused by re-sampling.
3.2.1 Selective Re-sampling Strategy
To obtain an instance-balanced training dataset, a natural scheme is to select the same number of quadruples in each , and perform IS to generate synthetic images. The resulting dataset is dubbed , and the selected set of quadruples in the -th class is indicated by . Comparing Figure 1 (b) with Figure 3 (a), the distribution of instances in is more balanced than that in the original training dataset . However, one image usually contains varying numbers of different instances, so the distribution of instances in the equally sampled dataset is still far from uniform. Hence, we propose a selective re-sampling strategy to further adjust the distribution of instances in .
Let small and large classes denote the classes with fewer and more instances, respectively. Our selective re-sampling strategy is based on a straightforward rule: (1) on the one hand, we pick more images for small classes, where each picked image should contain as many instances of small classes as possible while involving as few instances of large classes as possible; (2) on the other hand, we drop images for large classes, where the dropped images have the opposite characteristics to the picked ones. Intuitively, if we persistently carry out the above drop-pick process, the distribution of instances in will tend to be uniform.
In our case, we drop or pick the image if it satisfies the following conditions:
where and are the numbers of instances in and , respectively; and indicate the number of instances of the -th class and the number of all instances in , respectively. Furthermore, it is difficult to directly handle all classes as the number of classes increases, so we carry out the above drop-pick process in two stages. In the first stage, we handle the maximum and minimum classes, and the drop-pick process stops when the percentages of instances in the minimum and maximum classes reach 1.5 times and half of the original ones, respectively. Once the drop-pick process stops, the maximum and minimum classes change, so we iteratively carry out this process 20 times. In the second stage, we extend the above process to all classes. After our selective re-sampling strategy, as shown in Figure 3 (b), the distribution of instances in is adjusted to be approximately uniform, which is indicated by .
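A drop-only sketch of this process is given below (the pick step and the two-stage schedule are omitted for brevity, and the scoring rule used to choose which image to drop is an assumption rather than the paper's exact conditions):

```python
from collections import Counter

def drop_pick(images, n_iters=20):
    """Drop-only sketch of the drop-pick process: repeatedly find the
    current maximum and minimum classes, and drop the image that
    contributes most to the maximum class and least to the minimum
    class. `images` maps image id -> list of instance class labels."""
    kept = dict(images)
    for _ in range(n_iters):
        totals = Counter()
        for labels in kept.values():
            totals.update(labels)
        if not totals:
            break
        max_cls = totals.most_common(1)[0][0]
        min_cls = min(totals, key=totals.get)

        def drop_score(labels):
            # High score: many instances of the (over-represented)
            # maximum class, few instances of the minimum class.
            return labels.count(max_cls) - labels.count(min_cls)

        worst = max(kept, key=lambda k: drop_score(kept[k]))
        if drop_score(kept[worst]) > 0:
            del kept[worst]
    return kept
```

Iterating this rule shrinks the gap between the largest and smallest class counts, mirroring how the full drop-pick process pushes the instance distribution toward uniform.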
3.2.2 Class-Balanced Loss
The above selective re-sampling strategy can generate a synthetic training dataset in which the distribution of instances is approximately uniform, but it is still very difficult to make the whole training set (i.e., ) obey a uniform distribution. Moreover, over-sampling easily generates duplicated images, leading to over-fitting. A similar phenomenon is also found in our experiments (refer to Section 4.1.2 for details). To this end, we exploit the recently proposed class-balanced loss  to re-weight instances, which reformulates the softmax loss by introducing a weight that is inversely proportional to the number of instances in each class. Given an instance with label and the prediction vector, where is the number of classes, the class-balanced loss can be computed as:
where is the number of instances belonging to the class of , and is a regularized constant controlling the effect of the number of instances on the final prediction. In particular, Eqn. (5) degenerates to the standard softmax loss if , and will enlarge the effect of the number of instances, which is evaluated in Section 4.1.2.
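As a concrete sketch, the weighting below follows the class-balanced loss of Cui et al. (cited above), where an instance of a class with n instances is weighted by (1 - beta) / (1 - beta ** n); since the symbols of Eqn. (5) are elided in the text, this exact form is an assumption consistent with it (beta = 0 recovers the plain softmax loss, and beta -> 1 strengthens the re-weighting):

```python
import numpy as np

def class_balanced_loss(logits, label, counts, beta=0.999):
    """Class-balanced softmax loss in the form of Cui et al.: the
    instance is weighted by (1 - beta) / (1 - beta ** n_y), where
    n_y is the number of instances of its class."""
    z = logits - logits.max()                    # numerically stable softmax
    log_prob = z - np.log(np.exp(z).sum())
    weight = (1.0 - beta) / (1.0 - beta ** counts[label])
    return float(-weight * log_prob[label])
```

With this weighting, instances of rare classes contribute more to the loss than instances of frequent classes, counteracting the residual imbalance left after re-sampling.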
3.3 Progressive Instance-Switching
In the above section, we introduced two strategies to enhance instance balance. As shown in Figure 1 (b) and (c), the detection performance (i.e., AP) of each class differs considerably, and the AP of each class does not always correlate positively with the distribution of instances. These observations encourage us to pay more attention to the classes with the lowest APs to further improve detection performance. To this end, we introduce progressive instance-switching for data augmentation, which generates more training images for the classes with the lowest APs in a self-supervised learning manner.
Specifically, we first train a detector using a predefined training dataset . After training epochs, we evaluate the current detector on the validation set and pick up the classes with the lowest APs. To control the number of augmented images, avoiding over-fitting and breaking the balance, we perform selective instance-switching (Section 3.2.1) to generate training images for the -th class ( is one of the classes with the lowest APs) based on the proportion :
where and are the number of classes and the image percentage to be augmented, which are decided by cross-validation. indicates the rank of the AP of the -th class sorted in descending order, e.g., if the -th class has the lowest AP. The newly generated dataset is denoted by ; given it, we continue training the current detector on the union of and . According to Eqn. (6), our progressive instance-switching method augments more training images for the classes with lower APs within a controlled range, thereby taking class importance into consideration.
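A rank-based allocation in the spirit of Eqn. (6) can be sketched as below; since the exact form of the proportion is elided in the text, the linear rank weighting here is an assumption, and the function name and default values are illustrative:

```python
def progressive_allocation(aps, k=10, p=0.4, n_images=118_000):
    """Allocate synthetic images to the k lowest-AP classes: the
    class with rank r (r = 0 for the lowest AP) receives a share
    proportional to (k - r), so lower-AP classes get more images,
    all within the overall budget p * n_images."""
    ranked = sorted(aps, key=aps.get)[:k]        # lowest APs first
    weights = {c: k - r for r, c in enumerate(ranked)}
    total = sum(weights.values())
    budget = p * n_images
    return {c: int(budget * w / total) for c, w in weights.items()}
```

The budget cap plays the role of the controlled range in Eqn. (6): generation stays bounded even when many classes perform poorly.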
3.4 PSIS for Object Detection
So far, we have introduced our proposed progressive and selective instance-switching (PSIS) method. Finally, we show how to apply PSIS to object detection, as summarized in Algorithm 1. Specifically, we first generate a uniformly augmented dataset based on the original training dataset as described in Section 3.2.1. Then we combine with as the initial training dataset , and employ to train the detector with the class-balanced loss (5). After training epochs, we augment using the current detector following the rules in Section 3.3, and continue training the detector on the union of and . Finally, our PSIS algorithm outputs an object detector and an augmented dataset . As shown in our experiments, the augmented dataset can be flexibly adopted by other detectors or tasks.
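The overall procedure of Algorithm 1 can be summarized by the following high-level sketch; the helper names, the epoch split, and the detector interface are all illustrative placeholders rather than the paper's actual code:

```python
def psis_train(detector, D_or, generate_uniform, progressive_augment,
               total_epochs=14, switch_epoch=8):
    """High-level sketch of Algorithm 1. `generate_uniform` builds
    the instance-balanced synthetic set via selective IS (Sec. 3.2.1)
    and `progressive_augment` builds extra images for the lowest-AP
    classes (Sec. 3.3); both are placeholders."""
    D_syn = generate_uniform(D_or)                    # uniform synthetic set
    train_set = D_or + D_syn                          # initial training set
    for epoch in range(total_epochs):
        detector.fit(train_set, class_balanced=True)  # class-balanced loss
        if epoch == switch_epoch:                     # progressive step
            aps = detector.evaluate()                 # per-class APs
            train_set = train_set + progressive_augment(D_or, aps)
    return detector, train_set
```

The returned augmented set is what the experiments below reuse when training other detectors, which is why PSIS transfers across architectures without re-running the generation loop.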
| Training set | Detector | AP | AP50 | AP75 | APS | APM | APL | AR1 | AR10 | AR100 | ARS | ARM | ARL |
| | Faster R-CNN (ResNet-101) | 27.3 | 48.6 | 27.5 | 8.8 | 30.4 | 43.1 | 25.8 | 38.5 | 39.3 | 16.3 | 44.5 | 60.2 |
In this section, we evaluate our PSIS method on the challenging MS COCO 2017 , the most widely used benchmark for object detection, containing about 118k training, 5k validation and 40k testing images collected over 80 classes. We first conduct ablation studies using Faster R-CNN  with ResNet-101 . For determining candidate sets for our instance-switching, we fix the parameters , and in Eqn. (1) to , and , respectively. We train Faster R-CNN (ResNet-101) for 14 epochs (i.e., in Algorithm 1) and set to . The initial learning rate is set to , and is decreased by a factor of 10 after 11 training epochs. Horizontal flipping is used for data augmentation. After generating the augmented dataset () with our PSIS and Faster R-CNN, we directly apply to four state-of-the-art detectors (i.e., FPN , Mask R-CNN , SNIPER  and BlitzNet ) for comparison with related works. Finally, we verify the generalization ability of on the instance segmentation task. Among the evaluated detectors, Faster R-CNN, SNIPER and BlitzNet are implemented using the source code released by their respective authors; we employ the publicly available toolkit  to implement FPN  and Mask R-CNN . All programs run on a server equipped with four NVIDIA GTX 1080Ti GPUs. Following the standard evaluation metric (i.e., mean Average Precision (mAP)), we report results on the validation set for ablation studies and on the 2017 evaluation server for comparison.
4.1 Ablation studies
In this subsection, we assess the effects of the different components of our PSIS, including the instance-switching strategy, instance balance enhancement and progressive instance-switching. To this end, we train Faster R-CNN  on various training datasets under exactly the same experimental settings, and report results on the validation set.
4.1.1 Instance-switching Strategy
Using our instance-switching strategy in the equal-sampling manner, we can generate a synthetic dataset , which shares the same size (i.e., 118k) as the original training dataset . As compared in the top two rows of Table 1, the combination of with achieves consistent improvements over alone under all evaluation metrics. In particular, Faster R-CNN (ResNet-101) trained with outperforms the one trained with by 1.1% in terms of mAP under . This clear improvement verifies the effectiveness of the instance-switching strategy as data augmentation for object detection, and we attribute it to the increased diversity and preserved visual contextual coherence inherent in our instance-switching strategy.
To further verify the effect of our instance-switching (IS) strategy, we conduct a statistical analysis. Specifically, we randomly pick images with swappable instances (i.e., and ) from the training () and validation () sets of MS COCO 2017, respectively. Then, we perform IS times on and . The image sets containing switched instances of and are indicated by and , respectively. As listed in Table 3, Faster R-CNN trained on misses 76 and 117 more instances on and than on and , respectively. In contrast, Faster R-CNN trained on reduces the number of missed instances, specifically on and . These results indicate that our IS changes the context of switched instances, thereby increasing the diversity of samples and causing the pre-trained detector to miss switched instances, an effect that is well suppressed by training the detector on .
| Training set | Detector | AP | AP50 | AP75 | APS | APM | APL | AR1 | AR10 | AR100 | ARS | ARM | ARL |
| | Mask R-CNN | 39.4 | 61.0 | 43.3 | 23.1 | 43.7 | 51.3 | 32.3 | 51.5 | 54.3 | 34.9 | 58.7 | 68.5 |
4.1.2 Instance Balance Enhancement
| Trained on | 146 | 222 (+76) | 614 | 731 (+117) |
| Trained on | 133 (-13) | 142 (-80) | 607 (-7) | 634 (-97) |
Our PSIS introduces two strategies to enhance instance balance: the selective re-sampling strategy and the class-balanced loss. As shown in Figure 3 (b), our selective re-sampling strategy helps generate an approximately uniform training dataset , which has the same size (i.e., 118k) as and . As listed in Table 1, brings 1.4% and 0.3% gains over and in terms of mAP under , respectively. Combined with the class-balanced loss , detection performance is further improved to 29.0%. By introducing instance balance enhancement into our instance-switching strategy, we obtain a 0.6% improvement over , which performs instance-switching in a non-uniform manner, indicating that instance balance is an important property of a dataset for better detection. Furthermore, we evaluate the effect of the parameter in the class-balanced loss . In general, indicates re-weighting instances in a much larger inverse proportion to the instance number. As illustrated in Figure 4, achieves the best result, outperforming (i.e., the classical softmax loss) by 0.3%. A smaller () leads to a performance decline. Therefore, we set in the following experiments.
4.1.3 Progressive Instance-switching
To take class importance into account, we propose a progressive instance-switching method for data augmentation. As listed at the bottom of Table 1, our progressive manner () brings a 0.7% gain in terms of mAP under , demonstrating that consideration of class importance is helpful for augmenting data for object detection. Note that our PSIS achieves in total a 2.4% improvement over the original dataset with a strong detector (e.g., Faster R-CNN + ResNet-101). In addition, our progressive instance-switching (6) involves two parameters, i.e., the number of classes and the image percentage to be augmented. Figure 5 shows the results of progressive instance-switching with various and . From it we can see that with obtains the best performance. Furthermore, generating training images for more classes with larger proportions leads to a performance decline, which may be attributed to the breaking of instance balance and the over-fitting caused by over-sampling. Finally, our involves about 283k training images in total, including of 118k, of 118k and of 47k.
4.2 Apply to State-of-the-art Detectors
In the above subsection, we generated an augmented dataset based on Faster R-CNN. Then, we directly employ this dataset to train four state-of-the-art detectors (i.e., FPN , Mask R-CNN , BlitzNet  and SNIPER ), and report results on the test server for comparison with other augmentation methods.
4.2.1 PSIS for FPN
We first apply our to FPN  with the RoI Alignment layer  under the Faster R-CNN framework . We employ ResNet-101 as the backbone model, and train FPN following the same settings as in . Specifically, we train FPN on for 12 epochs, and set the initial learning rate to , which is decreased by a factor of 10 after 8 and 11 epochs. Since contains more images, FPN is trained on for 18 epochs, with the learning rate decreased after 13 and 16 epochs. Here we compare with two widely used data augmentation methods, i.e., horizontal flipping and training-time augmentation .
The comparison results are given in Table 2, from which we can see that our PSIS and horizontal flipping bring and gains over the non-augmented one, respectively. We further combine our PSIS with horizontal flipping, achieving 39.8% mAP under , outperforming horizontal flipping and the non-augmented one by and , respectively. We attribute these gains to the increased sample diversity inherent in our PSIS. Besides, we also compare with the training-time augmentation method (i.e., training epochs). Note that our PSIS without training-time augmentation is superior to the training-time augmentation method by 0.4%. By additionally performing training-time augmentation, our PSIS achieves a further 0.8% improvement. The above results clearly demonstrate that our PSIS is superior and complementary to horizontal flipping and training-time augmentation.
4.2.2 PSIS for Mask R-CNN
We evaluate our PSIS using Mask R-CNN , which also exploits mask information to improve object detection accuracy while achieving state-of-the-art performance. Here, we employ ResNet-101 as the backbone model, and train Mask R-CNN using  with the same settings as FPN. We compare our PSIS with two augmentation manners, i.e., segmentation masks and training-time augmentation. The results are compared in Table 2, where we can see that Mask R-CNN trained with our consistently outperforms the one trained with the original dataset under all evaluation metrics, even though Mask R-CNN exploits instance masks as well. In particular, our PSIS brings 1.3% gains under , demonstrating the complementarity between the mask branch of Mask R-CNN and our PSIS. For training-time augmentation, is better than with 2 times the training epochs. Combined with training-time augmentation, PSIS is further improved by 0.5%. The above results again show that our PSIS is superior and complementary to training-time augmentation.
4.2.3 PSIS for SNIPER
SNIPER  is a recently proposed high-performance detector that handles object detection in an efficient multi-scale manner. To verify the effectiveness of our PSIS under a multi-scale training strategy, we employ ResNet-101 as the backbone model and train SNIPER on and following the experimental settings in , except for batch size and training epochs. Due to limited computing resources, we use half the batch size and double the training epochs of the original settings, and we do not use negative chip mining for . As compared in Table 2, SNIPER trained on our achieves a 0.8% gain over the one trained with the original training dataset, showing that the proposed PSIS is complementary to the multi-scale strategy and can be flexibly applied to multi-scale training/testing detectors.
| Training set | Detector | AP | AP50 | AP75 | APS | APM | APL | AR1 | AR10 | AR100 | ARS | ARM | ARL |
| | Mask R-CNN | 40.2 | 61.2 | 44.4 | 23.2 | 44.1 | 51.8 | 33.0 | 52.3 | 54.9 | 35.5 | 59.2 | 69.2 |
| Training set | Method | AP | AP50 | AP75 | APS | APM | APL | AR1 | AR10 | AR100 | ARS | ARM | ARL |
| | Mask R-CNN | 35.9 | 57.7 | 38.4 | 19.2 | 39.7 | 49.7 | 30.5 | 47.3 | 49.6 | 29.7 | 53.8 | 65.8 |
4.2.4 PSIS for BlitzNet
Recently, Dvornik et al. proposed a context-based data augmentation (Context-DA) method for object detection , where a real-time single-shot detector (dubbed BlitzNet ) is used to evaluate the performance of Context-DA. For comparison with Context-DA , we also apply our PSIS to BlitzNet. Following the settings in , we train BlitzNet using ResNet-50 as the backbone model, while setting the batch size and training epochs to half of and 2 times the original ones, respectively. Note that the original BlitzNet runs on MS COCO 2014, so we rebuild our based on the training set of MS COCO 2014. The results are compared in Table 7. Clearly, both Context-DA and our PSIS improve over the original dataset . Meanwhile, our PSIS outperforms Context-DA by 2.8% under . We attribute this improvement to PSIS preserving contextual coherence in the original images and benefiting from more appropriate visual context.
4.3 Longer Training Time on
As described in Section 4.2, our PSIS exploits more training epochs. To evaluate their effect, we also train FPN  and Mask R-CNN  on the original set for a longer time. Specifically, we train the detectors for 18 (same as ) and 44 (considering the effects of both more data and longer training) epochs on , indicated by and , respectively. The results are given in Table 4, where our PSIS () outperforms both and . Moreover, brings little gain over , indicating that more training epochs lead to over-fitting on . Our PSIS avoids over-fitting by increasing the diversity of samples and thus benefits from longer training time.
Due to limited computing resources, we use half the batch size and double the training epochs of the original settings when training SNIPER and BlitzNet on . To assess this effect, we report the results of SNIPER and BlitzNet trained on with the same settings in Table 10 and Table 11: they obtain 43.2% and 27.2% at IoU=[0.5:0.95], respectively. For SNIPER, our PSIS (44.2%) achieves a 1% gain; for BlitzNet, our PSIS (30.8%) achieves a 3.6% gain.
4.4 Generalization to Instance Segmentation
We verify the generalization ability of our PSIS on the instance segmentation task of MS COCO 2017. To this end, we train Mask R-CNN  (ResNet-101) following exactly the same settings as in Section 4.2.2. As shown in Table 5, our PSIS improves over the original training dataset by and under within and training epochs, respectively. PSIS achieves a further gain in the training-time augmentation manner. Mask R-CNN exploits instance masks to simultaneously perform detection and segmentation through a multi-task loss for better performance. Differently, our PSIS employs instance masks to augment training data, and the results clearly show that our PSIS offers a new and complementary way to use instance masks for improving both detection and segmentation performance. Besides, the above results indicate that our PSIS is independent of the pre-defined detector, and generalizes well to various detectors and tasks.
4.5 Generalization to Small-scale PASCAL VOC
We further verify the generalization ability of PSIS by applying Faster R-CNN  trained on MS COCO to the test set of PASCAL VOC 2007  without any fine-tuning. Here, we train Faster R-CNN on full MS COCO (/) or a subset of MS COCO (/); the latter only contains the 20 classes shared with PASCAL VOC. As shown in Table 6, our PSIS obtains 1.3% and 1.6% gains in terms of mAP, compared with the models trained on the original datasets and , respectively. These results demonstrate that the improvement achieved by PSIS on MS COCO generalizes to other datasets.
Furthermore, we directly apply PSIS to PASCAL VOC . Following the same experimental settings as Context-DA , we use the PASCAL VOC 2012 training set equipped with segmentation annotations (1464 images) for training and the test set of PASCAL VOC 2007 for testing. Faster R-CNN  and BlitzNet  are used for evaluation. The results are compared in Table 8. For Faster R-CNN, our PSIS obtains 1.2% gains over the original training set . For BlitzNet, PSIS outperforms and Context-DA  () by 2.2% and 0.9%, respectively. These improvements clearly show that our PSIS generalizes well to different datasets.
In this paper, we proposed a simple yet effective data augmentation method for object detection, whose core is a progressive and selective instance-switching (PSIS) strategy for synthetic image generation. The proposed PSIS enjoys several merits as data augmentation for object detection, i.e., increased sample diversity, preservation of contextual coherence in the original images, no requirement for external datasets, and consideration of instance balance and class importance. Experimental results demonstrate the effectiveness of our PSIS against existing data augmentation methods, including horizontal flipping and training-time augmentation for FPN, segmentation masks and training-time augmentation for Mask R-CNN, the multi-scale training strategy for SNIPER, and Context-DA for BlitzNet. The improvements on both object detection and instance segmentation suggest that our PSIS has the potential to improve the performance of other applications (e.g., keypoint detection), which will be investigated in future work.
-  M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
-  S. Azadi, D. Pathak, S. Ebrahimi, and T. Darrell. Compositional GAN: Learning conditional image composition. arXiv:1807.07560, 2018.
-  M. Buda, A. Maki, and M. A. Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249–259, 2018.
-  N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: synthetic minority over-sampling technique. JAIR, 16:321–357, 2002.
-  K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, C. C. Loy, and D. Lin. mmdetection. https://github.com/open-mmlab/mmdetection, 2018.
-  X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.
-  B. Cheng, Y. Wei, H. Shi, S. Chang, J. Xiong, and T. S. Huang. Revisiting pre-training: An efficient training method for image classification. arXiv:1811.09347, 2018.
-  Y. Cui, M. Jia, T.-Y. Lin, Y. Song, and S. Belongie. Class-balanced loss based on effective number of samples. arXiv:1901.05555, 2019.
-  T. DeVries and G. W. Taylor. Dataset augmentation in feature space. arXiv:1702.05538, 2017.
-  T. DeVries and G. W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv:1708.04552, 2017.
-  Q. Dong, S. Gong, and X. Zhu. Imbalanced deep learning by minority class incremental rectification. IEEE T-PAMI, 2018.
-  N. Dvornik, J. Mairal, and C. Schmid. Modeling visual context is key to augmenting object detection datasets. In ECCV, 2018.
-  N. Dvornik, J. Mairal, and C. Schmid. On the importance of visual context for data augmentation in scene understanding. arXiv:1809.02492, 2018.
-  N. Dvornik, K. Shmelkov, J. Mairal, and C. Schmid. BlitzNet: A real-time deep network for scene understanding. In ICCV, 2017.
-  D. Dwibedi, I. Misra, and M. Hebert. Cut, paste and learn: Surprisingly easy synthesis for instance detection. In ICCV, 2017.
-  M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 2010.
-  C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. DSSD: Deconvolutional single shot detector. arXiv:1701.06659, 2017.
-  G. Georgakis, A. Mousavian, A. C. Berg, and J. Kosecka. Synthesizing training data for object detection in indoor scenes. arXiv:1702.07836, 2017.
-  G. Gkioxari, R. Girshick, P. Dollár, and K. He. Detecting and recognizing human-object interactions. In CVPR, 2018.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
-  A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic data for text localisation in natural images. In CVPR, 2016.
-  H. He and E. A. Garcia. Learning from imbalanced data. IEEE T-K&DE, (9):1263–1284, 2008.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
-  K. He, G. Gkioxari, P. Dollár, and R. B. Girshick. Mask R-CNN. IEEE T-PAMI, 2018.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  C. Huang, Y. Li, C. Change Loy, and X. Tang. Learning deep representation for imbalanced classification. In CVPR, 2016.
-  G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
-  H. Inoue. Data augmentation by pairing samples for images classification. arXiv:1801.02929, 2018.
-  S. H. Khan, M. Hayat, M. Bennamoun, F. A. Sohel, and R. Togneri. Cost-sensitive learning of deep feature representations from imbalanced data. IEEE T-NNLS, 29(8), 2018.
-  S. Kumra and C. Kanan. Robotic grasp detection using deep convolutional neural networks. In IROS, 2017.
-  J.-F. Lalonde and A. A. Efros. Using color compatibility for assessing image realism. In ICCV, 2007.
-  D. Lee, S. Liu, J. Gu, M.-Y. Liu, M.-H. Yang, and J. Kautz. Context-aware synthesis and placement of object instances. In NeurIPS, 2018.
-  T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
-  T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In ICCV, 2017.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
-  R. Ranjan, V. M. Patel, and R. Chellappa. Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE T-PAMI, 41(1), 2019.
-  T. Remez, J. Huang, and M. Brown. Learning to segment via cut-and-paste. In ECCV, 2018.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
-  N. Sarafianos, X. Xu, and I. A. Kakadiaris. Deep imbalanced attribute classification using visual attention aggregation. In ECCV, 2018.
-  A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, 2017.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
-  B. Singh, M. Najibi, and L. S. Davis. SNIPER: Efficient multi-scale training. In NeurIPS, 2018.
-  C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, 2017.
-  R. Takahashi, T. Matsubara, and K. Uehara. Data augmentation using random image cropping and patching for deep cnns. arXiv:1811.09030, 2018.
-  A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.
-  X. Wang, A. Shrivastava, and A. Gupta. A-Fast-RCNN: Hard positive generation via adversary for object detection. In CVPR, 2017.
-  Y. Wang, W. Gan, W. Wu, and J. Yan. Dynamic curriculum learning for imbalanced data classification. arXiv:1901.06783, 2019.
-  H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. ICLR, 2017.
-  Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang. Random erasing data augmentation. arXiv:1708.04896, 2017.
-  J.-Y. Zhu, P. Krahenbuhl, E. Shechtman, and A. A. Efros. Learning a discriminative model for the perception of realism in composite images. In ICCV, 2015.
6 Supplementary Material
| Training set | Detector | AP | AP@.5 | AP@.75 | AP (S) | AP (M) | AP (L) | AR@1 | AR@10 | AR@100 | AR (S) | AR (M) | AR (L) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | Faster R-CNN (ResNet-50) | 26.2 | 47.4 | 26.8 | 8.3 | 29.7 | 41.8 | 25.5 | 37.9 | 39.0 | 16.0 | 45.0 | 59.5 |
|  | Faster R-CNN (ResNet-101) | 27.3 | 48.6 | 27.5 | 8.8 | 30.4 | 43.1 | 25.8 | 38.5 | 39.3 | 16.3 | 44.5 | 60.2 |
|  | Faster R-CNN (ResNet-152) | 28.0 | 49.0 | 28.8 | 9.2 | 31.3 | 44.9 | 26.6 | 39.7 | 40.7 | 18.0 | 46.9 | 62.9 |
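The twelve numeric columns in the table above follow the standard COCO evaluation protocol: average precision over IoU thresholds 0.5:0.95, at IoU 0.5 and 0.75, AP by object area, and average recall at 1/10/100 detections and by area. As a point of reference, the all-point-interpolated average precision underlying these numbers can be sketched as follows (a minimal illustration, not the evaluation code used in the paper):

```python
import numpy as np

def average_precision(recalls, precisions):
    """All-point interpolated AP from a precision-recall curve,
    the building block of the COCO AP columns above."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Precision envelope: make precision monotonically non-increasing.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Integrate precision over the recall steps.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

The reported AP is this quantity averaged over classes and, for the first column, additionally averaged over IoU thresholds from 0.5 to 0.95.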
In this supplementary file, we provide more qualitative and quantitative analysis for our proposed PSIS. Specifically, we further verify the effect of progressive instance-switching, and provide more examples of synthetic images generated by our IS strategy. Moreover, we evaluate the effect of backbone models with various depths. Finally, we assess the effects of training settings (i.e., batch-size, epochs and negative chip mining) on BlitzNet  and SNIPER .
6.1 Illustration of Examples of Synthetic Images
Switching and Annotation of Attached Instances.
Given a quadruple  in the candidate set , our IS strategy switches the instances to obtain a new synthetic image pair , . However, as shown in Figure 6, some instances (e.g., person) may be attached to a switched instance (e.g., bed). To preserve contextual coherence, our IS also switches and annotates these attached instances during generation of the synthetic images.
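To make the switching step concrete, the core operation can be sketched as below. This is a simplified illustration under our own assumptions (axis-aligned boxes instead of segmentation masks, nearest-neighbor resizing, hypothetical function names), not the exact implementation used in the paper:

```python
import numpy as np

def resize_nn(patch, out_h, out_w):
    """Nearest-neighbor resize of an HxWxC patch (a simplified stand-in
    for proper mask-aware warping)."""
    h, w = patch.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return patch[rows[:, None], cols]

def switch_instances(img_a, box_a, img_b, box_b):
    """Swap two same-class instance regions between a pair of images.
    Boxes are (x0, y0, x1, y1); each patch is resized to fit the other
    image's box so the pasted scale stays plausible."""
    out_a, out_b = img_a.copy(), img_b.copy()
    xa0, ya0, xa1, ya1 = box_a
    xb0, yb0, xb1, yb1 = box_b
    patch_a = img_a[ya0:ya1, xa0:xa1]
    patch_b = img_b[yb0:yb1, xb0:xb1]
    out_a[ya0:ya1, xa0:xa1] = resize_nn(patch_b, ya1 - ya0, xa1 - xa0)
    out_b[yb0:yb1, xb0:xb1] = resize_nn(patch_a, yb1 - yb0, xb1 - xb0)
    return out_a, out_b
```

In this simplified view, an instance attached to a switched one (e.g., the person on the bed in Figure 6) is included in the switched region, and its annotation moves along with it.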
Comparison with Context-DA .
We also compare the synthetic images generated by Context-DA with those generated by our instance-switching strategy. The Context-DA images are copied from the original paper , and we select images generated by our instance-switching strategy that share similar scenes with those of Context-DA. As illustrated in Figure 7, our instance-switching strategy better preserves the contextual coherence of the original images than Context-DA, and the resulting synthetic images have better visual authenticity.
Examples of Synthetic Images Generated by Our IS
Here we show some examples of synthetic images generated by our IS strategy. As illustrated in Figure 10, the new (switched) instances are denoted by red boxes; our instance-switching strategy clearly preserves the contextual coherence of the original images.
| Training set | Detector | AP | AP@.5 | AP@.75 | AP (S) | AP (M) | AP (L) | AR@1 | AR@10 | AR@100 | AR (S) | AR (M) | AR (L) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | SNIPER (ResNet-101) w/o neg | 43.2 | 62.5 | 48.7 | 28.3 | 45.6 | 56.7 | 34.7 | 59.4 | 65.3 | 49.5 | 70.0 | 77.1 |
|  | SNIPER (ResNet-101) with neg | 46.4 | 67.2 | 52.3 | 30.7 | 49.5 | 58.8 | 36.9 | 62.1 | 68.2 | 52.1 | 73.3 | 82.0 |
6.2 Further Analysis of Gains for Each Class
Our progressive instance-switching mainly focuses on the classes with the lowest APs. To further verify its effect, we conduct experiments using Faster R-CNN (ResNet-101) on the MS COCO dataset, following the settings in Section 4.1. Figure 8 (a) and (b) show the AP and gain ratio of each class before and after progressive instance-switching, respectively. The gain ratios are computed as (AP_after - AP_before)/AP_before, where AP_before and AP_after denote the AP of each class before and after progressive instance-switching. The results clearly demonstrate that progressive instance-switching brings much larger gain ratios for the classes with the lowest APs. Meanwhile, it has no side effect on the classes with the highest APs, and even slightly improves their performance.
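The per-class selection and gain-ratio computation can be sketched as follows (a minimal illustration with made-up AP values; the class names and values are hypothetical, not results from the paper):

```python
def gain_ratios(ap_before, ap_after):
    """Relative per-class gain: (AP_after - AP_before) / AP_before."""
    return {c: (ap_after[c] - ap_before[c]) / ap_before[c] for c in ap_before}

def lowest_ap_classes(ap_before, k):
    """Pick the k classes with the lowest AP, the focus of the
    progressive switching step."""
    return sorted(ap_before, key=ap_before.get)[:k]
```

A small relative gain on an already strong class (e.g., person) thus weighs far less than the same absolute gain on a weak class, which is why the progressive step targets the lowest-AP classes first.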
Progressive and Selective Instance-Switching.
To verify the effect of PSIS, we further analyze the AP gain of each class brought by PSIS using Faster R-CNN (ResNet-101) on the MS COCO dataset. Figure 9 (a) and (b) show the AP of each class on  (27.3%) and the gain of each class brought by  (29.7%), respectively. The gains are computed as AP_PSIS - AP_orig, where AP_orig and AP_PSIS denote the AP of each class for  and , respectively. The results clearly indicate that our PSIS brings improvements for every class.
6.3 Effect of Backbone Models with Various Depths
In this part, we evaluate the effect of backbone models on our PSIS method. To this end, we conduct experiments using Faster R-CNN  with three networks of different depths: ResNet-50 , ResNet-101  and ResNet-152 . Faster R-CNN with each backbone is trained on two datasets (i.e.,  and ), and results on the validation set are reported for comparison. We employ the same settings as in Section 4.1 (ResNet-101 as backbone) to train Faster R-CNN with ResNet-50 and ResNet-152. The results are summarized in Table 9, where our  achieves ,  and  gains in terms of  over the original  under ResNet-50, ResNet-101 and ResNet-152, respectively. These results demonstrate that our PSIS improves detection performance across backbone models of various depths. It is worth mentioning that Faster R-CNN with ResNet-50 trained on our  is superior to the one with ResNet-101 trained on , and Faster R-CNN + ResNet-101 +  outperforms Faster R-CNN + ResNet-152 + . This again shows the effectiveness of our PSIS across backbones of various depths.
6.4 Effects of Training Settings on BlitzNet and SNIPER
Half of Batch-size and 2 Times Epochs.
Due to limited computing resources, we use half the batch-size and double the training epochs of the original settings when training both BlitzNet  and SNIPER . Here we evaluate the effect of these modified settings. To this end, we use our training settings (i.e., half batch-size and double epochs) to re-train BlitzNet and SNIPER on the original training sets of MS COCO 2014 and MS COCO 2017, respectively. The corresponding method is indicated by . As compared in Tables 10 and 11,  and  achieve very comparable results under all evaluation metrics, clearly showing that our modified training settings have little effect on the performance of BlitzNet and SNIPER. Therefore, we can attribute the improvement of  over  to the effectiveness of our PSIS method.
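As a quick arithmetic sanity check of what this modification changes, the following sketch compares the two schedules (the dataset size, batch sizes, and epoch counts below are hypothetical placeholders, not the paper's actual values):

```python
import math

def schedule_stats(dataset_size, batch_size, epochs):
    """Total gradient updates and training images seen under a schedule."""
    iters_per_epoch = math.ceil(dataset_size / batch_size)
    return {"updates": iters_per_epoch * epochs,
            "images_seen": dataset_size * epochs}
```

Halving the batch size leaves the images seen per epoch unchanged, so doubling the epochs doubles the total images seen while each update averages over fewer samples; the comparable results under the two schedules indicate this trade-off has little effect in practice here.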
Negative Chip Mining for SNIPER.
Besides, we implement SNIPER  on our  with negative chip mining, which is very time-consuming given our limited computing resources. As listed in Table 10, SNIPER with our training settings (i.e., half batch-size and double epochs) trained on the original dataset achieves results very comparable to those reported in the original paper . When negative chip mining is employed, SNIPER trained on our  achieves  and  gains over the one with  and , respectively, achieving state-of-the-art performance on the MS COCO 2017 dataset. These results demonstrate that our proposed PSIS is also complementary to the negative chip mining strategy.