Stitcher: Feedback-driven Data Provider for Object Detection

by Yukang Chen, et al.

Object detectors commonly vary in quality across scales, and performance on small objects is the least satisfying. In this paper, we investigate this phenomenon and discover that, in the majority of training iterations, small objects barely contribute to the total loss, causing poor performance through imbalanced optimization. Inspired by this finding, we present Stitcher, a feedback-driven data provider that aims to train object detectors in a balanced way. In Stitcher, images are resized into smaller components and then stitched into images of the same size as regular ones. Stitched images inevitably contain smaller objects, which benefits our core idea: exploiting the loss statistics as feedback to guide the next-iteration update. Experiments have been conducted on various detectors, backbones, training periods, datasets, and even on instance segmentation. Stitcher steadily improves performance by a large margin in all settings, especially for small objects, with nearly no additional computation in either the training or the testing stage.





1 Introduction

Deep object detector performance varies in a complicated way. A key challenge in deep networks for object detection is large scale variation, which often shows up as difficulty in detecting small objects. For instance, in the result of FPN with ResNet-50 [11, 13] released in [1] (AP: 36.7%, AP_s: 21.1%, AP_m: 39.9%, AP_l: 48.1%), accuracy on small objects is nearly half of that on medium-sized or large objects, due to imbalanced optimization across scales. This problem depresses the overall performance and hinders the generality of object detectors in diverse scenes. A reasonable explanation is that supervisory signals on small objects are insufficient.

Figure 1: Ratio of loss from small objects across training iterations. For the Faster R-CNN baseline, in more than 50% of iterations, small objects contribute less than 10% to the overall loss. Loss distribution gets balanced when Stitcher is adopted
Figure 2: Accuracy versus Training hours on COCO with Faster R-CNN. In various settings, Stitcher improves performance by about 2% AP with nearly no extra training time, while multi-scale training is inferior and requires more cost

Supervisory signals are naturally reflected by the training loss. We study the distributions of loss for different scales over iterations and show them in Fig. 1. The statistics are computed on a common baseline, i.e., Faster R-CNN with ResNet-50 [11] and FPN [13] as backbones on the Microsoft COCO dataset [14]. Objects in small, medium and large scales are defined according to their sizes. Specifically, in iteration $t$, the loss for small objects, $L_s^t$, accounts for ground-truth boxes whose sizes are smaller than 1,024 pixels (32 × 32), following the COCO protocol. $r_s^t$ denotes the ratio of $L_s^t$ against the total loss in the current iteration. It is noticeable that in more than 50% of iterations, $r_s^t$ is negligible (less than 0.1), as shown in Fig. 1. This lack of knowledge on small objects leads to the imbalance and the corresponding poor performance.

In this paper, we propose Stitcher, a feedback-driven data provider that enhances the performance of object detection by utilizing the training loss in a feedback manner. In Stitcher, we introduce stitched images that have the same size as the regular ones and consist of resized smaller components. The core idea is to leverage loss statistics in the current iteration as feedback to adaptively determine the input choice for the next. As illustrated in Fig. 3, if the ratio of loss for small objects is negligible in the current iteration $t$, the input to iteration $t+1$ is the stitched images, in which smaller objects are inevitably more abundant. Otherwise, the input remains regular images under the default setting. Image stitching mitigates the image-level imbalance in the input feature space over the primitive data distribution. Simultaneously, the feedback paradigm alleviates the unfair optimization. Both contribute to more balanced object detection.

In experiments, we verify the effectiveness of Stitcher on various detection frameworks (Faster R-CNN, RetinaNet), backbones (ResNets, ResNeXts), training schedules (1×, 2×), datasets (COCO, VOC) and even on instance segmentation. In all these settings, our method improves accuracy by a large margin, as shown in Fig. 2, especially for small objects. As Stitcher also involves images at multiple scales, we also compare Stitcher with multi-scale training. The latter requires longer training time, yet its performance is inferior.

Stitcher can be easily incorporated into any detector. It imposes nearly no extra burden during both training and inference. Additional costs are only due to loss statistics computation and image stitching, which are almost negligible compared to much heavier forward and backward propagation.

In the following, we first analyze existing problems in Section 2 and then introduce our method in Section 3. In Section 4, the relation of Stitcher to previous literature is discussed. Experimental results are presented in Section 5.

Figure 3: The pipeline illustration. Whether to use stitched images in the next iteration is adaptively determined by the current feedback

2 Problem Analysis

Object detector performance varies dramatically across scales. In this section, we explain this phenomenon through experimental analysis.

2.1 Image Level Analysis

Quantitative Analysis. Small objects are very common in natural images, but their distribution across different images is not predictable. As illustrated in Table 1, 41.4% of objects in the COCO training set are small objects, many more than in the other two scales. However, only 52.3% of images contain small objects. In contrast, the proportions for medium and large objects are 70.7% and 83.0% respectively. Put differently, in some images most objects are small, while nearly half of the images contain no small objects at all. Such severe imbalance hampers the training process.
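The two rows of Table 1 measure different things: the share of boxes per scale and the share of images that contain at least one box of that scale. The following is a minimal sketch of how such statistics could be computed. It assumes boxes are given as (width, height) pairs and uses the COCO area convention (small up to 32², medium up to 96²); the 96² boundary and the names `scale_of`, `scale_statistics`, and `toy_images` are illustrative and do not come from the text.

```python
# Area boundaries in pixels^2. SMALL_MAX matches the 1,024 threshold quoted in the text;
# MEDIUM_MAX is the usual COCO convention and is an assumption here.
SMALL_MAX = 32 * 32
MEDIUM_MAX = 96 * 96
SCALES = ("small", "medium", "large")


def scale_of(width, height):
    """Assign a box to a scale bin according to its area."""
    area = width * height
    if area <= SMALL_MAX:
        return "small"
    if area <= MEDIUM_MAX:
        return "medium"
    return "large"


def scale_statistics(images):
    """images: list of images, each given as a list of (width, height) boxes.

    Returns (box_ratio, image_ratio):
      box_ratio[s]   = fraction of all boxes that fall in scale s
      image_ratio[s] = fraction of images containing at least one box of scale s
    """
    box_counts = {s: 0 for s in SCALES}
    image_counts = {s: 0 for s in SCALES}
    for boxes in images:
        present = set()
        for w, h in boxes:
            s = scale_of(w, h)
            box_counts[s] += 1
            present.add(s)
        for s in present:
            image_counts[s] += 1
    total_boxes = sum(box_counts.values())
    box_ratio = {s: box_counts[s] / total_boxes for s in SCALES}
    image_ratio = {s: image_counts[s] / len(images) for s in SCALES}
    return box_ratio, image_ratio


# Toy data (hypothetical): small boxes are numerous but concentrated in one image.
toy_images = [
    [(10, 10), (20, 15), (12, 8)],  # three small boxes
    [(200, 150)],                   # one large box
    [(50, 60), (300, 300)],         # one medium and one large box
    [(40, 40)],                     # one medium box
]
box_ratio, image_ratio = scale_statistics(toy_images)
```

In this toy set, small boxes are the most frequent scale by box count yet appear in the fewest images, which is the same kind of mismatch the table reports for COCO.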

Qualitative Analysis. In regular images, objects are often blurred due to photographic issues, e.g., out of focus or motion blur. If regular images are resized to be smaller, the medium-sized or large objects inside also become small ones, whose contours and details nevertheless remain clearer than those of the original small objects. In Fig. 4, the ball, which is resized from a larger scale, is clearer than the kite, although the two have similar sizes.

The above analysis inspires the component stitching in Section 3.1.

Figure 4: Qualitative comparison between natural small objects (kites) and those (ball) within smaller images obtained by re-scaling. The resized ball is visually clearer than the kite in texture, although they share similar sizes

                              Small   Mid    Large
Ratio of total boxes (%)      41.4    34.3   24.3
Ratio of images included (%)  52.3    70.7   83.0
Table 1: Distribution of objects in different scales on the MS-COCO training set

2.2 Training Level Analysis

In this section, the scale issue at the training level is analyzed through loss statistics. We use the trainval35k split of the COCO dataset for training and the minival split for evaluation. The ImageNet pre-trained ResNet-50 with FPN serves as the backbone. We train Faster R-CNN with a 1× training period (90k iterations). All training settings, including learning rate, momentum, weight decay and batch size, directly follow the default values. During training, we record the loss over three scales in each training iteration. Based on these statistics, Fig. 1 illustrates the loss distributions over various scales.

Small objects are unevenly distributed over images, which makes training suffer from a further imbalance problem. Even if small objects are included in some images, they may still be ignored during training. Fig. 1 illustrates that, in more than 50% of iterations, small objects account for less than 10% of the total loss. Training losses are dominated by large and medium-sized objects. Thus, the supervisory signals for small objects are insufficient, which severely harms small-object accuracy and even the overall performance. This phenomenon motivates the selection paradigm in Stitcher, which is introduced in Section 3.2.

Figure 5: Regular images and stitched images. (a) A batch of regular images as training inputs, with shape $(n, c, h, w)$; (b) A batch of stitched images, with shape $(n, c, h, w)$, in each of which four small images are stitched along the spatial dimensions; (c) A batch of stitched images, with shape $(kn, c, h/\sqrt{k}, w/\sqrt{k})$, where images are concatenated along the batch dimension $n$. A small batch size is used for visualization

3 Stitcher

According to the previous analysis, imbalance across scales stems from the distributions at the image level and gets worse in the training stage. Inspired by this finding, we delve into this ubiquitous issue in object detection: how to relieve the scale imbalance, especially for small objects. To this end, a novel training strategy, called Stitcher, is presented in this section. It consists of two stages, at the image level and the training level, corresponding to the analysis in Section 2.

3.1 Image Level Operations - Component Stitching

Referring to the statistics exhibited in Table 1, nearly half of the images in the training set contain no small object. Such an image-granularity imbalance severely disturbs the mini-batch level optimization process. To resolve this, we propose Stitcher, a self-adaptive data generator which produces either stitched images or regular images dynamically, guided by the penalization signals.

Given a handful of images resized to a unified resolution $(w, h)$, a stitched image is constructed by scaling and collaging $k$ ($k = 1, 2^2, 3^2, \ldots$) natural images together such that the aspect ratio of each component is preserved, i.e., each component has size $(w/\sqrt{k}, h/\sqrt{k})$. Keeping the aspect ratio is generally acknowledged to retain the original object properties. Trivially, the stitched image reduces to a natural image when $k$ is 1. Taking a stitching order of $k = 4$, we visualize an example in Fig. 5 (b). Image stitching alleviates the scale imbalance of an image batch (the minimal training entity) by manufacturing more small objects. Since stitched images have the same size as regular ones, no additional computation is introduced in network propagation. Unless otherwise specified, experiments with Stitcher use the images in Fig. 5 (b).
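To make the spatial stitching concrete, here is a minimal sketch for a stitching order of 4. It assumes four equally sized single-channel images stored as nested lists, boxes in (x1, y1, x2, y2) format, and nearest-neighbor down-scaling by taking every second pixel. The function names and the toy data are hypothetical; this is an illustration of the idea, not the authors' implementation.

```python
def downscale_nearest(image, factor):
    """Nearest-neighbor down-scaling of a 2-D list image by an integer factor."""
    return [row[::factor] for row in image[::factor]]


def scale_box(box, factor, x_offset, y_offset):
    """Shrink a box by `factor` and shift it to its cell in the stitched image."""
    x1, y1, x2, y2 = box
    return (x1 / factor + x_offset, y1 / factor + y_offset,
            x2 / factor + x_offset, y2 / factor + y_offset)


def stitch_spatial(images, boxes_per_image):
    """Stitch 4 equally sized images into a 2x2 grid with the original size."""
    assert len(images) == 4
    h, w = len(images[0]), len(images[0][0])
    assert h % 2 == 0 and w % 2 == 0
    small = [downscale_nearest(img, 2) for img in images]
    # Concatenate rows side by side, then stack the two halves vertically.
    top = [small[0][r] + small[1][r] for r in range(h // 2)]
    bottom = [small[2][r] + small[3][r] for r in range(h // 2)]
    stitched = top + bottom
    # Top-left corner of each component inside the stitched image.
    offsets = [(0, 0), (w / 2, 0), (0, h / 2), (w / 2, h / 2)]
    stitched_boxes = []
    for boxes, (dx, dy) in zip(boxes_per_image, offsets):
        stitched_boxes += [scale_box(b, 2, dx, dy) for b in boxes]
    return stitched, stitched_boxes


# Toy example: four 4x4 images, each filled with its own index.
images = [[[i] * 4 for _ in range(4)] for i in range(4)]
boxes = [[(0, 0, 4, 4)], [(0, 0, 2, 2)], [], [(2, 2, 4, 4)]]
stitched, stitched_boxes = stitch_spatial(images, boxes)
```

The stitched image keeps the original 4 × 4 size, while every box side is halved, so each box area shrinks by a factor of 4. This is how medium-sized and large objects are turned into smaller ones.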

The implicit square constraint on the stitching number ($k = 1, 2^2, 3^2, \ldots$) remains a major concern. To make it more flexible, we provide another implementation in which images are stitched along the batch dimension, so that a regular batch tensor of shape $(n, c, h, w)$ becomes $(kn, c, h/\sqrt{k}, w/\sqrt{k})$. Overall, the number of pixels in the tensor remains unchanged, whereas the stitching number $k$ is no longer restricted to perfect squares. A corresponding example is illustrated in Fig. 5 (c). For the detailed choice of the stitching order $k$, please refer to Section 5.4. Stitcher provides data with consistent tensor volume but dynamic batch size, which generalizes conventional multi-scale training (fixed batch size).
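The claim that the tensor volume stays constant can be checked with a few lines of arithmetic. The sketch below assumes that each component is down-scaled by a factor of the square root of the stitching number per side and that non-integer sizes are rounded, which is an assumption of this example; `batch_stitched_shape` and `pixel_count` are hypothetical helper names.

```python
import math


def batch_stitched_shape(n, c, h, w, k):
    """Shape of a batch-stitched tensor: k times more images, each sqrt(k) times smaller per side."""
    side = math.sqrt(k)
    return (k * n, c, round(h / side), round(w / side))


def pixel_count(shape):
    """Total number of values in a tensor of the given (n, c, h, w) shape."""
    n, c, h, w = shape
    return n * c * h * w


regular = (2, 3, 800, 1200)
shape_k4 = batch_stitched_shape(*regular, k=4)
shape_k5 = batch_stitched_shape(*regular, k=5)

# Relative deviation in pixel count caused by rounding when k is not a perfect square.
deviation_k5 = abs(pixel_count(shape_k5) - pixel_count(regular)) / pixel_count(regular)
```

For a perfect square such as 4 the pixel count matches exactly; for other values such as 5 it matches up to rounding, which is why the batch-dimension variant removes the square constraint.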

Figure 6: Loss distribution over different scales for Stitcher and Faster R-CNN Res-50-FPN
Figure 7: Performance curve over training iterations

3.2 Training Level Module - Selection Paradigm

Stitched images potentially contain more small objects, but when to exploit them remains to be decided. Recall from Fig. 1 that, in more than 50% of iterations, the loss from small objects accounts for less than 10% of the total. To avoid this undesirable trend, we propose a rectified paradigm that determines the input of the next iteration based on feedback from the current pass. If the loss from small objects is negligible (its ratio is below a threshold $\tau$) in iteration $t$, we assume that knowledge on small objects is far from enough. To compensate for the lack of information, we adopt the stitched images as input to iteration $t+1$. Otherwise, regular images are picked.
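The selection rule described above amounts to a one-line decision per iteration. The sketch below illustrates it with a fixed list of small-object loss ratios; in real training the ratio would come from the detector's loss, so the numbers here are purely hypothetical. The default threshold of 0.1 follows the value discussed in the ablation study, and starting with regular images in the first iteration is an assumption of this example.

```python
def choose_next_input(small_loss_ratio, threshold=0.1):
    """Pick the input type for the next iteration from the current feedback."""
    return "stitched" if small_loss_ratio < threshold else "regular"


def simulate(ratios, threshold=0.1):
    """Return the input type used in each iteration, given the per-iteration ratios.

    The first iteration has no feedback yet, so it uses regular images (assumption).
    The ratio observed in iteration t decides the input of iteration t + 1.
    """
    schedule = ["regular"]
    for ratio in ratios[:-1]:
        schedule.append(choose_next_input(ratio, threshold))
    return schedule


# Hypothetical ratios observed in four consecutive iterations.
observed_ratios = [0.05, 0.30, 0.08, 0.12]
schedule = simulate(observed_ratios)
```

Each decision uses only the ratio from the immediately preceding iteration, so no history needs to be stored beyond one scalar.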

To calculate the ratio of the small-object loss among all scales, we use the following procedure. Strictly speaking, the scale of an object is determined by its mask area, which is only available in segmentation tasks. For generic object detection, however, ground-truth masks are unavailable, so we use the box area instead. As in Eq. (1), for object $o$, its area $a_o$ is approximately represented by the box area, where $w_o$ and $h_o$ denote the box width and height. In Eq. (2), $L_s^t$ denotes the loss in iteration $t$ from small objects, whose area is no higher than $A_s$ (1,024 in the protocol of COCO). The proportion of the small-object loss is obtained with Eq. (3), where $L^t$ is the total loss in iteration $t$:

$$a_o \approx w_o \times h_o, \tag{1}$$

$$L_s^t = L^t_{a_o \le A_s}, \tag{2}$$

$$r_s^t = \frac{L_s^t}{L^t}. \tag{3}$$

The ratio serves as latent feedback to guide the learning of the next iteration. This strategy balances the loss distribution for better optimization. We visualize the comparison of loss distributions in Fig. 6 and the performance difference in Fig. 7. We measure the statistics every 10k iterations and plot them with smoothing. The results reveal that loss distributions over various scales become more balanced with Stitcher, which leads to better accuracy.
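The ratio of small-object loss can be sketched as follows. This example assumes that a loss value can be attributed to each ground-truth box (for instance, the regression loss of the proposals matched to it), which is a simplification made for illustration; `box_area`, `small_loss_ratio`, and the toy values are hypothetical names and data.

```python
SMALL_AREA_MAX = 1024  # 32 x 32, the COCO small-object threshold quoted in the text


def box_area(box):
    """Area of a box given as (x1, y1, x2, y2), used in place of the mask area."""
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)


def small_loss_ratio(boxes, losses, area_max=SMALL_AREA_MAX):
    """Fraction of the total loss that comes from boxes with area <= area_max."""
    total = sum(losses)
    if total == 0:
        return 0.0
    small = sum(loss for box, loss in zip(boxes, losses) if box_area(box) <= area_max)
    return small / total


# Toy iteration: two small boxes (areas 400 and 1,024) and one large box (area 10,000).
toy_boxes = [(0, 0, 20, 20), (0, 0, 100, 100), (10, 10, 42, 42)]
toy_losses = [0.2, 1.5, 0.3]
ratio = small_loss_ratio(toy_boxes, toy_losses)
```

Here the small boxes contribute 0.5 out of a total loss of 2.0, so the ratio is 0.25 and, with a threshold of 0.1, the next iteration would keep using regular images.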

3.3 Time Complexity

Stitcher operates only during training and therefore adds no burden at inference time. Here we elaborate on its time complexity during training.

Stitcher is composed of component stitching and the selection paradigm. In the stitching part, nearest neighbor interpolation is utilized to down-scale images. As the maximum side of original images in COCO is 640 pixels, the interpolation operation requires a number of multiplications only on the order of millions (M). This is negligible compared to the forward/backward propagation of detection networks. For example, it costs ResNet-50 3.8G FLOPs (multiplications and additions) to process a 224 × 224 image. In the selection paradigm, we need to calculate the area of each selected ground-truth box. However, only a few boxes remain each time, so this step costs negligible computation.
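The order of magnitude of this comparison can be reproduced with simple arithmetic. The sketch below assumes an upper bound of one multiplication per pixel value of a 640 × 640 three-channel image; this per-pixel cost model is an assumption for illustration and is not stated in the text. The 3.8G figure is the ResNet-50 cost quoted above for a single small classification-sized image, so a detection-sized input would make the gap even larger.

```python
MAX_SIDE = 640          # maximum image side in COCO, as stated in the text
CHANNELS = 3            # RGB input
RESNET50_FLOPS = 3.8e9  # FLOPs quoted in the text for one ResNet-50 forward pass

# Assumed upper bound: one multiplication per pixel value of the largest image.
interp_mults = MAX_SIDE * MAX_SIDE * CHANNELS

# Interpolation cost relative to a single backbone forward pass.
relative_cost = interp_mults / RESNET50_FLOPS

print(f"interpolation: {interp_mults / 1e6:.2f}M multiplications")
print(f"relative to ResNet-50 forward pass: {relative_cost:.2e}")
```

Under these assumptions the interpolation cost is well below one thousandth of a single backbone forward pass.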

In addition to the theoretical analysis, we also measure the running time of Stitcher in practice. When stitched images are required, the iteration takes approximately 0.02 seconds longer than a regular training iteration. In terms of total training time, it takes about 8.7 hours to train the baseline, Faster R-CNN with a ResNet-50-FPN backbone. Timed on the same GPUs, Stitcher takes about a quarter of an hour longer. This gap shrinks when larger backbones are applied.

4 In Context of Literature

Our work is related to previous work in several aspects. We discuss the relations and, more importantly, the differences in the following.

Multi-scale Image Pyramid [2]. This is a traditional and intuitive way to remedy large scale variation. It has been popular since the era of hand-crafted features, e.g., SIFT [16] and HOG [6]. Nowadays, deep CNN-based object detectors can also benefit from multi-scale training and testing, where images are randomly resized into different resolutions. Features learned in this way are more robust to scale variation.

Similar to multi-scale training, Stitcher is also devised to make features more robust to scale variation. However, there are two essential differences.
(1) Stitcher demands neither image pyramid construction nor input size adjustment. Stitched images still have the same size as regular ones, which greatly relieves the computation burden that is otherwise inevitable in image pyramids.
(2) Object scales in Stitcher are adaptively determined by the loss distribution over training iterations. In image pyramids, on the contrary, images of different sizes are randomly chosen in each iteration. This leads to notably better performance than multi-scale training, which is demonstrated by the experimental results in Section 5.2.

SNIP and SNIPER [17, 18]. These methods are advanced versions of the image pyramid strategy. SNIP [17] was proposed to normalize the scales of objects in multi-scale training. Regions of Interest (RoIs) that fall outside the specified range at each scale are considered invalid. SNIPER [18] utilizes patches as training data, instead of regular images. It crops selected regions around the ground-truth instances as positive chips and samples background as negative chips.

The operation of Stitcher is essentially different from that of SNIPER. The crop operation in SNIPER is much more complicated, as it requires calculating the overlaps (IoU) between ground-truth boxes and crops for label assignment. In contrast, the stitch operation in Stitcher only involves interpolation and concatenation. Besides, as SNIP and SNIPER rely on multi-scale testing, they still suffer from increased inference time. As an ablation on the effect of testing strategies, we compare their performance with Stitcher in Section 5.2.

Mixup [20]. The mixup technique was first introduced in image classification [20] to alleviate adversarial perturbation. Afterward, it was evaluated in object detection [21]. It blends image pixels and merges ground-truth labels with an empirical mixup ratio for adjustment. In terms of operations, stitching and mixup are both concise.

In terms of performance, Stitcher is superior to mixup. As shown in [21], mixup improves the baseline (Faster R-CNN/ResNet-101/2×) by 0.2% AP without mixup during pre-training, and by 1.2% AP with mixup in both the pre-training and fine-tuning phases. In contrast, Stitcher improves the same baseline by 2.3% AP without being applied to the pre-training stage.

Auto Augmentation [4]. This method learns augmentation strategies in a data-dependent way [4, 22]. The search space contains several pre-defined augmentation policies. It utilizes search algorithms, e.g., reinforcement learning, to optimize data augmentation policies. Afterward, the top-performing ones are used to re-train networks.

Auto Augmentation [22] costs thousands of GPU days for the offline search process, which completely diverges from our goal. The performance of Stitcher, nevertheless, is similar to that of Auto Augmentation [22] (+1.6% by Auto Augmentation vs. +1.7% by Stitcher on Faster R-CNN with a ResNet-50 backbone). Also, auto augmentation methods usually involve much more complicated transformations, e.g., color operations (equalization, brightness) and geometric operations (rotation, shearing).

Scale-aware Network. Instead of manipulating images, another line of effort on handling scale variation is to design scale-aware neural networks, which usually fall into one of two categories: feature pyramid based and dilation based methods. Among feature pyramid methods, SSD [15] detects objects of different scales, taking as input the feature maps at the corresponding layer. FPN [13] introduces lateral connections to build high-level semantic feature maps at all scales. On the other hand, dilation based methods adapt the receptive fields for objects. Deformable convolution networks (DCN) [5] generalize dilated convolution to adaptive receptive field learning. Trident network [12] constructs multiple branches with various dilation rates to generate scale-specific features.

5 Experiments

5.1 Implementation Details

Experiments are performed on the COCO detection dataset [14], which involves 80 categories of objects. We train networks on the union of the primitive training set (containing 80k images) and a subset (35k images) of the primitive validation set (trainval35k), and evaluate on a 5k subset of validation images (minival).

Following the common practice [10], backbone networks are first pre-trained for image classification on the ImageNet dataset [7]. Subsequently, these models, further equipped with head sub-networks, are fine-tuned on COCO with object detection supervision. For convenience, we adopt publicly available pre-trained models. Our implementation is based on maskrcnn-benchmark [1].

In the training stage, input images are resized such that the shorter side has 800 pixels. We train networks on 8 GPUs (RTX 2080 Ti) using synchronized SGD. Unless otherwise specified, there are 2 images per GPU in each mini-batch, giving a total batch size of 16. Training settings directly follow the commonly used configurations. We set the weight decay to 0.0001, the momentum to 0.9, and the initial learning rate to 0.02. In the 1× training period, there are 90k iterations in total, and the learning rate is divided by 10 at 60k and 80k iterations. In the 2× training period, we double the total iteration number to 180k and move the learning rate decay points correspondingly to 120k and 160k.
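The step schedule described above can be written as a small function. This is a minimal sketch that covers only the step decay stated in the text; any warm-up phase that a particular codebase may apply at the start of training is deliberately ignored, and the function name `learning_rate` and the `period` argument are hypothetical.

```python
def learning_rate(iteration, period=1, base_lr=0.02):
    """Step learning-rate schedule for a training period of `period` x 90k iterations.

    The rate is divided by 10 at 60k * period and again at 80k * period iterations.
    """
    decay_points = (60_000 * period, 80_000 * period)
    lr = base_lr
    for point in decay_points:
        if iteration >= point:
            lr /= 10
    return lr


# A few sample points for the 1x and 2x schedules.
samples = {
    ("1x", 0): learning_rate(0, period=1),
    ("1x", 60_000): learning_rate(60_000, period=1),
    ("1x", 85_000): learning_rate(85_000, period=1),
    ("2x", 100_000): learning_rate(100_000, period=2),
    ("2x", 170_000): learning_rate(170_000, period=2),
}
```

Scaling both decay points with the period keeps the relative shape of the schedule identical between the 1× and 2× settings.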

For evaluation, we adopt the standard mean average precision (mAP) metric. Average precisions (APs) for objects of small, medium, and large sizes (namely AP_s, AP_m, and AP_l) [14] are also reported.

5.2 Evaluation on Object Detection

In this section, we compare Stitcher with common baselines and its competitors, including multi-scale training and SNIP/SNIPER.

Method    Period  Backbone     AP           AP_s         AP_m         AP_l
Baseline  1×      Res-50-FPN   36.7         21.1         39.9         48.1
Stitcher  1×      Res-50-FPN   38.6 (+1.9)  24.4 (+3.3)  41.9 (+2.0)  49.3 (+1.2)
Baseline  1×      Res-101-FPN  39.1         22.6         42.9         51.4
Stitcher  1×      Res-101-FPN  40.8 (+1.7)  25.8 (+3.3)  44.1 (+1.2)  51.9 (+0.5)
Baseline  2×      Res-50-FPN   37.7         21.6         40.6         49.6
Stitcher  2×      Res-50-FPN   39.9 (+2.2)  25.1 (+3.5)  43.1 (+2.5)  51.0 (+1.4)
Baseline  2×      Res-101-FPN  39.8         22.9         43.3         52.6
Stitcher  2×      Res-101-FPN  42.1 (+2.3)  26.9 (+4.0)  45.5 (+2.2)  54.1 (+1.5)
Table 2: Comparison with common baselines on Faster R-CNN

Method    Period  Backbone     AP           AP_s         AP_m         AP_l
Baseline  1×      Res-50-FPN   35.7         19.5         39.9         47.5
Stitcher  1×      Res-50-FPN   37.8 (+2.1)  22.1 (+2.6)  41.6 (+1.7)  48.9 (+1.4)
Baseline  1×      Res-101-FPN  37.7         20.6         41.8         50.8
Stitcher  1×      Res-101-FPN  39.9 (+2.2)  24.7 (+4.1)  44.1 (+2.3)  51.8 (+1.0)
Baseline  2×      Res-50-FPN   36.8         20.2         40.0         49.7
Stitcher  2×      Res-50-FPN   39.0 (+2.2)  23.4 (+3.2)  42.9 (+2.9)  51.0 (+1.2)
Baseline  2×      Res-101-FPN  38.8         21.1         42.1         52.4
Stitcher  2×      Res-101-FPN  41.3 (+2.5)  25.4 (+4.3)  45.1 (+3.0)  54.0 (+1.6)
Table 3: Comparison with common baselines on RetinaNet

Comparison with common baselines. We evaluate the effect of Stitcher with different detectors (Faster R-CNN and RetinaNet) and various training periods (1× and 2×), and show the results in Tables 2 and 3. In all these cases, accuracy steadily improves when Stitcher is used, especially for small objects. In detail, we observe the following.
(1) 2× training yields a larger increase than 1× training (+1.9% vs. +2.2% AP on 1× and 2× training respectively) on Faster R-CNN with ResNet-50.
(2) In most cases, the performance gain does not decay as the backbone is enlarged from ResNet-50 [11] to ResNet-101, except for Faster R-CNN with 1× training.
In short, the experimental results demonstrate that Stitcher brings consistent improvement over the baselines and is robust to various settings (backbones, detection heads, and training periods).

Comparison with Multi-scale Training. Table 4 presents the comparison with the multi-scale training technique, i.e., randomly selecting a scale from [600, 700, 800, 900, 1000] to resize the shorter side of images. Stitcher yields better performance than multi-scale training. Apart from the performance gain, the following observations also deserve discussion.
(1) In terms of accuracy, the advantage of Stitcher over multi-scale training comes largely from small scales, while the two have approximately equal ability in detecting large objects. This contrast confirms that image stitching achieves its design purpose of mainly benefiting small object detection.
(2) Stitcher is computationally more economical than multi-scale training. Both are trained on the same GPUs (RTX 2080 Ti). However, for the same training period, multi-scale training takes more time than Stitcher.
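For reference, the multi-scale baseline described above can be sketched as follows. The example assumes that the shorter side is resized to a randomly chosen target while the aspect ratio is kept; the cap on the longer side that detection codebases commonly apply is omitted for brevity, and the function names are hypothetical.

```python
import random

SCALES = [600, 700, 800, 900, 1000]  # candidate shorter-side lengths from the text


def resize_shorter_side(width, height, target):
    """Output size after resizing so that the shorter side equals `target`."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def multi_scale_size(width, height, rng):
    """Sample one of the candidate scales and return the resized image size."""
    return resize_shorter_side(width, height, rng.choice(SCALES))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
fixed_size = resize_shorter_side(640, 480, 800)
sampled_sizes = [multi_scale_size(640, 480, rng) for _ in range(5)]
```

Unlike Stitcher, the input resolution here changes from iteration to iteration, so the per-iteration cost varies with the sampled scale.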

Method       Period  Backbone     Hours  AP           AP_s         AP_m         AP_l
Multi-scale  1×      Res-50-FPN   10.5   37.2         21.6         40.3         48.6
Stitcher     1×      Res-50-FPN   9.0    38.6 (+1.4)  24.4 (+2.8)  41.9 (+1.6)  49.3 (+0.7)
Multi-scale  1×      Res-101-FPN  14.2   39.7         23.6         43.3         51.3
Stitcher     1×      Res-101-FPN  11.7   40.8 (+1.1)  25.8 (+2.2)  44.1 (+0.8)  51.9 (+0.6)
Multi-scale  2×      Res-50-FPN   20.5   39.1         23.5         42.2         50.8
Stitcher     2×      Res-50-FPN   17.5   39.9 (+0.8)  25.1 (+1.6)  43.1 (+0.9)  51.0 (+0.2)
Multi-scale  2×      Res-101-FPN  28.5   41.6         25.5         45.3         54.1
Stitcher     2×      Res-101-FPN  23.5   42.1 (+0.5)  26.9 (+1.4)  45.5 (+0.2)  54.1 (+0.0)
Table 4: Comparison with multi-scale training on Faster R-CNN

Method    Backbone    AP    AP_50  AP_75  AP_s  AP_m  AP_l
SNIP      Res-50-C4   43.6  65.2   48.8   26.4  46.5  55.8
SNIPER    Res-50-C4   43.5  65.0   48.6   26.1  46.3  56.0
Stitcher  Res-50-C4   44.2  64.6   48.4   28.7  47.2  58.3
SNIP      Res-101-C4  44.4  66.2   49.9   27.3  47.4  56.9
SNIPER    Res-101-C4  46.1  67.0   51.6   29.6  48.9  58.1
Stitcher  Res-101-C4  46.9  67.5   51.4   30.9  50.5  60.9
Table 5: Comparison with SNIP and SNIPER on Faster R-CNN

Comparison with SNIP and SNIPER. As shown in Table 5, we compare Stitcher with SNIP and SNIPER on Faster R-CNN with ResNet-50/101. For fair comparison, we use the same augmentation strategies as SNIP and SNIPER, which include deformable convolution, multi-scale testing, and soft-NMS [3]. Stitcher performs slightly better. Both SNIPER and Stitcher can be viewed as forms of multi-scale training, but there are some distinct differences. First, Stitcher is simpler to implement, whereas SNIPER requires label assignment, valid range tuning and positive/negative chip selection. Second, Stitcher is feedback-driven, so the optimization process focuses more on its shortcomings.

Backbone           Method    AP           AP_50  AP_75  AP_s  AP_m  AP_l
ResNeXt 101        Baseline  41.6         63.8   45.3   24.8  45.1  53.3
ResNeXt 101        Stitcher  43.1 (+1.5)  65.6   47.4   28.0  46.7  54.2
ResNet 101 + DCN   Baseline  42.3         64.3   46.3   24.8  46.1  55.7
ResNet 101 + DCN   Stitcher  43.3 (+1.0)  65.6   47.2   27.1  47.0  56.0
ResNeXt 101 + DCN  Baseline  44.1         66.5   48.4   26.8  47.5  57.8
ResNeXt 101 + DCN  Stitcher  45.4 (+1.3)  68.0   49.7   29.4  48.8  58.5
Table 6: Evaluation on large backbones on Faster R-CNN

Evaluation on Large Backbones. Table 6 shows the improvement from Stitcher on large backbones, namely ResNeXt-101 [19], ResNet-101 with DCN [5] and ResNeXt-32×8d-101 with DCN [5]. Experiments are conducted on Faster R-CNN with a 1× training period. On top of these higher baselines, Stitcher still increases the performance by 1.0% to 1.5% AP, which demonstrates the robustness of Stitcher in more complicated cases.

Method    Period  AP    AP_50  AP_75  AP_s  AP_m  AP_l
Baseline  1×      36.7  58.4   39.6   21.1  39.8  48.1
Baseline  2×      37.7  59.2   41.0   21.6  40.6  49.6
Baseline  4×      37.3  58.1   40.1   20.3  39.6  50.1
Baseline  6×      35.6  55.9   38.4   19.8  37.7  47.6
Stitcher  1×      38.6  60.5   41.8   24.4  41.9  49.3
Stitcher  6×      40.4  62.5   44.2   26.1  43.1  51.5
Table 7: Evaluation on longer training periods on Faster R-CNN

Category  Baseline  Stitcher
mAP       80.5      81.6
plane     80.6      87.8
bike      85.8      87.10
bird      79.0      78.3
boat      74.0      70.7
bottle    71.7      71.5
bus       86.6      87.2
car       88.7      88.7
cat       88.6      88.9
chair     62.6      64.5
cow       87.7      87.9
table     71.9      78.2
dog       88.1      87.8
horse     88.7      87.8
mbike     86.8      87.3
person    86.1      86.0
plant     56.8      58.7
sheep     85.0      85.1
sofa      78.6      78.4
train     84.8      87.6
tv        78.3      81.7
Table 8: Evaluation on PASCAL VOC dataset on Faster R-CNN

Evaluation on Longer Training Periods. Table 7 shows the evaluation on longer training periods on Faster R-CNN with ResNet-50 and FPN backbones. For 6× training, the performance of the baseline is degraded by over-fitting, while Stitcher still maintains a promising accuracy. The composition of stitched images is not fixed, which enriches data patterns and prevents over-fitting.

In addition, a reasonable question is whether the improvement from Stitcher is caused by the increased number of instances seen in training. The maximum number of instances involved in 1× Stitcher is no more than that in the 4× baseline, yet 1× Stitcher performs better than the baseline in any training period. This verifies that this factor is not the main reason for the improvement.

Evaluation on PASCAL VOC. Although Stitcher is inspired by findings on COCO, it is also effective on other datasets. We evaluate Stitcher on PASCAL VOC [8]. Following the protocol in [9], models are trained on the union of VOC 2007 trainval and VOC 2012 trainval. Evaluation is performed on VOC 2007 test. A total of 24k iterations are performed on 8 GPUs. The learning rate is set to 0.01 for the first two-thirds of the iterations and 0.001 for the remaining one-third. Experiments are conducted on Faster R-CNN with ResNet-50 and FPN. As shown in Table 8, Stitcher brings a 1.1% mAP improvement.

Evaluation on Instance Segmentation. Beyond object detection, we also apply Stitcher to instance segmentation. Experiments are conducted on the COCO instance segmentation track [14]. We report COCO mask AP on the minival split. Models are trained for a 1× training period of 90k iterations. The learning rate is divided by 10 at the 60k-th and 80k-th iterations. Training settings, including learning rate, weight decay, momentum, and pre-trained weights, directly follow the default configuration. As shown in Table 9, with the assistance of Stitcher, mask AP increases by 0.8% on ResNet-50 and by 1.3% on ResNet-101.

Method    Backbone     AP           AP_s         AP_m         AP_l
Baseline  Res-50-FPN   34.3         15.8         36.7         50.5
Stitcher  Res-50-FPN   35.1 (+0.8)  17.0 (+1.2)  37.8 (+1.1)  51.4 (+0.9)
Baseline  Res-101-FPN  35.9         15.9         38.9         53.2
Stitcher  Res-101-FPN  37.2 (+1.3)  19.0 (+3.1)  40.3 (+1.4)  53.7 (+0.5)
Table 9: Evaluation on Mask R-CNN

5.3 Ablation Studies

In this section, we provide an empirical analysis of each component of Stitcher, namely the selection paradigm and the threshold $\tau$. Ablation studies are conducted on Faster R-CNN with ResNet-50 and FPN as backbones in a 1× training period.

Feedback       Selection paradigm       AP    AP_50  AP_75  AP_s  AP_m  AP_l
No feedback    All stitched             32.1  53.0   33.8   21.9  36.4  36.8
No feedback    All common               36.7  58.4   39.6   21.1  39.8  48.1
No feedback    Random sample            37.8  59.5   41.2   23.6  40.7  46.7
With feedback  Input feedback           38.1  60.0   41.2   23.1  41.3  49.1
With feedback  Classification feedback  38.5  60.6   41.9   23.9  41.6  48.8
With feedback  Regression feedback      38.6  60.5   41.8   24.4  41.9  49.3
With feedback  Total loss feedback      38.5  60.6   42.0   23.7  41.6  49.3
Table 10: Ablation study on selection paradigm

Selection Paradigm. To evaluate the selection paradigm in Stitcher, several experiments are set up for comparison as in Table 10.

  • All stitched: stitched images are utilized in all iterations;

  • All regular (listed as "All common" in Table 10): regular images are always used (the common baseline);

  • Random sample: stitched or regular images are randomly sampled;

  • Input feedback: this is a simplified version of Stitcher, where the feedback ratio is calculated on the number of small objects among the batch;

  • Classification/Regression/Total loss feedback: the selection is guided by the corresponding loss as feedback.

As shown in Table 10, if stitched images are used in every iteration, the performance is unacceptable (32.1% AP, well below the 36.7% AP baseline). Image stitching alone does not work and brings no benefit, which shows that the selection paradigm is indispensable in Stitcher. Random sampling can be viewed as a special version of multi-scale training. It performs better than the common baseline, but the gain is smaller than with feedback. If the feedback ratio is based on the input instead of the loss, the accuracy is still higher than the baseline but slightly inferior to Stitcher. Input data is not as powerful as the loss in reflecting the optimization process, because small objects may still be ignored. These comparisons confirm the necessity of the selection paradigm in Stitcher. In addition, Stitcher achieves stable performance no matter which loss serves as feedback, staying around 38.5% to 38.6% AP. For convenience, we pick the regression loss as the common setting.

Figure 8: Ablation study on threshold

Dimension  k  AP    AP_50  AP_75  AP_s  AP_m  AP_l
Spatial    4  38.6  60.5   41.8   24.4  41.9  49.3
Batch      2  38.3  60.2   41.7   22.9  41.3  49.5
Batch      3  38.5  60.5   42.0   22.9  41.8  49.1
Batch      4  38.6  60.6   42.1   23.4  41.5  50.3
Batch      5  38.7  60.8   41.9   23.7  41.6  50.1
Batch      6  38.6  60.7   42.1   23.5  41.5  49.5
Batch      7  38.4  60.5   41.8   23.6  41.5  48.6
Batch      8  38.3  60.6   41.6   24.3  41.3  49.0
Table 11: Concatenation dimensions

Threshold Value. Stitcher introduces only one hyper-parameter, the threshold value for selection. We study its impact in Fig. 8. When the threshold is set below 0.4, Stitcher performs better than the common baseline. Otherwise, the performance rapidly decays to the ‘All stitched’ baseline. This observation indicates that setting the threshold to 0.1 strikes a good balance.
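The role of the threshold can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: it assumes that the share of the loss coming from small objects in the current iteration is compared with the threshold, and that a share below the threshold triggers stitched images in the next iteration. The function name `plan_inputs` and the ratio values are hypothetical, and the ratios are held fixed for illustration even though in training they would depend on the inputs chosen.

```python
def plan_inputs(ratios, threshold=0.1):
    """Return the input type used at each iteration.

    ratios[t] is the (hypothetical) share of the loss coming from small
    objects at iteration t. The first iteration has no feedback yet, so it
    is assumed to use regular images.
    """
    plan = ["regular"]
    for ratio in ratios[:-1]:
        # Assumed rule: too little small-object loss -> stitch next time.
        plan.append("stitched" if ratio < threshold else "regular")
    return plan


ratios = [0.05, 0.30, 0.08, 0.12, 0.02]  # hypothetical feedback values

print(plan_inputs(ratios, threshold=0.1))
# ['regular', 'stitched', 'regular', 'stitched', 'regular']
print(plan_inputs(ratios, threshold=0.0))   # never stitches
print(plan_inputs(ratios, threshold=1.01))  # stitches after the first iteration
```

Under this assumed rule, a threshold of zero reduces to the regular baseline, while a very large threshold selects stitched images in almost every iteration, which is consistent with the decay toward the ‘All stitched’ setting at large thresholds.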

5.4 Concatenation along the Batch Dimension

To make Stitcher flexible, we propose aggregating resized images along the batch dimension instead of the original spatial dimension. Table 11 compares these two implementations for various numbers of images per stitched image. Two observations can be made.
(1) When the number of images is 4, Stitcher achieves the same 38.6% AP with both stitching methods, which indicates that the two implementations are equivalent in this setting.
(2) With other numbers of images, Stitcher still achieves similar performance, and the best result is obtained with 5 images. Stitching along the batch dimension is robust to the number of images, which allows Stitcher to fit different devices.
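One way to see why the two implementations can match is to compare tensor shapes. The sketch below is an illustration under assumptions rather than the authors' code: it assumes an input tensor of shape (n, c, h, w), a 2 × 2 spatial grid for four images, and that batch-dimension stitching shrinks each side by the square root of the number of images so that the pixel count stays roughly constant. The function names and the example batch size are hypothetical.

```python
import math


def spatial_stitch_shape(n, c, h, w):
    """Four half-size images tiled 2 x 2: the tensor shape is unchanged."""
    return (n, c, 2 * (h // 2), 2 * (w // 2))


def batch_stitch_shape(n, c, h, w, k):
    """k shrunken images stacked along the batch axis (assumed 1/sqrt(k) scaling)."""
    scale = 1.0 / math.sqrt(k)
    return (k * n, c, round(h * scale), round(w * scale))


def num_elements(shape):
    """Total number of values in a tensor of the given shape."""
    return math.prod(shape)


regular = (2, 3, 800, 1200)  # hypothetical batch of regular images

for k in (2, 4, 8):
    shape = batch_stitch_shape(*regular, k)
    # Print the image count, the resulting shape, and the relative cost.
    print(k, shape, num_elements(shape) / num_elements(regular))
```

With four images, both variants hold exactly the same number of pixels as the regular batch under these assumptions, and other image counts stay close to that budget up to rounding.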

6 Conclusion

In this paper, we have proposed Stitcher, a simple yet effective data provider for object detection that steadily improves performance by a significant margin. It can be easily applied to various detectors, backbones, training periods, and datasets, and even to other vision tasks such as instance segmentation. Moreover, it requires negligible additional computation during training and does not affect inference time. Extensive experiments have been conducted to verify its effectiveness. We hope Stitcher can serve as a common configuration in the future.


  • [1] maskrcnn-benchmark.
  • [2] Adelson, E.H., Anderson, C.H., Bergen, J.R., Burt, P.J., Ogden, J.M.: Pyramid methods in image processing. RCA engineer 29(6), 33–41 (1984)
  • [3] Bodla, N., Singh, B., Chellappa, R., Davis, L.S.: Soft-nms - improving object detection with one line of code. In: ICCV. pp. 5562–5570 (2017)
  • [4] Cubuk, E.D., Zoph, B., Mané, D., Vasudevan, V., Le, Q.V.: Autoaugment: Learning augmentation policies from data. In: CVPR. pp. 113–123 (2019)
  • [5] Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. In: ICCV. pp. 764–773 (2017)
  • [6] Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR. pp. 886–893 (2005)
  • [7] Deng, J., Dong, W., Socher, R., Li, L., Li, K., Li, F.: Imagenet: A large-scale hierarchical image database. In: CVPR. pp. 248–255 (2009)
  • [8]

    Everingham, M., Eslami, S.M.A., Gool, L.J.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision

    111(1), 98–136 (2015)
  • [9] Girshick, R.B.: Fast R-CNN. In: ICCV. pp. 1440–1448 (2015)
  • [10] Girshick, R.B., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR. pp. 580–587 (2014)
  • [11] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. pp. 770–778 (2016)
  • [12] Li, Y., Chen, Y., Wang, N., Zhang, Z.: Scale-aware trident networks for object detection (2019)
  • [13] Lin, T., Dollár, P., Girshick, R.B., He, K., Hariharan, B., Belongie, S.J.: Feature pyramid networks for object detection. In: CVPR. pp. 936–944 (2017)
  • [14] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: ECCV. pp. 740–755 (2014)
  • [15] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.E., Fu, C., Berg, A.C.: SSD: single shot multibox detector. In: ECCV. pp. 21–37 (2016)
  • [16] Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
  • [17] Singh, B., Davis, L.S.: An analysis of scale invariance in object detection SNIP. In: CVPR. pp. 3578–3587 (2018)
  • [18] Singh, B., Najibi, M., Davis, L.S.: SNIPER: efficient multi-scale training. In: NeurIPS. pp. 9333–9343 (2018)
  • [19] Xie, S., Girshick, R.B., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR. pp. 5987–5995 (2017)
  • [20] Zhang, H., Cissé, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. In: ICLR (2018)
  • [21] Zhang, Z., He, T., Zhang, H., Zhang, Z., Xie, J., Li, M.: Bag of freebies for training object detection neural networks. CoRR abs/1902.04103 (2019)
  • [22] Zoph, B., Cubuk, E.D., Ghiasi, G., Lin, T., Shlens, J., Le, Q.V.: Learning data augmentation strategies for object detection. CoRR abs/1906.11172 (2019)