Residual Features and Unified Prediction Network for Single Stage Detection

by   Kyoungmin Lee, et al.
Seoul National University

Recently, many single-stage detectors that use multi-scale features have been proposed. They are much faster than two-stage detectors that use region proposal networks (RPN), without much degradation in detection performance. However, the feature maps in the lower layers close to the input, which are responsible for detecting small objects in a single-stage detector, suffer from insufficient representation power because they are too shallow. There is also a structural contradiction: the feature maps have to deliver low-level information to the next layers while also containing high-level abstraction for prediction. In this paper, we propose a method to enrich the representation power of feature maps using Resblock and deconvolution layers. In addition, a unified prediction module is applied to generalize output results and boost the representation power of earlier layers for prediction. The proposed method enables more precise prediction and achieves higher scores than SSD on PASCAL VOC and MS COCO. It also maintains the fast computation of a single-stage detector, requiring much less computation than other detectors with similar performance. Code is available at


1 Introduction

The development of deep neural networks (DNN) in recent years has achieved remarkable results not only in object detection but also in many other areas. Early research on DNN-based object detection paid much attention to representation learning that can replace hand-crafted features, without much consideration for the speed of detectors. Recently, real-time detectors with low computational complexity have been actively studied.

Research on two-stage detectors, mostly based on Faster R-CNN [21], applied the region proposal network (RPN) and RoI pooling to the feature maps extracted by a state-of-the-art classifier, such as ResNet-101 [10]. On the other hand, single-stage methods such as YOLO [18] and SSD [17] removed the RoI pooling layer and predict bounding boxes and the corresponding class confidences directly, enabling faster detection and end-to-end learning.

Figure 1: Box-in-Box problem. Top: SSD300. Bottom: RUN300 (proposed). SSD detects objects with overlapping boxes which are redundant.

In particular, SSD makes use of multi-scale feature maps generated from a backbone network such as VGG-16 [25] to detect objects of various sizes. Since each prediction module, composed of 3×3 convolution filters, detects bounding boxes on its own layer separately, it cannot reflect appropriate contextual information from other scales. This causes the problem called “Box-in-Box” [11], shown in Figure 1. In the figure, we can see that SSD often detects a single object with two overlapping boxes. The smaller box contains only part of the object, such as the upper body of a person or the head of an animal.

To solve the problem, [6, 15] used ResNet and the feature pyramid network (FPN) [14] structure to inject larger contextual information through a deep convolutional backbone and deconvolution. However, these structures increase the computational complexity and thus reduce detection speed, which is a key advantage of a single-stage detector.

In this paper, we propose very simple ideas to solve the essential problems of multi-scale single-stage detectors. First, we introduce a 3-way residual block, a structure in which the Resblock [10] and a deconvolution layer are added on top of the multi-scale feature maps. It lets detected boxes be determined with larger context and makes them more reliable. Second, we integrate the multiple prediction modules, which had been applied separately to each layer, into one, to boost the information level of feature maps from earlier layers.

The proposed structure, called “RUN; Residual features and Unified prediction Network”, is a single-stage detector that combines the 3-way Resblock with the unified prediction module on the VGG-16 network. RUN is not only very compact and fast compared to ResNet-based two-stage detectors and single-stage detectors using FPN, but it also achieves superior or competitive performance.

Figure 2: Networks of SSD and RUN. Top: SSD. Bottom: RUN. Compared with SSD, RUN has residual blocks and unified prediction module. The arrow from the bottom to the top indicates the deconvolution branch.

2 Related Works

Overfeat [23], SPPNet [9], R-CNN [8], Fast R-CNN [7], Faster R-CNN [21] and R-FCN [13], which are classified as region-based convolutional neural networks (R-CNN), showed a tremendous improvement in performance over previous object detection techniques. These region-based approaches have achieved huge advances over the last few years and are still the state of the art among object detection techniques. Specifically, these approaches usually use a two-stage method that generates a number of bounding boxes and then assigns a classification score to each of them. Thus, although classification may be relatively accurate, they are too slow to be used for real-time applications.

Redmon et al. [18] proposed a method named YOLO that predicts bounding boxes and associated class probabilities in a single step by framing object detection as a regression problem. It divides input images into grid maps and regresses bounding boxes for multiple objects on each grid. This was the beginning of single-stage detection and subsequently inspired structures such as SSD [17]. However, since YOLO uses only the highest-level¹ feature maps to detect objects, it lacks lower-level information, which results in somewhat inaccurate detection, especially for small objects.

¹ In this paper, the term level is used interchangeably with layer. The highest level indicates the layer farthest from the input layer.

In order to solve this problem, SSD [17] utilized not only the highest-level features but also lower-level features that have enough resolution to detect small objects. As mentioned in Inside-Outside Net (ION) [1] and HyperNet [12], feature maps at different layers have different abstraction levels for an input image. Therefore, using multi-scale feature maps can improve detection performance for objects of various scales. In SSD, many default boxes are created in the feature maps, and bounding box regression and classification are performed for each box area using 3×3 convolution. This method enables multi-scale object detection without using RoI pooling. In addition, it can effectively improve the detection accuracy of small objects, which is a weakness of YOLO [18]. However, as mentioned in MS-CNN [2], SSD has the problem that back-propagation allows the gradient to cause unnecessary deformations in the feature maps, since the feature maps of the backbone network are used directly in bounding box regression and classification. This can lead to some instability during learning. In addition, since each classifier uses only single-scale feature maps, it cannot reflect contextual information that is larger or smaller than that of the corresponding scale.

Recently, various methods have attempted to enhance the contextual information of each layer while taking advantage of SSD [17]. DSSD [6] obtained higher accuracy by changing the base network to ResNet-101 [10] and combining an FPN [14] built from deconvolution layers with the existing multi-scale layers to reflect large-scale context. However, with the deep structure of ResNet-101 and the deconvolution layers, the processing speed degrades considerably (under 16.4 images per second), which prevents the method from being used for real-time detection.

Ren et al. [20] introduced a recurrent rolling convolution (RRC) architecture to improve detection performance by letting layers with different sizes of contextual information complement each other. RRC makes multi-scale feature maps include both large and small context by concatenating adjacent feature maps through pooling and deconvolution. This process is implemented with an RNN structure, which allows each feature map to reflect information not only from adjacent feature maps but also from remote ones.

Unlike RRC [20], Rainbow SSD (R-SSD) [11] proposed a method that concatenates feature maps not only from adjacent layers but from all layers for bounding box regression and classification, using pooling and deconvolution. It achieves higher performance than SSD by enhancing the representation power of feature maps. Also, by making the dimension of each layer the same, it becomes possible to use a unified prediction module instead of different prediction modules for different layers. Woo et al. [27] proposed StairNet, which utilizes both the FPN [14] structure on the base VGG-16 network and the unified prediction of R-SSD.

Additionally, Lin et al. [15] redefined the loss term for object detection, naming it Focal Loss. Unlike batch reconstruction methods such as OHEM [24], it resolves the foreground-background imbalance problem by changing the loss term. Their RetinaNet, which uses Focal Loss in combination with ResNet and the FPN [14] structure, achieved state-of-the-art performance.

3 Residual Feature and Unified Prediction Network (RUN)

In this section, we propose residual feature maps and a unified prediction module, and show how adding structurally simple ideas can compensate for the drawbacks of SSD-based single-stage object detection methods.

Figure 3: Residual blocks. Left: 2-way Resblock. Right: 3-way Resblock with deconvolution branch (branch3).

3.1 Residual Feature Maps

Recent CNN models designed for object detection make use of a backbone network that was originally devised to solve image classification problems. Although the detection network can be trained end-to-end, the backbone network is normally initialized with weights trained for image classification. The relation between the features and predictions in networks used for image classification can be expressed mathematically as follows:

$$x_n = F_n(x_{n-1}) = (F_n \circ F_{n-1} \circ \cdots \circ F_1)(I) \tag{1}$$

$$\text{Scores} = P(x_n) \tag{2}$$

where $I$ is an input image, $x_n$ is the $n$-th-level feature map, $P$ is a prediction function, and $F_n$ is a combination of non-linear transformations such as convolution, pooling, ReLU, etc. Here, the top feature map, $x_n$, learns information on high-level abstraction. On the other hand, $x_k$ ($k < n$) has more local and low-level information as $k$ becomes smaller.

SSD [17] applies several feature maps with different scales directly as inputs to separate prediction modules to calculate object positions and classification scores, which can be denoted by the following equation:

$$\text{Detection} = \{P_1(x_{s_1}), P_2(x_{s_2}), \ldots, P_m(x_{s_m})\} \tag{3}$$

where $s_1$ to $s_m$ are feature indices for the source feature maps used for multi-scale prediction, and $P_k$ is a function that outputs multiple objects with different positions and scores. Combining (1) and (3), it can be expressed as

$$\text{Detection} = \{P_1(x_{s_1}), P_2(F_{s_1}^{s_2}(x_{s_1})), \ldots, P_m(F_{s_1}^{s_m}(x_{s_1}))\} \tag{4}$$

where $F_a^b = F_b \circ F_{b-1} \circ \cdots \circ F_{a+1}$. Here, the earlier feature map $x_{s_1}$ needs to learn high-level abstraction to improve the performance of $P_1$. At the same time, it also needs to learn local features for efficient information transfer to the next feature maps. This not only makes learning difficult, but also causes the overall performance to decrease.
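The dual role of the earliest source feature map described above can be illustrated with a toy sketch. It is not the authors' implementation: the 1-D "feature maps", the `halve` stage and the use of `max` as a predictor are hypothetical stand-ins chosen only to show that the first tapped map is both the input of its own predictor and the source from which every later map is computed.

```python
from typing import Callable, List, Sequence

Feature = List[float]
Stage = Callable[[Feature], Feature]
Predictor = Callable[[Feature], float]


def run_backbone(image: Feature, stages: Sequence[Stage]) -> List[Feature]:
    """Return [x_1, ..., x_n], where x_k = F_k(x_{k-1}) and x_0 is the image."""
    features: List[Feature] = []
    x = image
    for stage in stages:
        x = stage(x)
        features.append(x)
    return features


def compose(stages: Sequence[Stage], a: int, b: int) -> Stage:
    """Return F_b o ... o F_{a+1}, the map taking x_a to x_b (1-based levels)."""
    def composed(x: Feature) -> Feature:
        for stage in stages[a:b]:
            x = stage(x)
        return x
    return composed


def ssd_detect(features: Sequence[Feature],
               source_levels: Sequence[int],
               predictors: Sequence[Predictor]) -> List[float]:
    """Apply a separate predictor to each tapped (1-based) feature level."""
    return [p(features[s - 1]) for s, p in zip(source_levels, predictors)]


def halve(x: Feature) -> Feature:
    """Toy stage (hypothetical): average neighbouring pairs, halving resolution."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]


stages = [halve, halve, halve]
image = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0]
features = run_backbone(image, stages)

# Tap levels s_1 = 1 and s_2 = 3; `max` stands in for the two predictors.
detections = ssd_detect(features, [1, 3], [max, max])

# The level-3 map can equally be computed from the level-1 map, so the
# level-1 map feeds both its own predictor and all later feature maps.
x_s2_from_s1 = compose(stages, 1, 3)(features[0])
assert x_s2_from_s1 == features[2]
```

In this toy setting, any gradient from the first predictor would act on the same values that later stages consume, which is the coupling the paragraph above describes.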

To resolve this problem, SSD [17] added an L2 normalization layer between the conv4_3 layer and the prediction module, which reduces the magnitude of the gradients from the prediction module. Cai et al. [2] tried to solve this problem by adding a convolution layer only to the conv4_3 layer. Since the problem is not confined to the conv4_3 layer, these approaches do not solve it fundamentally. To meet the contradictory requirement of maintaining low-level information while having the flexibility to learn high-level abstraction, it is desirable to separate and decouple the backbone network and the prediction module in the training phase.

To solve this problem, we propose a new architecture that decouples the backbone network from the prediction module, as shown in Figure 2. Instead of directly connecting the feature maps in the backbone network to the prediction module, we insert a multi-way Resblock for each level of feature maps, which acts like a bumper. The detailed architecture of the proposed multi-way Resblocks is shown in Figure 3. Convolution layers and nonlinear activation units are used for all branches of the proposed Resblock. This prevents the gradients of the prediction module from flowing directly into the feature maps of the backbone network. It also clearly distinguishes the features to be used for prediction from the features to be delivered to the next layer. In other words, the proposed Resblock takes the role of learning high-level abstraction for object detection, while the backbone network containing low-level features is kept free of high-level detection information. This design improves the feature structure of SSD [17] by forcing the backbone not to learn high-level abstraction and to keep low-level image features.

Also, the earlier layers (e.g. conv4_3) used for small object detection in SSD [17] are very shallow. Therefore, SSD cannot detect small objects well, because the representation power of these layers is insufficient to be used directly for prediction. To compensate for this, we use a 3×3 convolution layer in branch2 of the Resblock, as shown in Figure 3, to reflect the peripheral contextual information.

Branch3 on the right side of Figure 3 contains a deconvolution layer whose input is the feature map of the following layer. This is similar to the structures proposed in [6] and [20], and it is a suitable way to propagate large contextual information to a small-scale feature map, so that information about the surroundings is also used when detecting a small object. This can reduce the cases in which only a part of an actual object is detected, and is therefore a remedy for the box-in-box problem described earlier. The effect is shown intuitively in the right side of Figure 1. Finally, the proposed architecture in Figure 2 can be expressed as follows:

$$\text{Detection} = \{P_1(x'_{s_1}), P_2(x'_{s_2}), \ldots, P_m(x'_{s_m})\} \tag{5}$$

where $x_{s_k}$ is the $k$-th source feature map of the backbone, $x'_{s_k} = B_1(x_{s_k}) + B_2(x_{s_k}) + B_3(x_{s_{k+1}})$ for $k < m$, and $x'_{s_m} = B_1(x_{s_m}) + B_2(x_{s_m})$. Here, $B_1$, $B_2$ and $B_3$ indicate branch1, branch2 and branch3, respectively.
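The three-branch combination described above (branch1 and branch2 acting on the current feature map, branch3 acting on the following, coarser one) can be sketched on toy 1-D feature maps. This is only an illustration under stated assumptions: the real branches are learned convolution, activation and deconvolution layers, whereas here `branch1` is a fixed scaling, `branch2` is a 3-tap mean that mimics a wider receptive field, and `branch3` uses nearest-neighbour upsampling as a stand-in for deconvolution. All function names are hypothetical.

```python
from typing import List, Optional

Feature = List[float]


def branch1(x: Feature, weight: float = 1.0) -> Feature:
    """Stand-in for the pointwise branch: scale every position."""
    return [weight * v for v in x]


def branch2(x: Feature) -> Feature:
    """Stand-in for the 3x3 branch: 3-tap mean with zero padding."""
    padded = [0.0] + list(x) + [0.0]
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3 for i in range(len(x))]


def branch3(x_next: Feature, factor: int = 2) -> Feature:
    """Stand-in for deconvolution: nearest-neighbour upsampling."""
    return [v for v in x_next for _ in range(factor)]


def residual_feature(x: Feature, x_next: Optional[Feature] = None) -> Feature:
    """Sum branch1 and branch2, plus branch3 when a coarser map is given."""
    out = [a + b for a, b in zip(branch1(x), branch2(x))]
    if x_next is not None:
        upsampled = branch3(x_next)
        if len(upsampled) != len(x):
            raise ValueError("upsampled map must match the current resolution")
        out = [o + u for o, u in zip(out, upsampled)]
    return out


x_k = [3.0, 3.0, 3.0, 3.0]      # current-level feature map
x_next = [1.0, 2.0]             # coarser map from the following level

two_way = residual_feature(x_k)             # no deconvolution branch
three_way = residual_feature(x_k, x_next)   # with deconvolution branch
print(two_way, three_way)
```

Because the output is a new tensor computed by the branches, the prediction module never reads the backbone feature map directly, which is the decoupling the section argues for. The sketch assumes an exact factor-of-two resolution change between levels, which does not hold for every pair of levels in an SSD-style pyramid.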

3.2 Unified Prediction Module

Figure 4: Comparison of various object detection schemes: a) R-CNN and its variants need object-wise cropping and the prediction is done by a common unified classifier. b) SSD does not need any cropping but requires a separate classifier for each scale of feature maps. c) R-SSD concatenates feature maps in different layers so that objects in each scale can be predicted with one unified classifier with the same amount of information. d) In the proposed method, Resblock takes the role of feature map concatenation and one unified classifier is used for prediction.

Detecting objects of various sizes has been recognized as an important problem in object detection. Traditionally, [26, 3, 4] used a single classifier to predict multi-scale feature maps extracted from the image pyramid. There is another approach of using multiple classifiers on a single input image. The latter has the advantage of reducing the amount of computation for calculating feature maps. However, it requires an individual classifier for each object scale.

Since neural networks became prevalent, two-stage detectors have applied RoI pooling to the CNN output to extract feature maps of the same size from objects of different sizes. These feature maps were used as the input of a single classifier. Meanwhile, other methods using multi-scale features, such as SSD [17], adopted multiple classifiers, since the feature maps at each scale differed not only in length but also in the underlying contextual information. To train the prediction layers of various scales effectively, objects of various scales must be provided as input. SSD could dramatically increase the detection performance through augmentation that transforms the size of input images.

R-SSD [11] proposed the Rainbow concatenation, which combines feature maps of different scales using pooling and deconvolution. This allows the depth of the input features for each prediction module to be set to the same value. Thus, R-SSD could use a single classifier that shares its weights across the multi-scale prediction modules. Similarly, the proposed 3-way Resblocks force all the feature maps to have the same depth of 256, as shown in Fig. 3. Thus, structurally, it is possible to unify the convolution layers of different prediction modules, as in R-SSD. The idea of the unified prediction module is similar to [11], but our method differs from R-SSD in the information contained in the input feature maps.

This approach makes differently-scaled feature maps have similar level of information. SSD [17] used multiple features of various scales. This results in an improved performance of detecting small objects compared with YOLO [18, 19] which used only the last layer of the back-bone network. However, since its earliest feature map is obtained from much shallower layers than the later feature maps, it still has a limitation of insufficient information for prediction. Because unified prediction applies equally to feature maps of all scales, it forces the output of the 3-way Resblock between the feature map of the backbone and the prediction module to be learned at a similar information level. It means that unified prediction in combination with the residual feature block makes the feature maps in the earliest Resblocks rich in context. A brief summary of different object detection schemes is shown in Fig. 4.
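Weight sharing across scales, as discussed above, only requires that every input feature map has the same channel depth. The following minimal sketch shows one set of predictor weights applied to maps of different spatial sizes. It is an assumption-laden illustration: a per-location linear map (a 1×1-convolution stand-in) replaces the actual prediction layer, and the names `apply_predictor` and `unified_predict` are hypothetical.

```python
from typing import List

Vector = List[float]
FeatureMap = List[Vector]   # one channel vector per spatial location
Weights = List[Vector]      # one row of input weights per output channel


def apply_predictor(weights: Weights, feature_map: FeatureMap) -> FeatureMap:
    """Apply the same linear map at every location of one feature map."""
    depth = len(weights[0])
    outputs: FeatureMap = []
    for location in feature_map:
        if len(location) != depth:
            raise ValueError("channel depth must match the predictor input depth")
        outputs.append([sum(w * v for w, v in zip(row, location)) for row in weights])
    return outputs


def unified_predict(weights: Weights, feature_maps: List[FeatureMap]) -> List[FeatureMap]:
    """Use a single shared weight set for every scale."""
    return [apply_predictor(weights, fm) for fm in feature_maps]


shared_weights = [[1.0, 0.0], [1.0, 1.0]]            # 2 inputs -> 2 outputs
large_map = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]   # 4 locations
small_map = [[10.0, 20.0]]                                      # 1 location

large_out, small_out = unified_predict(shared_weights, [large_map, small_map])
print(large_out, small_out)
```

Since the same weights must work for every scale, training pushes the inputs at all scales toward a comparable level of information, which matches the argument made in the paragraph above.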

4 Experiment

We evaluated the proposed method on the PASCAL VOC 2007 [5], PASCAL VOC 2012 and MS COCO [16] datasets. Our implementation is based on the publicly available SSD [17]². For all the experiments, the reduced VGG-16 model [25] pre-trained on the ILSVRC CLS-LOC dataset [22] is used as the backbone network. For fair comparison, most of the settings are the same as those of SSD except the number of proposals, which differs because we used 6 default boxes in all the prediction layers for unified prediction, while SSD used 4 for the conv4_3 and the top layer, and 6 for the rest.

² All experimental results of SSD are the latest scores with the data augmentation mentioned in [6].
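To give a feel for what "6 default boxes in all the prediction layers" means in terms of the number of proposals, the count can be worked out as below. The feature-map side lengths are an assumption (the commonly used SSD300 sizes of 38, 19, 10, 5, 3 and 1); they are not stated in this paper, and the helper name is hypothetical.

```python
from typing import Sequence

# Assumption: square feature-map side lengths of the six SSD300 prediction layers.
SSD300_MAP_SIZES = (38, 19, 10, 5, 3, 1)


def count_default_boxes(map_sizes: Sequence[int],
                        boxes_per_location: Sequence[int]) -> int:
    """Total default boxes: sum over layers of side * side * boxes per location."""
    if len(map_sizes) != len(boxes_per_location):
        raise ValueError("need one boxes-per-location entry per layer")
    return sum(size * size * n for size, n in zip(map_sizes, boxes_per_location))


locations = sum(size * size for size in SSD300_MAP_SIZES)
uniform_six = count_default_boxes(SSD300_MAP_SIZES, [6] * len(SSD300_MAP_SIZES))
print(locations, uniform_six)
```

Under these assumed sizes, using the same number of boxes at every location also gives every prediction layer the same number of output channels, which is what allows a single shared prediction module.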

| Method | mAP |
|---|---|
| SSD 300 | 77.5 |
| SSD 300 + 2WAY | 78.3 |
| SSD 300 + 2WAY + Unified Pred | 78.6 |
| SSD 300 + 3WAY | 78.8 |
| SSD 300 + 3WAY + Unified Pred | 79.2 |

Table 1: PASCAL 2007 test detection results.

| Method | data | network | mAP | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SSD300 [17] | 07++12 | VGG | 75.8 | 88.1 | 82.9 | 74.4 | 61.9 | 47.6 | 82.7 | 78.8 | 91.5 | 58.1 | 80.0 | 64.1 | 89.4 | 85.7 | 85.5 | 82.6 | 50.2 | 79.8 | 73.6 | 86.6 | 72.1 |
| SSD321 [6] | 07++12 | Residual-101 | 75.4 | 87.9 | 82.9 | 73.7 | 61.5 | 45.3 | 81.4 | 75.6 | 92.6 | 57.4 | 78.3 | 65.0 | 90.8 | 86.8 | 85.8 | 81.5 | 50.3 | 78.1 | 75.3 | 85.2 | 72.5 |
| DSSD321 [6] | 07++12 | Residual-101 | 76.3 | 87.3 | 83.3 | 75.4 | 64.6 | 46.8 | 82.7 | 76.5 | 92.9 | 59.5 | 78.3 | 64.3 | 91.5 | 86.6 | 86.6 | 82.1 | 53.3 | 79.6 | 75.7 | 85.2 | 73.9 |
| R-SSD300 [11] | 07++12 | VGG | 76.4 | 88.0 | 83.8 | 74.8 | 60.8 | 48.9 | 83.9 | 78.5 | 91.0 | 59.5 | 81.4 | 66.1 | 89.0 | 86.3 | 86.0 | 83.0 | 51.3 | 80.9 | 73.7 | 86.9 | 73.8 |
| StairNet [27] | 07++12 | VGG | 76.4 | 87.7 | 83.1 | 74.6 | 64.2 | 51.3 | 83.6 | 78.0 | 92.0 | 58.9 | 81.8 | 66.2 | 89.6 | 86.0 | 84.9 | 82.6 | 50.9 | 80.5 | 71.8 | 86.2 | 73.5 |
| RUN2WAY300 | 07++12 | VGG | 76.2 | 88.4 | 83.2 | 73.7 | 63.0 | 50.2 | 82.6 | 79.2 | 91.3 | 58.6 | 81.4 | 64.8 | 90.0 | 85.9 | 85.4 | 82.8 | 50.5 | 81.3 | 74.0 | 86.1 | 72.4 |
| RUN3WAY300 | 07++12 | VGG | 77.1 | 88.2 | 84.4 | 76.2 | 63.8 | 53.1 | 82.9 | 79.5 | 90.9 | 60.7 | 82.5 | 64.1 | 89.6 | 86.5 | 86.6 | 83.3 | 51.5 | 83.0 | 74.0 | 87.6 | 74.4 |

Table 2: SSD300-based models on PASCAL 2012 test. Trained with 07++12 (07 trainval + 07 test + 12 trainval).

| Method | data | network | mAP | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ION [1] | 07+12+S | VGG | 76.4 | 87.5 | 84.7 | 76.8 | 63.8 | 58.3 | 82.6 | 79.0 | 90.9 | 57.8 | 82.0 | 64.7 | 88.9 | 86.5 | 84.7 | 82.3 | 51.4 | 78.2 | 69.2 | 85.2 | 73.5 |
| Faster [10] | 07++12 | Residual-101 | 73.8 | 86.5 | 81.6 | 77.2 | 58.0 | 51.0 | 78.6 | 76.6 | 93.2 | 48.6 | 80.4 | 59.0 | 92.1 | 85.3 | 84.8 | 80.7 | 48.1 | 77.3 | 66.5 | 84.7 | 65.6 |
| R-FCN [13] | 07++12 | Residual-101 | 77.6 | 86.9 | 83.4 | 81.5 | 63.8 | 62.4 | 81.6 | 81.1 | 93.1 | 58.0 | 83.8 | 60.8 | 92.7 | 86.0 | 84.6 | 84.4 | 59.0 | 80.8 | 68.6 | 86.1 | 72.9 |
| SSD512 [17] | 07++12 | VGG | 78.5 | 90.0 | 85.3 | 77.7 | 64.3 | 58.5 | 85.1 | 84.3 | 92.6 | 61.3 | 83.4 | 65.1 | 89.9 | 88.5 | 88.2 | 85.5 | 54.4 | 82.4 | 70.7 | 87.1 | 75.6 |
| SSD513 [6] | 07++12 | Residual-101 | 79.4 | 90.7 | 87.3 | 78.3 | 66.3 | 56.5 | 84.1 | 83.7 | 94.2 | 62.9 | 84.5 | 66.3 | 92.9 | 88.6 | 87.9 | 85.7 | 55.1 | 83.6 | 74.3 | 88.2 | 76.8 |
| DSSD513 [6] | 07++12 | Residual-101 | 80.0 | 92.1 | 86.6 | 80.3 | 68.7 | 58.2 | 84.3 | 85.0 | 94.6 | 63.3 | 85.9 | 65.6 | 93.0 | 88.5 | 87.8 | 86.4 | 57.4 | 85.2 | 73.4 | 87.8 | 76.8 |
| RUN2WAY512 | 07++12 | VGG | 79.3 | 89.7 | 87.1 | 79.2 | 65.6 | 61.3 | 85.3 | 85.0 | 92.9 | 60.6 | 83.8 | 66.4 | 90.6 | 88.6 | 88.1 | 86.1 | 54.8 | 84.6 | 72.5 | 87.4 | 75.8 |
| RUN3WAY512 | 07++12 | VGG | 79.8 | 90.0 | 87.3 | 80.2 | 67.4 | 62.4 | 84.9 | 85.6 | 92.9 | 61.8 | 84.9 | 66.2 | 90.9 | 89.1 | 88.0 | 86.5 | 55.4 | 85.0 | 72.6 | 87.7 | 76.8 |

Table 3: SSD500-based models and other two-stage detectors on PASCAL 2012 test. Trained with 07++12 (07 trainval + 07 test + 12 trainval).

| Method | data | network | AP, IoU 0.5:0.95 | AP, IoU 0.5 | AP, IoU 0.75 | AP, Area S | AP, Area M | AP, Area L | AR, #Dets 1 | AR, #Dets 10 | AR, #Dets 100 | AR, Area S | AR, Area M | AR, Area L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Faster [21] | trainval | VGG | 21.9 | 42.7 | - | - | - | - | - | - | - | - | - | - |
| ION [1] | train | VGG | 23.6 | 43.2 | 23.6 | 6.4 | 24.1 | 38.3 | 23.2 | 32.7 | 33.5 | 10.1 | 37.7 | 53.6 |
| R-FCN [13] | trainval | Residual-101 | 29.9 | 51.9 | - | 10.8 | 32.8 | 45.0 | - | - | - | - | - | - |
| RetinaNet [15] | trainval | Residual-101 | 39.1 | 59.1 | 42.3 | 21.8 | 42.7 | 50.2 | - | - | - | - | - | - |
| SSD300 [17] | trainval35k | VGG | 25.1 | 43.1 | 25.8 | 6.6 | 25.9 | 41.4 | 23.7 | 35.1 | 37.2 | 11.2 | 40.4 | 58.4 |
| SSD321 [6] | trainval35k | Residual-101 | 28.0 | 45.4 | 29.3 | 6.2 | 28.3 | 49.3 | 25.9 | 37.8 | 39.9 | 11.5 | 43.3 | 64.9 |
| DSSD321 [6] | trainval35k | Residual-101 | 28.0 | 46.1 | 29.2 | 7.4 | 28.1 | 47.6 | 25.5 | 37.1 | 39.4 | 12.7 | 42.0 | 62.6 |
| RUN2WAY300 | trainval35k | VGG | 27.4 | 46.1 | 28.4 | 8.9 | 27.9 | 43.8 | 25.0 | 37.3 | 39.5 | 14.6 | 42.6 | 59.8 |
| RUN3WAY300 | trainval35k | VGG | 28.0 | 47.5 | 28.9 | 9.9 | 28.6 | 43.9 | 25.3 | 38.0 | 40.5 | 16.2 | 43.8 | 60.2 |
| SSD512 [17] | trainval35k | VGG | 28.8 | 48.5 | 30.3 | 10.9 | 31.8 | 43.5 | 26.1 | 39.5 | 42.0 | 16.5 | 46.6 | 60.8 |
| SSD513 [6] | trainval35k | Residual-101 | 31.2 | 50.4 | 33.3 | 10.2 | 34.5 | 49.8 | 28.3 | 42.1 | 44.4 | 17.6 | 49.2 | 65.8 |
| DSSD513 [6] | trainval35k | Residual-101 | 33.2 | 53.3 | 35.2 | 13.0 | 35.4 | 51.1 | 28.9 | 43.5 | 46.2 | 21.8 | 49.1 | 66.4 |
| RUN2WAY512 | trainval35k | VGG | 31.7 | 52.1 | 33.6 | 13.2 | 33.9 | 46.5 | 27.7 | 42.2 | 44.7 | 22.0 | 47.9 | 62.7 |
| RUN3WAY512 | trainval35k | VGG | 32.4 | 53.5 | 34.2 | 14.7 | 34.0 | 46.7 | 28.0 | 43.0 | 45.8 | 24.4 | 48.1 | 63.4 |

Table 4: MS COCO test-dev detection results. AP is average precision and AR is average recall.

Ablation Study on PASCAL VOC2007

We trained our model on VOC2007 trainval and VOC2012 trainval. We set the batch size as 32. For the training of the 2-way model, we used learning rate of initially, then it decreased by a factor of 10 at 80k and 100k iterations, respectively. The training was terminated at 120k iterations. For the 3-way model, we froze all the weights of the pre-trained 2-way model except the prediction module, then fine-tuned the network using the learning rate of for 40k iterations, for the next 20k iterations, and for the final 10k iterations. The end-to-end training was also applied on the 3-way model, but the results were worse than the above training strategy.

Table 1 shows our results on the PASCAL VOC 2007 test set. Here, Unified Pred denotes the proposed unified prediction module; for the models without this label, the prediction modules were trained separately as in the original SSD. As mentioned above, each 3-way model was fine-tuned from the corresponding 2-way model. In this experiment, the 2-way model, which has no deconvolution path, achieved up to 1.1% higher mAP than SSD (78.6% vs. 77.5%, with unified prediction). The 3-way model, which additionally uses deconvolution layers, was up to 0.6% higher than the 2-way model. The unified prediction module brought a larger gain for the 3-way model (78.8% to 79.2%) than for the 2-way model (78.3% to 78.6%).
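The gains quoted for the ablation can be re-derived directly from the Table 1 entries. The short sketch below does only that arithmetic; the dictionary keys are shorthand labels introduced here for illustration, and the differences are in percentage points of mAP.

```python
# mAP (%) on the PASCAL VOC 2007 test set, copied from Table 1.
TABLE1_MAP = {
    "SSD300": 77.5,
    "2WAY": 78.3,
    "2WAY+Unified": 78.6,
    "3WAY": 78.8,
    "3WAY+Unified": 79.2,
}


def gain(model: str, baseline: str) -> float:
    """mAP difference in percentage points, rounded to one decimal place."""
    return round(TABLE1_MAP[model] - TABLE1_MAP[baseline], 1)


comparisons = [
    ("2WAY", "SSD300"),                # Resblock only
    ("2WAY+Unified", "SSD300"),        # Resblock + unified prediction
    ("3WAY", "2WAY"),                  # effect of the deconvolution branch
    ("3WAY+Unified", "2WAY+Unified"),  # same, with unified prediction
    ("2WAY+Unified", "2WAY"),          # effect of unified prediction (2-way)
    ("3WAY+Unified", "3WAY"),          # effect of unified prediction (3-way)
]

for model, baseline in comparisons:
    print(f"{model} vs {baseline}: {gain(model, baseline):+.1f}")
```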

PASCAL VOC 2012

For the VOC 2012 test, we trained models on the 07++12 dataset, which consists of 07 trainval, 07 test and 12 trainval. First, we performed an experiment applying the 2-way Resblock in combination with the unified prediction. Then, another experiment was performed using the 3-way Resblock with unified prediction after freezing the weights of the contained 2-way Resblock.

Table 2 shows the VOC 2012 test results of RUN300 and other models based on SSD300 [17]. The proposed model, RUN300, improves considerably over the base model SSD300. In particular, the 3-way model achieved 77.1% mAP, outperforming the other SSD-based models. In addition, it showed an improvement of 0.7% mAP over StairNet [27], which uses FPN [14] and unified prediction. From this result, we conjecture that the proposed 3-way Resblock is more effective than FPN.

Table 3 shows the results of the RUN512 models and others. The 3-way model achieved 79.8% mAP, which is 1.3% better than SSD512 [17]. It performs slightly worse than DSSD513 [6], probably because the ResNet-101 [10] backbone of DSSD513 produces better features for larger input images than the VGG-16 [25] backbone of SSD and ours.

MS COCO


For fair comparison with SSD [17], most of the hyper-parameters required for training were set to the same as SSD. For training 2-way models, we used a learning rate of for the first 240k iterations, for the next 120k iterations and for the last 40k. For training 3-way models, we used a learning rate of for the first 120k iterations, for the next 60k iterations and for the last 20k, which are exactly half of those for the 2-way models. Other parameters such as scales and aspect ratios of the prior box were identical to those of SSD.

| Method | network | mAP | FPS | GPU |
|---|---|---|---|---|
| Faster R-CNN [21] | VGG16 | 73.2 | 7 | Titan X |
| Faster R-CNN [10] | Residual-101 | 76.4 | 2.4 | K40 |
| R-FCN [13] | Residual-101 | 80.5 | 9 | Titan X |
| SSD300 [17] | VGG16 | 77.5 | 54.5* | Titan X |
| SSD321 [6] | Residual-101 | 77.1 | 16.4 | Titan X |
| DSSD321 [6] | Residual-101 | 78.6 | 11.8 | Titan X |
| R-SSD300 [11] | VGG16 | 78.5 | 37.1* | Titan X |
| StairNet [27] | VGG16 | 78.8 | 30 | Titan X Pascal |
| RUN2WAY300 | VGG16 | 78.6 | 41.8 | Titan X |
| | | | 58.4 | Titan X Pascal |
| RUN3WAY300 | VGG16 | 79.2 | 40.0 | Titan X |
| | | | 56.3 | Titan X Pascal |
| SSD512 [17] | VGG16 | 79.5 | 24.5* | Titan X |
| SSD513 [6] | Residual-101 | 80.6 | 8.0 | Titan X |
| DSSD513 [6] | Residual-101 | 81.5 | 6.4 | Titan X |
| R-SSD512 [11] | VGG16 | 80.8 | 15.8* | Titan X |
| RUN2WAY512 | VGG16 | 80.6 | 20.1 | Titan X |
| | | | 31.8 | Titan X Pascal |
| RUN3WAY512 | VGG16 | 80.9 | 19.5 | Titan X |
| | | | 29.8 | Titan X Pascal |

Table 5: Speed & Accuracy on PASCAL VOC2007 test. * is measured by ourselves.

Table 4 shows the performance of various methods on MS COCO test-dev. Although the proposed methods use a relatively shallow network, VGG-16 [25], they achieve performance comparable to methods that use a very deep network. The fourth column indicates that RUN3WAY300 achieved 2.9% better mAP than SSD300 [17]. This matches the performance of SSD321 and DSSD321 [6], which adopt ResNet-101 [10] as their backbone network. Also, RUN3WAY512 achieved 3.6% better mAP than SSD512. In particular, RUN3WAY512 achieved the highest average precision and recall for small objects among the compared methods except RetinaNet. This suggests that the proposed Resblock is an effective module for enhancing low-level feature maps.

Speed vs Accuracy

Figure 5: Speed vs. Accuracy of recent methods using public numbers on COCO. Our results (sky blue circles) are measured on Titan X. (Best viewed in color.)
Figure 6: Detection examples of RUN300 3-way on the PASCAL VOC 2012 test set compared with the SSD300 model. For each pair, the top is the result of SSD and the bottom is the result of RUN. We show detections with scores higher than 0.6. Each color corresponds to an object category.
Figure 7: Detection examples of RUN300 3-way on the MS COCO test-dev set compared with the SSD300 model. For each pair, the top is the result of SSD and the bottom is the result of RUN. We show detections with scores higher than 0.5. Each color corresponds to an object category.

Single-stage detectors, represented by YOLO [18] and SSD [17], proposed end-to-end neural networks that removed the RoI pooling of two-stage detectors. They achieved large speed improvements, but could not avoid a loss of accuracy. Conversely, recent single-stage detectors have been studied to improve performance at the cost of speed. Unlike these approaches, the proposed RUN is designed to maximize performance at high speed on the VGG-16 [25] backbone, which has significantly fewer layers and parameters than ResNet [10]. The experimental results demonstrate the performance improvement of RUN.

Table 5 shows that our method outperforms other competitors with less loss of speed. Our experiments were run on a Titan X GPU with cuDNN v5.1 and an Intel i7-6700 @ 3.4GHz. For exact comparison, we measured the FPS of some methods in the same environment and marked them with * in the table.

In Figure 5, we show the trade-off relation between the detection accuracy and inference time by plotting the results of RUN and other methods on COCO test-dev. The RUN-3way-300 model (25.0ms, 28.0% mAP) is 36% slower but 2.9% better in mAP than the SSD300 [17] model (18.3ms, 25.1% mAP). It is about 60% faster than ResNet-101 based SSD321 [6] model (61ms, 28.0% mAP) that has a similar performance. Likewise, the RUN-3way-512 (51.4ms, 32.4% mAP) is 26% slower but 3.6% better in mAP than the SSD512 model (40.8ms, 28.8% mAP). It is about 44% faster than RetinaNet-50-500 [15] (73ms, 32.5% mAP) that has a similar performance.
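The speed comparisons above can be checked from the quoted inference times. The sketch below converts latency to frames per second and computes two relative measures: the extra time a model needs compared with a faster baseline, and the time it saves compared with a slower one. The helper names are hypothetical, and the percentages in the text are rounded and may follow a slightly different convention (for example throughput gained rather than time saved), so small differences are expected.

```python
# Inference times in milliseconds, as quoted in the text for COCO test-dev.
LATENCY_MS = {
    "SSD300": 18.3,
    "RUN-3way-300": 25.0,
    "SSD321": 61.0,
    "SSD512": 40.8,
    "RUN-3way-512": 51.4,
    "RetinaNet-50-500": 73.0,
}


def fps(latency_ms: float) -> float:
    """Frames per second for a given per-image latency."""
    return 1000.0 / latency_ms


def extra_time_pct(latency_ms: float, baseline_ms: float) -> float:
    """How much longer (in %) one image takes compared with the baseline."""
    return (latency_ms / baseline_ms - 1.0) * 100.0


def time_saved_pct(latency_ms: float, baseline_ms: float) -> float:
    """How much less time (in %) one image takes compared with the baseline."""
    return (1.0 - latency_ms / baseline_ms) * 100.0


run300 = LATENCY_MS["RUN-3way-300"]
run512 = LATENCY_MS["RUN-3way-512"]

print(f"RUN-3way-300: {fps(run300):.1f} FPS")
print(f"  extra time vs SSD300: {extra_time_pct(run300, LATENCY_MS['SSD300']):.1f}%")
print(f"  time saved vs SSD321: {time_saved_pct(run300, LATENCY_MS['SSD321']):.1f}%")
print(f"RUN-3way-512: {fps(run512):.1f} FPS")
print(f"  extra time vs SSD512: {extra_time_pct(run512, LATENCY_MS['SSD512']):.1f}%")
print(f"  time saved vs RetinaNet-50-500: "
      f"{time_saved_pct(run512, LATENCY_MS['RetinaNet-50-500']):.1f}%")
```

The FPS values obtained this way for the two RUN models agree with the Titan X entries of Table 5 (40.0 and 19.5).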

In addition, we measured the FPS of our methods on a Titan X Pascal with the rest of the environment kept the same. Table 5 shows that even the most complex version of our method, RUN3WAY512, can work in real time (29.8 FPS) on a Titan X Pascal.

5 Conclusion

The proposed RUN architecture for object detection originated from the awareness of the contradictory requirements on multi-scale features: they should contain low-level information on an image as well as high-level information on objectness. The proposed 3-way Resblock alleviated the gradient exploitation problem and enriched contextual information, an important element of prediction. We also showed that the generalization performance of multi-scale prediction can be improved by integrating the separate prediction modules into one unified prediction module. This approach, although somewhat simple, resulted in outstanding performance on the PASCAL VOC test. The results on the COCO dataset also show how fast and efficient our algorithms are. We expect the proposed method to be applicable not only to SSD-based methods but also to other structures that utilize multi-scale features.