RUN : Residual Features and Unified Prediction Network for Single-Stage Detection
Recently, many single-stage detectors that use multi-scale features have been proposed. They are much faster than two-stage detectors that use region proposal networks (RPN), without much degradation in detection performance. However, in a single-stage detector the feature maps in the lower layers close to the input, which are responsible for detecting small objects, have insufficient representation power because they are too shallow. There is also a structural contradiction: these feature maps have to deliver low-level information to the next layers while also containing high-level abstraction for prediction. In this paper, we propose a method to enrich the representation power of feature maps using Resblocks and deconvolution layers. In addition, a unified prediction module is applied to generalize the output results and to boost the representation power of earlier layers for prediction. The proposed method enables more precise prediction and achieves higher scores than SSD on PASCAL VOC and MS COCO. It also maintains the fast computation of a single-stage detector, requiring much less computation than other detectors with similar performance. Code is available at https://github.com/kmlee-snu/run
The development of deep neural networks (DNN) in recent years has achieved remarkable results not only in object detection but also in many other areas. Early research on object detection using DNNs focused on representation learning that could replace hand-crafted features, with little consideration for detector speed. Recently, real-time detectors with low computational complexity have been actively studied.
Research on two-stage detectors, mostly based on Faster R-CNN, applied the region proposal network (RPN) and RoI pooling to the feature maps extracted by a state-of-the-art classifier such as ResNet-101. In contrast, single-stage methods such as YOLO and SSD removed the RoI pooling layer and predict bounding boxes and the corresponding class confidences directly, enabling faster detection and end-to-end learning.
In particular, SSD makes use of multi-scale feature maps generated from a backbone network such as VGG-16 to detect objects of various sizes. Since each prediction module, composed of 3×3 convolution filters, detects bounding boxes on its own layer separately, it cannot reflect appropriate contextual information from other scales. This causes the problem known as “Box-in-Box”, shown in Figure 1. In the figure, SSD often detects a single object with two overlapping boxes, where the smaller box covers only part of the object, such as the upper body of a person or the head of an animal.
To solve this problem, [6, 15] used ResNet and the feature pyramid network (FPN) structure to inject larger contextual information through a deep convolutional backbone by means of deconvolution. However, these structures increase the computational complexity and thus reduce the detection speed, which is a key advantage of a single-stage detector.
In this paper, we propose very simple ideas to solve the essential problems of multi-scale single-stage detectors. First, we introduce a 3-way residual block, a structure in which a Resblock and a deconvolution layer are added on top of the multi-scale feature maps. It lets detected boxes be determined with larger context, which makes them more reliable. Second, we integrate the multiple prediction modules, which had been applied separately to each layer, into one, in order to boost the information level of the feature maps from earlier layers.
The proposed structure, called “RUN: Residual features and Unified prediction Network”, is a single-stage detector that combines the 3-way Resblock with a unified prediction module on the VGG-16 network. RUN is very compact and fast compared to ResNet-based two-stage detectors and single-stage detectors using FPN, and it also achieves superior or competitive performance compared to them.
Overfeat, SPPNet, R-CNN, Fast R-CNN, Faster R-CNN and R-FCN, which are classified as region-based convolutional neural networks (R-CNN), showed a tremendous improvement in performance compared to previous object detection techniques. These region-based approaches have achieved huge advances over the last few years and are still the state of the art among object detection techniques. Specifically, these approaches usually use a two-stage method that generates a number of bounding boxes and then assigns a classification score to each of them. Thus, although classification may be relatively accurate, they are too slow to be used in real-time applications.
Redmon et al. proposed a method named YOLO that predicts bounding boxes and associated class probabilities in a single step by framing object detection as a regression problem. It divides the input image into a grid and regresses bounding boxes for multiple objects in each grid cell. This was the beginning of single-stage detection and subsequently inspired structures such as SSD. However, since YOLO uses only the highest-level feature maps to detect objects, it lacks lower-level information, which results in somewhat inaccurate detection, especially for small objects. (In this paper, the term level is used interchangeably with layer; the highest level is the layer farthest from the input layer.)
To solve this problem, SSD utilized not only the highest-level features but also lower-level features that have enough resolution to detect small objects. As mentioned in Inside-Outside Net (ION) and HyperNet, the feature maps at different layers have different abstraction levels for an input image. Therefore, using multi-scale feature maps can improve detection performance for objects of various scales. In SSD, many default boxes are created on the feature maps, and bounding box regression and classification are performed for each box area using 3×3 convolution. This enables multi-scale object detection without RoI pooling. It also effectively improves the detection accuracy for small objects, which is a weakness of YOLO. However, as mentioned in MS-CNN, because the feature maps of the backbone network are used directly for bounding box regression and classification, back-propagated gradients can cause unnecessary deformations in those feature maps, which can lead to instability during learning. In addition, since each classifier uses feature maps of a single scale only, it cannot reflect contextual information that is larger or smaller than that of the corresponding scale.
Recently, various methods have attempted to enhance the contextual information of each layer while keeping the advantages of SSD. DSSD obtained higher accuracy by changing the base network to ResNet-101 and combining FPN-style deconvolution layers with the existing multiple layers to reflect large-scale context. However, with the deep structure of ResNet-101 and the deconvolution layers, the processing speed degrades considerably (under 16.4 images per second), which prevents the method from being used for real-time detection.
Ren et al. introduced a recurrent rolling convolution (RRC) architecture to improve detection performance by letting layers with different sizes of contextual information complement each other. RRC made the multi-scale feature maps include both large and small context by concatenating adjacent feature maps through pooling and deconvolution. This process was implemented with an RNN structure, which allows each feature map to reflect information not only from adjacent feature maps but also from remote ones.
Unlike RRC, Rainbow SSD (R-SSD) proposed concatenating feature maps not only from adjacent layers but from all layers, using pooling and deconvolution, for bounding box regression and classification. It achieves higher performance than SSD by enhancing the representation power of the feature maps. Also, by making the dimension of each layer the same, it became possible to use a unified prediction module instead of different prediction modules for different layers. Woo et al. proposed StairNet, which utilizes both an FPN structure on the base VGG-16 network and the unified prediction of R-SSD.
Additionally, Lin et al. redefined the loss term for object detection, naming it Focal Loss. Unlike batch reconstruction methods such as OHEM, it resolves the foreground-background imbalance problem by changing the loss term itself. Their RetinaNet, which uses Focal Loss in combination with ResNet and the FPN structure, achieved state-of-the-art performance.
In this section, we propose residual feature maps and a unified prediction module. We show how the addition of structurally simple ideas can compensate for the drawbacks of SSD-based single-stage object detection methods.
Recent CNN models designed for object detection make use of a backbone network that was originally devised to solve image classification problems. Although the detection network can be trained end-to-end, the backbone network is normally initialized with weights trained for image classification. The relation between the features and predictions in networks used for image classification can be expressed mathematically as follows:
$$x_n = F_n(x_{n-1}) = (F_n \circ F_{n-1} \circ \cdots \circ F_1)(I) \tag{1}$$
$$\mathrm{Scores} = P(x_n) \tag{2}$$
where $I$ is an input image, $x_n$ is the $n$-th level feature map, $F_k$ is the transformation applied at the $k$-th level, and $P$ is a prediction function. The top-level feature map $x_n$ learns information on high-level abstraction. On the other hand, $x_k$ ($k < n$) has more local and low-level information as $k$ becomes smaller.
SSD applies several feature maps with different scales directly as inputs to separate prediction modules to calculate object positions and classification scores, which can be denoted by the following equation:
$$\mathrm{Detection} = \{P_1(x_{s_1}), P_2(x_{s_2}), \ldots, P_m(x_{s_m})\} \tag{3}$$
where $s_1$ to $s_m$ are the indices of the source feature maps for multi-scale prediction, and $P_k$ is a function that outputs multiple objects with different positions and scores. Combining (1) and (3), it can be expressed as
$$\mathrm{Detection} = \{P_1(x_{s_1}), P_2(F_{s_1}^{s_2}(x_{s_1})), \ldots, P_m(F_{s_1}^{s_m}(x_{s_1}))\} \tag{4}$$
where $F_a^b = F_b \circ F_{b-1} \circ \cdots \circ F_{a+1}$. Here, the earlier feature map $x_{s_1}$ needs to learn high-level abstraction to improve the performance of $P_1$. At the same time, it also needs to learn local features for efficient information transfer to the next feature maps. This not only makes learning difficult, but also causes the overall performance to decrease.
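The relation described above, in which each feature map is computed from the previous one and several intermediate maps are also fed to separate prediction functions, can be illustrated with a minimal sketch. The toy layers and predictors below are hypothetical stand-ins (plain integer functions rather than convolutions), and the names `compose_levels` and `detect` are chosen here only for illustration.

```python
def compose_levels(layers, image):
    """Return [x_1, ..., x_n], where x_k = F_k(x_{k-1}) and x_0 is the image."""
    features = []
    x = image
    for layer in layers:
        x = layer(x)
        features.append(x)
    return features


def detect(features, source_indices, predictors):
    """Apply one prediction function per source level (1-based level indices)."""
    return [p(features[s - 1]) for s, p in zip(source_indices, predictors)]


# Hypothetical toy stand-ins: every "layer" adds 1, every predictor tags its input.
layers = [lambda x: x + 1 for _ in range(5)]
predictors = [lambda x, k=k: (k, x) for k in (1, 2, 3)]

features = compose_levels(layers, image=0)
detections = detect(features, source_indices=[2, 4, 5], predictors=predictors)

print(features)    # [1, 2, 3, 4, 5]
print(detections)  # [(1, 2), (2, 4), (3, 5)]
```

In this sketch the level-2 feature is consumed twice: once by the first predictor and once by layer 3. That shared use is the source of the conflicting requirements discussed in the text.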
To mitigate this problem, SSD added an L2 normalization layer between the conv4_3 layer and the prediction module, which reduces the magnitude of the gradients coming from the prediction module. Cai et al. tried to solve the problem by adding a convolution layer only to the conv4_3 layer. Since the problem is not limited to the conv4_3 layer, these approaches do not solve it at its root. To meet the contradictory requirement of maintaining low-level information while having the flexibility to learn high-level abstraction, it is desirable to decouple the backbone network from the prediction module in the training phase.
To solve this problem, we propose a new architecture that decouples the backbone network from the prediction module, as shown in Figure 2. Instead of directly connecting the feature maps of the backbone network to the prediction module, we insert a multi-way Resblock for each level of feature maps, which acts like a bumper. The detailed architecture of the proposed multi-way Resblock is shown in Figure 3. Convolution layers and nonlinear activation units are used in all branches of the proposed Resblock. This prevents the gradients of the prediction module from flowing directly into the feature maps of the backbone network. It also clearly distinguishes the features used for prediction from the features delivered to the next layer. In other words, the proposed Resblock takes the role of learning high-level abstraction for object detection, while the backbone network containing low-level features is kept intact from the high-level detection information. This design improves the feature structure of SSD by not forcing the backbone to learn high-level abstraction, so that it keeps low-level image features.
Also, the earlier layers (e.g. conv4_3) used for small object detection in SSD are very shallow. Therefore, SSD cannot detect small objects well, because the representation power of these layers is insufficient for them to be used directly in prediction. To compensate for this, we use a 3×3 convolution layer in branch2 of the Resblock, as shown in Figure 3, to reflect the peripheral contextual information.
Branch3 in the right side of Figure 3 contains a deconvolution layer whose input is the feature maps of the consecutive layer. This is similar to a structure proposed in  and , and it is a proper method to propagate large contextual information to a small scale feature map so that even when detecting a small object, information about its surroundings is also utilized. This can reduce the cases of detecting a part of an actual object. Thus, it can be a remedy for the box-in-box problem described earlier. The effect of this is intuitively shown in the right side of Figure 1. Finally, the proposed architecture in Figure 2 can be expressed as follows:
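$$\mathrm{Detection} = \{P_1(r_1), P_2(r_2), \ldots, P_m(r_m)\} \tag{5}$$
where, for the $k$-th source feature map $x_{s_k}$, $r_k = B_1(x_{s_k}) + B_2(x_{s_k}) + B_3(x_{s_{k+1}})$ for $k < m$ and $r_m = B_1(x_{s_m}) + B_2(x_{s_m})$. Here, $B_1$, $B_2$ and $B_3$ indicate branch1, branch2 and branch3, respectively.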
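For branch3 to be combined with the other branches, its deconvolution output must have the same spatial size as the feature map of the current level. The sketch below uses the standard transposed-convolution output-size formula (dilation 1) to search for settings that map each level to the size of the previous one. The feature-map sizes are those of an SSD300-style pyramid, assumed here for illustration, and the kernel, stride and padding values it finds are hypothetical, not the settings used by the authors.

```python
def deconv_output_size(size_in, kernel, stride, padding=0, output_padding=0):
    """Spatial output size of a transposed convolution with dilation 1."""
    return (size_in - 1) * stride - 2 * padding + kernel + output_padding


def pick_deconv(size_in, size_out):
    """Search a small grid for settings that map size_in to size_out."""
    for stride in (1, 2):
        for kernel in (2, 3, 4):
            for padding in (0, 1):
                if deconv_output_size(size_in, kernel, stride, padding) == size_out:
                    return {"kernel": kernel, "stride": stride, "padding": padding}
    return None


# Assumed SSD300-style feature-map sizes, from the earliest to the last level.
sizes = [38, 19, 10, 5, 3, 1]

for small, large in zip(sizes[1:], sizes[:-1]):
    print(f"{small} -> {large}: {pick_deconv(small, large)}")
```

Because the pyramid mixes even and odd sizes, a single deconvolution setting does not fit every level. This is one reason a per-level choice is sketched here.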
Detecting objects of various sizes has long been recognized as an important problem in object detection. Traditionally, [26, 3, 4] used a single classifier on multi-scale feature maps extracted from an image pyramid. Another approach is to use multiple classifiers on a single input image. The latter has the advantage of reducing the amount of computation needed to calculate the feature maps, but it requires an individual classifier for each object scale.
Since neural networks became prevalent, two-stage detectors have applied RoI pooling to the CNN output to extract feature maps of the same size from objects of different sizes. These feature maps are used as the input of a single classifier. Meanwhile, methods using multi-scale features, such as SSD, adopted multiple classifiers, since the feature maps at each scale differ not only in length but also in the underlying contextual information. To train prediction layers of various scales effectively, objects of various scales must be provided as input. SSD dramatically increased its detection performance through augmentation that transforms the size of the input images.
R-SSD proposed Rainbow concatenation, which combines feature maps of different scales using pooling and deconvolution. This makes the depth of the input features the same for every prediction module. Thus, R-SSD could use a single classifier whose weights are shared across the multi-scale prediction modules. Similarly, the proposed 3-way Resblocks force all the feature maps to have the same depth of 256, as shown in Fig. 3. Structurally, it is therefore possible to unify the convolution layers of the different prediction modules as in R-SSD. The idea of the unified prediction module is similar to that of R-SSD, but our method differs from R-SSD in the information contained in the input feature maps.
This approach makes differently scaled feature maps have a similar level of information. SSD used multiple features of various scales, which improves the detection of small objects compared with YOLO [18, 19], which used only the last layer of the backbone network. However, since its earliest feature map is obtained from much shallower layers than the later feature maps, it still suffers from insufficient information for prediction. Because unified prediction is applied equally to the feature maps of all scales, it forces the outputs of the 3-way Resblocks, which sit between the backbone feature maps and the prediction module, to be learned at a similar information level. In other words, unified prediction in combination with the residual feature block makes the feature maps in the earliest Resblocks rich in context. A brief summary of different object detection schemes is shown in Fig. 4.
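One practical consequence of sharing a single prediction module across scales is a smaller number of prediction-head parameters. The following back-of-the-envelope sketch compares separate and unified heads under stated assumptions: 3×3 convolutional heads on 256-channel inputs with 6 default boxes per location (as described in the text), plus 6 prediction levels and 21 classes (20 PASCAL VOC classes and background), which are assumptions made here. The helper names are hypothetical.

```python
def conv_params(in_channels, out_channels, kernel=3, bias=True):
    """Number of learnable parameters in a kernel x kernel convolution."""
    weights = kernel * kernel * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def head_params(in_channels, boxes_per_location, num_classes):
    """Parameters of one prediction head: class scores plus 4 box offsets per box."""
    out_channels = boxes_per_location * (num_classes + 4)
    return conv_params(in_channels, out_channels)


NUM_LEVELS = 6     # assumed number of prediction levels
NUM_CLASSES = 21   # assumed: 20 PASCAL VOC classes + background

per_head = head_params(in_channels=256, boxes_per_location=6, num_classes=NUM_CLASSES)
separate_total = NUM_LEVELS * per_head  # one head per level
unified_total = per_head                # one head shared by all levels

print(per_head, separate_total, unified_total)  # 345750 2074500 345750
```

The count only covers the prediction heads. The Resblocks and the backbone are unchanged by the choice between separate and unified prediction.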
We evaluated the proposed method on the PASCAL VOC 2007, PASCAL VOC 2012 and MS COCO datasets. Our implementation is based on the publicly available SSD code (https://github.com/weiliu89/caffe/tree/ssd). All experimental results for SSD are the latest scores obtained with its data augmentation. For all the experiments, the reduced VGG-16 model pre-trained on the ILSVRC CLS-LOC dataset is used as the backbone network. For fair comparison, most of the settings are the same as those of SSD, except for the number of proposals. This number differs from SSD because we used 6 default boxes in all the prediction layers for unified prediction, while SSD used 4 for conv4_3 and the top layer, and 6 for the rest.
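The effect of using 6 default boxes at every prediction layer on the total number of proposals can be estimated with a short calculation. The feature-map resolutions below are those of an SSD300-style pyramid and are an assumption, since the text does not list them. The per-layer box counts for SSD follow the description in the paragraph above.

```python
# Assumed SSD300-style feature-map resolutions (not stated in the text).
FEATURE_MAP_SIZES = [38, 19, 10, 5, 3, 1]


def count_default_boxes(sizes, boxes_per_location):
    """Total default boxes over square feature maps of the given side lengths."""
    return sum(s * s * b for s, b in zip(sizes, boxes_per_location))


# Unified prediction: 6 default boxes at every prediction layer.
run_boxes = count_default_boxes(FEATURE_MAP_SIZES, [6] * 6)

# SSD as described above: 4 boxes for conv4_3 and the top layer, 6 for the rest.
ssd_boxes = count_default_boxes(FEATURE_MAP_SIZES, [4, 6, 6, 6, 6, 4])

print(run_boxes, ssd_boxes)  # 11640 8750
```

Under these assumptions, most of the additional proposals come from the earliest and largest feature map, where each extra box per location is multiplied by 38 × 38 positions.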
| Method | mAP (%) |
|---|---|
| SSD 300 + 2WAY | 78.3 |
| SSD 300 + 2WAY + Unified Pred | 78.6 |
| SSD 300 + 3WAY | 78.8 |
| SSD 300 + 3WAY + Unified Pred | 79.2 |
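| Method | data | network | mAP | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SSD 513 | 07++12 | Residual-101 | 79.4 | 90.7 | 87.3 | 78.3 | 66.3 | 56.5 | 84.1 | 83.7 | 94.2 | 62.9 | 84.5 | 66.3 | 92.9 | 88.6 | 87.9 | 85.7 | 55.1 | 83.6 | 74.3 | 88.2 | 76.8 |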
DSSD 513 
|Method||data||network||Avg. Precision, IoU:||Avg. Precision, Area:||Avg. Recall, #Dets:||Avg. Recall, Area:|
We trained our model on VOC2007 trainval and VOC2012 trainval with a batch size of 32. For the 2-way model, the initial learning rate was decreased by a factor of 10 at 80k and again at 100k iterations, and training was terminated at 120k iterations. For the 3-way model, we froze all the weights of the pre-trained 2-way model except the prediction module, then fine-tuned the network with a three-step learning rate schedule of 40k, 20k and 10k iterations. End-to-end training was also applied to the 3-way model, but the results were worse than with the above training strategy.
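The step schedule used for the 2-way model (a tenfold decrease at 80k and at 100k iterations) can be written as a small function. This is a minimal sketch: the function name `step_lr` is hypothetical, and the base learning rate is left as a parameter because its value is not given above.

```python
def step_lr(iteration, base_lr, milestones=(80_000, 100_000), gamma=0.1):
    """Learning rate after multiplying by gamma at every milestone already reached."""
    drops = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** drops


# base_lr=1.0 is a placeholder, so the printed values are relative scales.
for it in (0, 79_999, 80_000, 100_000, 119_999):
    print(it, step_lr(it, base_lr=1.0))
```

The same function covers the three-phase fine-tuning of the 3-way model by passing `milestones=(40_000, 60_000)`, provided that each phase also lowers the rate by a constant factor. That last point is an assumption, not something stated in the text.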
Table 1 shows our results on the PASCAL VOC 2007 test set. Here, Unified Pred denotes the proposed unified prediction module; for the models without this label, the prediction modules were trained separately as in the original SSD. As mentioned above, each 3-way model was fine-tuned from the corresponding 2-way model. In this experiment, the proposed model with only the 2-way Resblock, without the deconvolution path, achieved 1.1% higher mAP than SSD. The 3-way model, which additionally utilizes deconvolution layers, was up to 0.6% higher than the 2-way model. The unified prediction module brought a larger gain in the 3-way model than in the 2-way model, which scored 79.2% and 78.6%, respectively.
For the VOC 2012 test, we trained models on the 07++12 dataset, consisting of VOC2007 trainval, VOC2007 test and VOC2012 trainval. First, we performed an experiment applying the 2-way Resblock in combination with unified prediction. Then, another experiment was performed using the 3-way Resblock with unified prediction after freezing the weights of the 2-way Resblock it contains.
Table 2 shows the VOC 2012 test results of RUN300 and other models based on SSD300. The proposed model, RUN300, shows a large performance improvement over the base model SSD300. In particular, the 3-way model achieved 77.1% mAP, outperforming the other SSD-based models. It also improved mAP by 0.7% compared to StairNet, which uses FPN and unified prediction. From this result, we conjecture that the proposed 3-way Resblock is more effective than FPN.
Table 3 shows the results of the RUN512 models and others. The 3-way model achieved 79.8% mAP, which is 1.3% better than SSD512. It performs slightly worse than DSSD513, probably because the ResNet-101 backbone of DSSD513 produces better features for larger input images than the VGG-16 backbone of SSD and of our model.
For fair comparison with SSD, most of the hyper-parameters required for training were set to the same values as in SSD. For training the 2-way models, we used a three-step learning rate schedule covering the first 240k iterations, the next 120k iterations and the last 40k iterations. For training the 3-way models, the three steps covered the first 120k iterations, the next 60k iterations and the last 20k iterations, which are exactly half of those for the 2-way models. Other parameters, such as the scales and aspect ratios of the prior boxes, were identical to those of SSD.
| Method | Network | mAP (%) | FPS | GPU |
|---|---|---|---|---|
| Faster R-CNN | VGG16 | 73.2 | 7 | Titan X |
| Faster R-CNN | Residual-101 | 76.4 | 2.4 | K40 |
| R-FCN | Residual-101 | 80.5 | 9 | Titan X |
| SSD321 | Residual-101 | 77.1 | 16.4 | Titan X |
| DSSD321 | Residual-101 | 78.6 | 11.8 | Titan X |
| R-SSD300 | VGG16 | 78.5 | 37.1* | Titan X |
| StairNet | VGG16 | 78.8 | 30 | Titan X Pascal |
| RUN2WAY300 | VGG16 | 78.6 | 41.8 | Titan X |
| RUN2WAY300 | VGG16 | 78.6 | 58.4 | Titan X Pascal |
| RUN3WAY300 | VGG16 | 79.2 | 40.0 | Titan X |
| RUN3WAY300 | VGG16 | 79.2 | 56.3 | Titan X Pascal |
| SSD513 | Residual-101 | 80.6 | 8.0 | Titan X |
| DSSD513 | Residual-101 | 81.5 | 6.4 | Titan X |
| R-SSD512 | VGG16 | 80.8 | 15.8* | Titan X |
| RUN2WAY512 | VGG16 | 80.6 | 20.1 | Titan X |
| RUN2WAY512 | VGG16 | 80.6 | 31.8 | Titan X Pascal |
| RUN3WAY512 | VGG16 | 80.9 | 19.5 | Titan X |
| RUN3WAY512 | VGG16 | 80.9 | 29.8 | Titan X Pascal |
Table 4 shows the performance of various methods on MS COCO test-dev. Although the proposed methods use a relatively shallow network, VGG-16, they achieve performance comparable to other methods that use a very deep network. The fourth column indicates that RUN3WAY300 achieved 2.9% better mAP than SSD300. This matches the performance of SSD321 and DSSD321, which adopted ResNet-101 as their backbone network. Also, RUN3WAY512 achieved 3.6% better mAP than SSD512. In particular, RUN3WAY512 achieved the highest average precision and recall for small objects among the compared methods except RetinaNet. This indicates that the proposed Resblock is an effective module for enhancing low-level feature maps.
Single-stage detectors, represented by YOLO and SSD, proposed end-to-end neural networks that removed the RoI pooling of two-stage detectors. They achieved large speed improvements, but could not avoid a loss of accuracy. Conversely, more recent single-stage detectors have improved accuracy at the cost of speed. Unlike these approaches, the proposed RUN is designed to maximize performance at high speed on the VGG-16 backbone, which has significantly fewer layers and parameters than ResNet. The experimental results demonstrate the performance improvement of RUN.
Table 5 shows that our method outperforms its competitors with a smaller loss of speed. Our experiments were run on a Titan X GPU with cuDNN v5.1 and an Intel i7 CPU. For exact comparison, we measured the FPS of some methods in the same environment and marked them with * in the table.
In Figure 5, we show the trade-off relation between the detection accuracy and inference time by plotting the results of RUN and other methods on COCO test-dev. The RUN-3way-300 model (25.0ms, 28.0% mAP) is 36% slower but 2.9% better in mAP than the SSD300  model (18.3ms, 25.1% mAP). It is about 60% faster than ResNet-101 based SSD321  model (61ms, 28.0% mAP) that has a similar performance. Likewise, the RUN-3way-512 (51.4ms, 32.4% mAP) is 26% slower but 3.6% better in mAP than the SSD512 model (40.8ms, 28.8% mAP). It is about 44% faster than RetinaNet-50-500  (73ms, 32.5% mAP) that has a similar performance.
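The speed comparisons above mix two measures: extra inference time relative to a faster model ("slower") and inference time saved relative to a slower model ("faster"). The sketch below recomputes both from the latencies quoted in the text and converts latency to frames per second. The helper names are hypothetical.

```python
def fps(latency_ms):
    """Frames per second for a given per-image latency in milliseconds."""
    return 1000.0 / latency_ms


def extra_time_pct(latency_ms, reference_ms):
    """How much more time is needed than the reference, in percent."""
    return (latency_ms / reference_ms - 1.0) * 100.0


def time_saved_pct(latency_ms, reference_ms):
    """How much less time is needed than the reference, in percent."""
    return (1.0 - latency_ms / reference_ms) * 100.0


# Latencies (ms) quoted in the text for COCO test-dev.
run300, ssd300, ssd321 = 25.0, 18.3, 61.0
run512, ssd512 = 51.4, 40.8

print(f"RUN-3way-300: {fps(run300):.1f} FPS")
print(f"RUN-3way-512: {fps(run512):.1f} FPS")
print(f"RUN-3way-300 vs SSD300: {extra_time_pct(run300, ssd300):.1f}% more time")
print(f"RUN-3way-512 vs SSD512: {extra_time_pct(run512, ssd512):.1f}% more time")
print(f"RUN-3way-300 vs SSD321: {time_saved_pct(run300, ssd321):.1f}% less time")
```

The FPS values obtained this way agree with the Titan X entries for RUN3WAY300 and RUN3WAY512 in Table 5.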
In addition, we measured the FPS of our methods on a Titan X Pascal with the rest of the environment kept the same. Table 5 shows that even the most complex version of our method, RUN3WAY512, can work in real time (29.8 FPS) on a Titan X Pascal.
The proposed RUN architecture for object detection originated from the awareness of the contradictory requirements placed on multi-scale features: they should contain low-level information about an image as well as high-level information on objectness. The proposed 3-way Resblock alleviated the gradient exploitation problem and enriched contextual information, an important element of prediction. We also showed that the generalization performance of multi-scale prediction can be improved by integrating the separate prediction modules into one unified prediction module. This approach, although somewhat simple, resulted in outstanding performance on the PASCAL VOC test. The results on the COCO dataset also show how fast and efficient our algorithms are. We expect the proposed method to be applicable not only to SSD-based methods but also to other structures that utilize multi-scale features.
Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 2874–2883. IEEE, 2016.
Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.