Aggregating multi-scale information is critical for object detection models to exploit context and achieve better performance in challenging conditions, especially for the Single Shot Detector (SSD) [15]. Unlike conventional two-stage object detectors (e.g. RCNN [7], Faster RCNN [21]) and single-stage detectors (e.g. YOLO [18] and YOLO-v2 [19]) that detect objects based on features extracted from the last layer, SSD detects objects from both shallow and deep layers. Under the SSD framework, detectors placed in shallow layers are responsible for detecting small objects, while detectors placed in deeper layers are responsible for detecting larger objects. Such a design significantly improves the detection accuracy on small objects, since fine features from the shallow layers contain richer details at much higher resolution, which may be missed by coarse features from top layers due to down-sampling. However, SSD still performs poorly on difficult objects such as bottles. This is because the features from shallow layers have a limited receptive field, so all decisions are made based only on a local area. These features cannot perceive global context from surrounding areas to support context reasoning and accurate decisions. Integrating information from other scales widens the receptive field, and can thus alleviate such ambiguities and reduce uncertainty in the local area.
However, the accuracy improvement brought by multi-scale information does not come for free. Existing multi-scale information fusion techniques introduce extra components to the existing network, which leads to a significant drop in speed (see Figure 1). One of the main extra components consuming a large amount of computation is the up-sampling unit, which is mostly implemented by deconvolution layers using computationally expensive kernels, e.g. kernels. In addition, the extra convolution layers used for gathering and fusing information from different scales are also computationally costly. Both components are essential: the up-sampling unit matches feature maps to the same scale, and the extra convolution layers refine the features before sending them to the detector. Studies on how to reduce the computation by building a more efficient information fusion unit without degrading performance are still rare.
In this work, we propose a light-weight information fusion architecture that effectively fuses multi-scale information without consuming much computation. We show that combining information from both lower-level, finer features and higher-level, coarser features leads to more efficient fusion. We then propose a novel context reasoning architecture that enjoys a smoother information flow by considering only adjacent scales, and a controllable complexity through an iterative inference mechanism. The proposed light-weight information weaving architecture, called WeaveNet, iteratively conducts multi-scale reasoning with information from adjacent scales only and progressively fuses long-range information across multiple scales. It does not require batch normalization. Therefore, a deeper backbone network, such as DPN-131 [2], can be adopted to further improve the accuracy as long as the GPU memory can accommodate 1 image per mini-batch. More importantly, WeaveNet is highly efficient and can gradually improve its performance by simply performing more iterations, as shown in Figure 1. We apply WeaveNet to object detection on the PASCAL VOC 2007 and 2012 benchmarks. It achieves 79.5% mAP on PASCAL VOC 2007 with a processing speed as fast as 101 fps.
In the following sections, we first revisit existing multi-scale methods and highlight the uniqueness of the proposed WeaveNet. We then introduce WeaveNet in detail and evaluate its performance on benchmark datasets.
2 Related Work
Recently, single-stage detectors have attracted increasing attention due to their simplicity and high detection speed compared with two-stage detectors, e.g. Faster RCNN [21] and RFCN [3]. However, despite these advantages, single-stage detectors usually do not perform well on difficult and small objects.
Different from many two-stage detectors and other single-stage models (e.g. YOLO [18]), SSD [15] detects objects at multiple scales to suit their sizes: small objects are detected at shallow layers with low-level high-resolution features, while large objects are detected in top layers with high-level low-resolution features. Such a design reduces the need for a very large input size, e.g. as commonly used in Faster RCNN [21], to keep rich feature details for the top layers, and thus significantly reduces the computational cost. However, it brings another problem when detecting difficult and small objects. Since features in lower layers have a much smaller receptive field, the network cannot perceive a broader view to utilize more global context information, and therefore suffers from ambiguity and insufficient context exploration.
To improve the accuracy of SSD on difficult and small objects, various strategies [5, 13, 22, 9, 20] have been proposed to introduce multi-scale information to the conventional SSD framework. One main stream attaches a top-down pyramid-like structure that propagates information from top layers to bottom layers to enlarge the receptive field of each shallow layer. For example, the Deconvolutional Single Shot Detector [5] uses a deconvolutional module to enlarge the scale of top layers and adds them back to the shallow layer features, followed by several extra convolutional layers for fusing the merged information. The very recently proposed StairNet [22] and RetinaNet [13] share a similar idea, but they adopt slightly different strategies for choosing a proper adaptive layer before the element-wise sum, and they use more effective blocks to conduct further inference on the merged information. However, in order to let enough information flow to the final bounding box regressor and classifier, all of the newly attached layers usually have large input and output channel sizes, e.g. , leading to a considerable computational cost.
Another stream of utilizing multi-scale information considers both low-level and high-level information. The main idea is that, in addition to introducing information from top layers to enlarge the receptive fields, these methods also pass more detailed local information to top layers to make bounding box localization and category inference more precise. However, most existing methods following this stream require more computation, since information from all scales (other than the target scale) is merged simultaneously. Rainbow SSD (RSSD) [9] proposes to utilize both low- and high-level information by concatenation, which increases the final fused information from hundreds of channels to 2,816 channels and introduces significant computational cost in the following layers. In addition, the Recurrent Rolling Convolution Network [20] proposes to recurrently forward information from top to bottom and from bottom to top. However, the inner state-to-state adaptive layers are computationally expensive and also significantly reduce the speed compared with the vanilla SSD.
Different from previous works, we propose a weaving structure for multi-scale information fusion. The proposed WeaveNet is naturally friendly to optimization and does not require batch normalization layers to ease training. It is also highly parallel within each iteration and requires much less computation, which is preferable for real-time applications.
3.1 Information Weaving
The conventional Single Shot Detector uses features extracted at multiple resolutions to detect objects at various scales. However, as discussed above, the receptive field in shallower layers can only cover limited local areas, which makes it hard to conduct complex context reasoning based on global features. Moreover, features from higher layers that pass through several down-sampling stages may lose the detailed information needed for precise localization. In this work, we propose to introduce features from both lower and higher layers through a novel information weaving architecture that overcomes both drawbacks without sacrificing efficiency.
As shown in Figure 2, the idea of the proposed information weaving structure is to gradually weave the information from adjacent scales for the detector in the current scale. Here, “gradually” means that only information from adjacent scales is taken into account, since we believe the current scale should focus more on adjacent scales than on faraway ones in the very first iterations. We propose to integrate the long-range information through an iterative inference process, which allows information to propagate from the neighbors of a neighbor, through that neighbor, to the current scale. By weaving information iteratively, sufficient multi-scale context information can be transferred to and thoroughly integrated into the current scale.
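To make the iterative propagation concrete, the following minimal Python sketch tracks which scales can have influenced each scale after a given number of weaving iterations. It models connectivity only, not the features themselves. The function name `reachable_scales` and the scale indexing are illustrative assumptions and are not part of the original method description.

```python
def reachable_scales(num_scales, iterations):
    """Return, for each scale, the set of scales whose raw features can have
    influenced it after `iterations` rounds of adjacent-scale weaving."""
    # Before any weaving, every scale only sees its own raw feature.
    reach = [{i} for i in range(num_scales)]
    for _ in range(iterations):
        updated = []
        for i in range(num_scales):
            merged = set(reach[i])
            if i > 0:                    # information from the lower adjacent scale
                merged |= reach[i - 1]
            if i + 1 < num_scales:       # information from the higher adjacent scale
                merged |= reach[i + 1]
            updated.append(merged)
        reach = updated
    return reach


if __name__ == "__main__":
    for k in range(4):
        print(k, reachable_scales(4, k))
```

With four scales, one iteration gives scale 0 access to scales {0, 1} only, while three iterations are enough for it to be influenced by all four scales. This is the sense in which long-range information is integrated gradually.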
3.2 Network Architecture
In the main architecture, the right part shows our proposed context information weaving structure, while the left part shows the raw features extracted from the backbone CNN. WeaveNet takes the raw features extracted from the backbone network as input and outputs the refined features. Each refined feature is then followed by two separate convolution layers for bounding box regression and classification respectively, the same as in the vanilla SSD. Within the proposed WeaveNet, information is gathered and fused iteratively. Each scale extracts useful information from both its higher adjacent layer (with low resolution) and its lower adjacent layer (with high resolution). This information is added to the current state for the next decision after going through a block for information fusion and inference. During the iterative refinement, besides propagating to more distant scales, the information can also propagate back to its own scale, which enables more complex reasoning and introduces greater non-linearity. To make information from adjacent scales compatible with the current scale, we use bi-linear interpolation to enlarge the feature maps and strided max pooling to reduce them. Both operations are parameter-free and introduce negligible computational cost.
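The two parameter-free resizing operations can be sketched in plain Python on a single-channel feature map. This is an illustration only: the 2×2 pooling window with stride 2, the fixed factor-of-2 up-sampling, and the half-pixel alignment convention are assumptions made for the sketch, and the helper names are hypothetical.

```python
def max_pool_2x2(fmap):
    """Halve the resolution of a 2-D map with 2x2 max pooling, stride 2."""
    height, width = len(fmap), len(fmap[0])
    return [
        [max(fmap[y][x], fmap[y][x + 1], fmap[y + 1][x], fmap[y + 1][x + 1])
         for x in range(0, width - 1, 2)]
        for y in range(0, height - 1, 2)
    ]


def _source_coords(out_index, in_size):
    """Map an output index to (low index, high index, weight of high index)."""
    pos = (out_index + 0.5) / 2.0 - 0.5       # half-pixel centre alignment
    pos = min(max(pos, 0.0), in_size - 1.0)   # clamp at the borders
    low = int(pos)
    high = min(low + 1, in_size - 1)
    return low, high, pos - low


def bilinear_upsample_2x(fmap):
    """Double the resolution of a 2-D map with bilinear interpolation."""
    height, width = len(fmap), len(fmap[0])
    out = []
    for oy in range(2 * height):
        y0, y1, wy = _source_coords(oy, height)
        row = []
        for ox in range(2 * width):
            x0, x1, wx = _source_coords(ox, width)
            top = fmap[y0][x0] * (1 - wx) + fmap[y0][x1] * wx
            bottom = fmap[y1][x0] * (1 - wx) + fmap[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out
```

Neither function has learnable weights, so the only cost of aligning adjacent scales is a small amount of arithmetic per output pixel.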
Details of the inner block architecture are shown in Figure 3. The block is designed to be as light as possible to reduce computational cost. At the beginning of the block, information from both higher and lower layers is concatenated to form the input to the block, which contains information from all previous iterations, including the raw features from the backbone network. Then, all the information is passed through a convolution layer with a ReLU activation layer for spatial non-linear fusion. The output is then passed to the neighboring scales for the next iteration. We repeatedly stack these simple blocks to model and aggregate more complex and richer context information. In our experiments, the most significant improvement comes from the 1st iteration, and the maximum performance is usually reached at the 5th iteration.
We use a reduced VGG16 network as the backbone and attach the proposed WeaveNet for multi-scale information fusion. The overall setting follows the original SSD for fair comparison, where we extract the features from conv4_3, fc6, and add 6 more layers after the fc6. In our training platform, the existing bi-linear interpolation is implemented with channel-wise deconvolution, which, like the pooling layer, only supports integer scale factors. Thus, we set the input image size as , so that the size of each scale is respectively where the scales are reduced by a factor of 2 in the first 4 scales. In this way, it is easier to attach our proposed WeaveNet. Since the last two scales are small, one convolution kernel is able to cover the whole map. Hence, we do not refine them further and keep them the same as in the vanilla SSD.
3.3 Architecture Simplification
Single Shot Detector is known for its high speed as well as its high accuracy. Losing either of these advantages would make the detector less favorable. In this subsection, we elaborate on how to accelerate WeaveNet by reformulating the network topology: fragmented computations are grouped together to reduce the data allocation and communication cost and to increase hardware utilization.
Suppose is the raw feature extracted from the -th scale of the backbone network. Let be the input and be outputs of the Block shown in Figure 3 in the -th iteration. Here will be sent to the lower scale and will be sent to the upper scale. Then, the convolution operation in Figure 3 can be formulated as
where , are the convolutional kernel and bias of the lower layer respectively, , are the convolutional kernel and bias for the upper layer, and is the ReLU activation function shown in Figure 3.
We propose to group the computation by and . Then, Eqn. (1) can be simplified as
Further splitting into gives
The computation in each block can be separated into two parts. One depends on the previous states and the other part does not. Thus the latter part can be grouped and pre-computed for further acceleration, i.e. .
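The separation into a state-dependent part and a state-independent part relies on the linearity of convolution over its input channels. The sketch below illustrates this for a single spatial position with a 1×1 kernel, where the convolution reduces to a matrix-vector product. The names `state` (output of previous iterations), `raw` (backbone feature), `block_fused`, and `block_split` are hypothetical notation introduced for this illustration and are not the paper's symbols.

```python
def matvec(weights, vec):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def relu(vec):
    return [max(0.0, v) for v in vec]


def block_fused(weights, bias, state, raw):
    """Apply one kernel to the concatenation of state and raw features."""
    concatenated = state + raw
    pre_activation = [y + b for y, b in zip(matvec(weights, concatenated), bias)]
    return relu(pre_activation)


def block_split(weights, bias, state, raw):
    """Same result, with the raw-feature term computed separately."""
    n = len(state)
    w_state = [row[:n] for row in weights]   # columns acting on the state
    w_raw = [row[n:] for row in weights]     # columns acting on the raw feature
    # This term does not depend on the iteration, so it can be computed once.
    precomputed = [y + b for y, b in zip(matvec(w_raw, raw), bias)]
    return relu([s + p for s, p in zip(matvec(w_state, state), precomputed)])


if __name__ == "__main__":
    weights = [[1.0, -2.0, 0.5, 3.0], [0.0, 1.0, -1.0, 2.0]]
    bias = [0.5, -1.0]
    state, raw = [1.0, 2.0], [4.0, 1.0]
    print(block_fused(weights, bias, state, raw))
    print(block_split(weights, bias, state, raw))
```

Both functions return the same values. The split form only needs to recompute the small state-dependent product in every iteration, while the raw-feature term is reused.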
Figure 4 shows the corresponding simplified architecture, where fragmented computations are grouped together to reduce unnecessary data allocation and communication cost and to increase hardware utilization. More specifically, different from the topology shown in Figure 3, only the outputs from adjacent scales are gathered by concatenation. Within the block, these inputs are all sent to a single transformer. The output is summed element-wise with a pre-computed source and then split and sent to its adjacent scales. Both the inputs and the outputs of each block are usually low-dimensional, e.g. 32. When only the pure computational cost is considered, the cost of adding one block can be more than less than that of adding a single layer with both input and output channels equal to for fusing multi-scale information.
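The effect of keeping block inputs and outputs low-dimensional can be estimated by counting multiply-accumulate operations (MACs) of a convolution layer. The sketch below uses the standard count for a dense convolution. The 40×40 map size, the 3×3 kernel, and the 256-channel comparison layer are hypothetical values chosen only for illustration and are not figures taken from the experiments.

```python
def conv_macs(height, width, in_channels, out_channels, kernel_size):
    """Multiply-accumulate count of a dense convolution with 'same' output size."""
    return height * width * in_channels * out_channels * kernel_size * kernel_size


if __name__ == "__main__":
    narrow = conv_macs(40, 40, 32, 32, 3)     # low-dimensional layer
    wide = conv_macs(40, 40, 256, 256, 3)     # wide fusion layer
    print(narrow, wide, wide // narrow)
```

Because the cost grows with the product of input and output channels, shrinking both from 256 to 32 reduces the MAC count of such a layer by a factor of 64 under these assumptions.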
We implement the proposed WeaveNet based on the vanilla SSD [15] using the same version of Caffe. Following the original SSD, we adopt the same optimization strategy and train for the same number of iterations for fair comparison. In particular, all networks are trained with a batch size of 32, and the learning rate is reduced from to by at the K and K iterations and terminated at the K iterations. Frames Per Second (fps) is reported on both an Nvidia Titan X (Maxwell) GPU and an Nvidia Titan X (Pascal) GPU with the same NVIDIA libraries, i.e. CUDA 8.0 and cuDNN v5.1.
In the experiments, we evaluate WeaveNet on the widely used PASCAL VOC 2007 and PASCAL VOC 2012 benchmarks [4]. To provide more insight, we first conduct controlled experiments on the PASCAL VOC 2007 benchmark to study the properties of WeaveNet. Then, based on the results, we test the proposed WeaveNet with its best settings on both benchmarks and compare it with existing state-of-the-art object detection methods through an in-depth analysis.
4.1 Importance of Coarse and Fine Features
We start with an ablation study to investigate the importance of top-down and bottom-up information propagation. Since top-down information propagation has already been observed to be important by many papers [9, 12, 13, 22], in our experiments, we are more concerned about whether further introducing finer features from lower layers would improve the accuracy.
|Method||Settings||Input Size||Top-down||Bottom-up||mAP (small)||mAP (medium)||mAP (large)||mAP (overall)|
The ablation study blocks either the bottom-up or the top-down information path individually to study the effect of each component. In the experiments, the number of iterations is set to be for WeaveNet. Table 1 shows the experimental results. As can be seen from the last column, the “top-down” information propagation improves the overall mAP by as expected, while the “bottom-up” information propagation also improves the overall mAP by . Utilizing both top-down and bottom-up information further improves the mAP to , indicating that both top-down and bottom-up features are important.
To further investigate the effects of introducing top and bottom features, we follow  to sort the testing images into different scales by the area of the ground truth bounding box and evaluate mAP w.r.t. specific object sizes. Specifically, the ground truth bounding boxes are divided into three parts per class: i.e. . When evaluating on a specific scale, ground truth bounding boxes on other scales are ignored. The results are shown in the 6th to 8th columns of Table 1. As can be seen from the results, the “Top-down” information significantly benefits small object detection. Further introducing the fine detailed features from bottom layers helps detect medium objects. However, it is interesting to see that once both top and bottom features are considered, the detection accuracy on large objects drops slightly.
We visualize the testing results in Figure 5 to compare the detection results of each detector. As can be seen from Figure 5 (d), smaller and more difficult objects are detected correctly, compared with other strategies in Figure 5 (a)-(c).
The above ablation studies verify our conjecture that combining low-level and high-level features can further improve object detection accuracy. Thus, in the following experiments, we use the full version of WeaveNet to further study its properties and compare it with state-of-the-art methods.
4.2 Effectiveness and Efficiency
The most attractive advantage of a single-stage detector is its speed. Unlike two-stage object detectors, which require an object proposal generation stage and an object prediction stage, a single-stage detector directly performs the prediction and thus saves a huge amount of computation. An attached multi-scale fusion component would be less useful if it sacrificed this speed advantage for only a slight performance improvement. In the experiments here, we first study the performance improvement brought by each individual component and then adjust the number of iterations in WeaveNet to study the speed-accuracy trade-off in comparison with state-of-the-art multi-scale fusion methods.
|use anchor [2,3]||✓||✓||✓||✓|
Width of WeaveNet
To study the effect of introducing different amounts of information from adjacent scales, we vary the channel size of the concatenated adjacent channels, noted as , from to . The larger the number, the richer the information that can be introduced in each iteration. As can be seen from the fourth column of Table 2, the accuracy consistently improves. However, the speed decreases slightly because of the higher computational complexity.
Number of anchor boxes
The regressor of single shot detector predicts the offset w.r.t. its default anchors. The original anchor setting is to use 5 different aspect ratios, i.e. , for the first one and last two scales, and use 3 aspect ratios i.e. , for the middle scales. However, as can be seen from the third column of Table 2, introducing more anchors by using five kinds of aspect ratios for all six scales can also improve the accuracy by about , with a slight computation overhead.
Bounding box refinement
We refine the final bounding box location by conducting a bounding box refinement after the NMS stage. Instead of directly using the NMS output, we further refine the location of each bounding box using a weighted sum with its surrounding boxes whose IoU is greater than 0.6. The weight is set to be the score of each box. We deploy the bounding box refinement upon a well trained model. The 2nd column in Table 2 shows improvement of the proposed bounding box refinement technique. In our experiment, we find the proposed refinement technique can always help gain mAP compared with bounding box without refinement.
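The refinement step can be sketched as a score-weighted average over overlapping detections. The sketch below is a minimal illustration under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, the candidate list contains the kept box itself, and the weights are normalized by their sum. The function names `iou` and `refine_box` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def refine_box(kept_box, detections, iou_threshold=0.6):
    """Replace an NMS output box by the score-weighted average of all
    detections whose IoU with it exceeds the threshold.

    detections: list of (box, score) pairs from before NMS (assumed to
    include the kept box itself).
    """
    total_weight = 0.0
    accumulated = [0.0, 0.0, 0.0, 0.0]
    for box, score in detections:
        if iou(kept_box, box) > iou_threshold:
            total_weight += score
            for i in range(4):
                accumulated[i] += score * box[i]
    if total_weight == 0.0:
        return tuple(kept_box)   # nothing overlaps enough; keep the box as is
    return tuple(value / total_weight for value in accumulated)


if __name__ == "__main__":
    detections = [
        ((0.0, 0.0, 10.0, 10.0), 0.75),    # the box kept by NMS
        ((1.0, 0.0, 11.0, 10.0), 0.25),    # overlapping neighbour (IoU about 0.82)
        ((20.0, 20.0, 30.0, 30.0), 0.99),  # unrelated box, ignored
    ]
    print(refine_box((0.0, 0.0, 10.0, 10.0), detections))
```

In this toy example the kept box is shifted slightly towards its lower-scoring neighbour, while the distant high-scoring box has no influence because its IoU is zero.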
|Method||Backbone||fps (Maxwell)||fps (Pascal)||mAP (%)|
|Size = 1||Size = 8||Size = 1||Size = 8|
Iterative information weaving
One major attractive property of the proposed information weaving structure is that, by gradually weaving information from different scales, the receptive field is enlarged and the learning ability of the whole network is increased. To investigate the effectiveness of such an iterative inference procedure, we design another set of experiments that varies the number of iterations of the proposed WeaveNet while keeping other components fixed. The results are summarized in Table 3. As can be seen from the results, the accuracy keeps increasing with more iterations, demonstrating that the proposed information weaving architecture can effectively fuse and refine the multi-scale information. The performance boost is also observed consistently for other settings. We also plot the results in Figure 1, where the solid line represents the same WeaveNet with different numbers of iterations. The speed does gradually decrease when using more iterations. However, compared with other existing multi-scale fusion techniques in the first block of Table 3, WeaveNet still shows significant superiority in both speed and accuracy.
4.3 Results on PASCAL 2007
In this subsection, we show the performance comparison between the WeaveNet and state-of-the-art object detection models. All of the models are trained on the union set of PASCAL VOC 2007 trainval and VOC 2012 trainval, and evaluated on PASCAL VOC 2007 test set.
|DSSD 321 ||ResNet101||78.6||81.9||84.9||80.5||68.4||53.9||85.6||86.2||88.9||61.1||83.5||78.7||86.7||88.7||86.7||79.7||51.7||78.0||80.9||87.2||79.4|
Table 4 shows the results of comparing WeaveNet with the state-of-the-art models. As can be seen from the results, WeaveNet achieves mAP, much higher than the mAP achieved by vanilla SSD. Compared with other state-of-the-art multi-scale fusion methods, e.g. StairNet, WeaveNet further improves the mAP by . We note that WeaveNet in Table 4 has and , which is slower than WeaveNets using fewer iterations. As can be seen in Figure 1, the WeaveNet with and actually achieves a slightly better speed-accuracy trade-off, with in fps (batch size = 1) using a TITAN X (Pascal) GPU.
4.4 Results on PASCAL 2012
We also evaluate the proposed method on the PASCAL VOC 2012 benchmark, where more training samples are included and the testing set is replaced by a more difficult set that has about 2 times more testing images than PASCAL VOC 2007. We did not extensively tune the training parameters, and all models are trained for the same number of iterations with exactly the same learning rate and weight decay as used on PASCAL VOC 2007. The training set is the union of PASCAL VOC 2007 trainval + test and PASCAL VOC 2012 trainval, and the final trained model is submitted to the online testing server for evaluation.
|DSSD 321 ||ResNet-101||76.3||87.3||83.3||75.4||64.6||46.8||82.7||76.5||92.9||59.5||78.3||64.3||91.5||86.6||86.6||82.1||53.3||79.6||75.7||85.2||73.9|
The evaluation results are summarized in Table 5. As can be seen from the results, our proposed method surpasses the competitors using the same backbone network by a large margin. In particular, a Reduced VGG16 based WeaveNet surpasses the vanilla SSD by and improves the overall mAP by over one of the strongest multi-scale methods, StairNet [22].
We also visualize some detection results on the testing set in Figure 6. Compared with the vanilla SSD, WeaveNet shows stronger ability for detecting both tiny and difficult objects. For the objects with medium sizes, WeaveNet provides more accurate bounding boxes which fit the objects tightly thanks to its unique iterative information weaving.
In this work, we observe that both fine information from lower layers and coarse information from higher layers are crucial for building a highly efficient object detector. We propose a novel multi-scale fusion architecture, named WeaveNet. WeaveNet iteratively weaves information from adjacent scales, which not only gradually increases the detector's receptive field but also smoothly introduces more fine details from lower layers for robust and precise bounding box prediction. It can be easily trained and deployed without batch normalization and adds very little computational cost, which makes it superior to existing multi-scale fusion methods. The experimental results demonstrate the speed and accuracy advantages of the proposed WeaveNet on the PASCAL VOC 2007 and PASCAL VOC 2012 datasets. In the near future, we would like to further evaluate the proposed WeaveNet on the challenging MS COCO benchmark [14], and release all the trained models and source code on GitHub.
- [1] S. Bell, C. Lawrence Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2874–2883, 2016.
- [2] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. Dual path networks. In Advances in Neural Information Processing Systems, pages 4468–4476, 2017.
- [3] J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379–387, 2016.
- [4] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
- [5] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.
- [6] S. Gidaris and N. Komodakis. Object detection via a multi-region and semantic segmentation-aware cnn model. In Proceedings of the IEEE International Conference on Computer Vision, pages 1134–1142, 2015.
- [7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Region-based convolutional networks for accurate object detection and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1):142–158, 2016.
- [8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [9] J. Jeong, H. Park, and N. Kwak. Enhancement of ssd by concatenating feature maps for object detection. arXiv preprint arXiv:1705.09587, 2017.
- [10] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
- [11] T. Kong, A. Yao, Y. Chen, and F. Sun. Hypernet: Towards accurate region proposal generation and joint object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 845–853, 2016.
- [12] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. arXiv preprint arXiv:1612.03144, 2016.
- [13] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002, 2017.
- [14] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
- [15] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
- [16] W. Liu, A. Rabinovich, and A. C. Berg. Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579, 2015.
- [17] C. Ning, H. Zhou, Y. Song, and J. Tang. Inception single shot multibox detector for object detection. In Multimedia & Expo Workshops (ICMEW), 2017 IEEE International Conference on, pages 549–554. IEEE, 2017.
- [18] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
- [19] J. Redmon and A. Farhadi. Yolo9000: better, faster, stronger. arXiv preprint arXiv:1612.08242, 2016.
- [20] J. Ren, X. Chen, J. Liu, W. Sun, J. Pang, Q. Yan, Y.-W. Tai, and L. Xu. Accurate single stage detector using recurrent rolling convolution. arXiv preprint arXiv:1704.05776, 2017.
- [21] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
- [22] S. Woo, S. Hwang, and I. S. Kweon. Stairnet: Top-down semantic aggregation for accurate one shot detection. arXiv preprint arXiv:1709.05788, 2017.
- [23] W. Xiang, D.-Q. Zhang, V. Athitsos, and H. Yu. Context-aware single-shot detector. arXiv preprint arXiv:1707.08682, 2017.
- [24] H. Zhou, Z. Li, C. Ning, and J. Tang. Cad: Scale invariant framework for real-time object detection.