MDSSD: Multi-scale Deconvolutional Single Shot Detector for small objects

In order to improve the detection accuracy for objects at different scales, most recent works utilize the pyramidal feature hierarchy of ConvNets from bottom to top. Nevertheless, the weak semantic information makes the bottom layers poor at detection, especially for small objects. Furthermore, most of the fine details are lost on the top layers. In this paper, we design a Multi-scale Deconvolutional Single Shot Detector for small objects (MDSSD for short). To obtain feature maps with enriched representational power, we add the high-level layers with semantic information to the low-level layers via deconvolution Fusion Blocks. It is noteworthy that multiple high-level layers at different scales are upsampled simultaneously in our framework. We implement skip connections to form more descriptive feature maps, and predictions are made on these new fusion layers. Our proposed framework achieves 78.6% mAP on PASCAL VOC2007 at 38.5 FPS with only 300×300 input.


I Introduction

Object detection has always been a focus and a challenge in the field of computer vision, including the detection of small objects. To recognize objects at various scales, the majority of previous detectors are based on hand-crafted features [1, 2] and utilize image pyramids (see Fig. 1(a)). Those works are computationally expensive in terms of memory and inference time. With the arrival of deep convolutional networks (ConvNets [3]), the performance of object detection has improved significantly. However, small object detection remains a challenging issue because small objects occupy a relatively small area and carry little information in images.

The scale of representations is of crucial importance for the detection task. In recent years, hand-engineered features have been replaced with features computed by convolutional neural networks. Recent detection systems [4, 5, 6] leverage the top-most feature maps computed by ConvNets on a single input scale to predict candidate bounding boxes with different scales and aspect ratios (see Fig. 1(b)). However, the top-most feature maps have a fixed receptive field, which conflicts with objects at different scales in natural images. In particular, little information is left on the top-most features for small objects, which may compromise detection performance.

To address the multi-scale problem, SSD [7] and MS-CNN [8] utilize the pyramidal feature hierarchy from bottom to top (see Fig. 1(c)) to adapt to objects of various sizes. Nevertheless, the layers from the bottom of the ConvNet have weak semantic information, which harms their representational capacity for small object recognition. The most recent networks [9, 10] try to exploit the pyramidal features more thoroughly by building a top-down architecture with lateral connections (see Fig. 1(d)). These networks show dramatic improvements in accuracy compared with conventional detectors. However, we note that these methods apply deconvolution to the top-most feature maps, which have totally lost the fine details of small objects. On the other hand, systems based on fusion features implement connections for every prediction layer, and the additional layers incur additional computational cost, making them impractical for real applications where inference time matters.

(a) Featurized image pyramids
(b) Single feature map
(c) Pyramidal feature hierarchy
(d) Feature pyramid network
(e) Multi-scaled fusion module
Fig. 1: (a) The detectors utilize image pyramids as input to compute a multi-scale feature representation. (b) The detectors utilize single scale feature to make predictions. (c) The detectors utilize pyramidal feature hierarchy to replace featured image pyramids. (d) A top-down architecture with lateral connections. (e) Our proposed multi-scale fusion module with skip connections.
Fig. 2: The architecture of MDSSD. First we apply deconvolution layers to the high-level semantic feature maps at different scales (i.e., conv8_2, conv9_2, and conv10_2) simultaneously. Then we build skip connections with lower layers (conv3_3, conv4_3, and conv7) through Fusion Blocks and form 3 new fusion layers (Module 1, Module 2, and Module 3). Predictions are made both on the new fusion layers (Module 1, Module 2, and Module 3) and on the original SSD layers (conv8_2, conv9_2, conv10_2, and conv11_2) at the same time.

In this paper, we dedicate our effort to improving detection performance for small objects while maintaining inference speed. The low-level features within a ConvNet are more accurate for object localization due to their small receptive fields and less downsampling. Nevertheless, the weak semantic information makes the low-level features poor at classification, especially for small objects. With this in mind, we add the high-level features with semantic information to the low-level features via deconvolution Fusion Blocks to obtain feature maps with rich information (see Fig. 1(e)). We take the state-of-the-art object detector, the Single Shot MultiBox Detector, as the base framework, and then add multi-scale deconvolution fusion modules. We call the resulting detector the Multi-scale Deconvolutional Single Shot Detector for small objects, MDSSD.

FPN [9] and DSSD [10] apply deconvolution to the top-most feature maps, which have lost most of the fine details of small objects, and their deconvolution modules depend entirely on the last convolution layer, placing a heavy burden on the top-most layer. Different from these architectures, we try to make full use of the multi-scale convolution layers before the top-most layer, which still retain enough details for small objects along with semantic information. Therefore, we apply deconvolution layers to multi-scale features before the top-most layer, as shown in Fig. 2. Afterwards, we merge them with some of the bottom features to form more semantic feature maps. In order to further improve the performance of the network for small object detection, we intentionally add the conv3_3 output of the backbone network for prediction. To avoid additional cost, each Fusion Block connects only two layers of different scales, and Fusion Modules are built only for the low-level prediction features.

The backbone network we choose remains VGG16, instead of deeper ConvNets (e.g., ResNet [11] or DenseNet [12]), because deeper ConvNets harm small object localization and inference speed. The proposed MDSSD framework turns out to be rather effective for small objects, and it meets real-time requirements as well. The main contributions of our work are summarized as follows:

  • We propose a novel feature fusion framework for small object detection. Deconvolution layers are applied to the semantic high-level features at different depths, yielding higher-resolution features, which we then merge with the low-level features via skip connections.

  • We design several delicate multi-scale deconvolution Fusion Modules. The new fusion features are rich in semantic information with relatively high resolution, providing a significant improvement on detection of small objects.

  • By conducting quantitative and qualitative experiments, we show that the proposed MDSSD achieves state-of-the-art performance on the PASCAL VOC2007 and MS COCO benchmarks. Moreover, it improves the detection accuracy for small objects by a large margin with only slightly degraded speed.

II Related Work

Most traditional methods for object detection are based on the sliding-window paradigm and use hand-crafted features such as Haar [13] and DPM [14]. With the development of ConvNets in recent years, the accuracy and inference speed of detection have been greatly improved by integrating feature learning and classification into one framework. We classify these ConvNet-based works into the following three categories:

The detectors based on the top-most features.  OverFeat [15] applies a sliding window to the feature maps to create bounding boxes, decomposing detection into localization and classification; it tends to be costly. SPPnet [6] designs the Spatial Pyramid Pooling layer so that input images of any size are feasible, which is efficient in computation. R-CNN [16] and Fast R-CNN [4] use selective search to generate bounding boxes, extract features with a CNN, and classify them with an SVM. Faster R-CNN [5] uses an RPN (Region Proposal Network) to directly generate anchor boxes with different scales and aspect ratios on the feature maps, improving effectiveness and efficiency. In [ST], a Spatio-Temporal Closed-Loop object detector is proposed for object detection in video sequences. YOLO [17] divides the input image into regions, and then regresses and classifies the bounding boxes in each region at real-time speed. However, all these methods locate and classify objects based on the top-most features of the convolutional network. They rely on the information extracted by the upper features to a large extent and do not make full use of the bottom details.

The detectors based on multi-scale features.  To make full use of the multifarious information from different convolution layers and cover objects of different scales and shapes, a set of approaches [7, 8, 18, 19, Edge] make predictions on multi-scale features. SSD [7], a single-shot detector, makes predictions by applying small 3×3 convolutional filters to six feature maps at different depths, from bottom to top. It is one of the state-of-the-art detectors considering both accuracy and speed. MS-CNN [8] proposes a framework consisting of a proposal sub-network and a detection sub-network. In the proposal sub-network, detection is performed at multiple output features, and a deconvolution layer is introduced to upsample feature maps and add contextual information. Nevertheless, the layers from the bottom of a ConvNet have weak semantic information, which harms their representational capacity for small object recognition.

The detectors based on combinations of multi-scale features.  In order to enrich the feature maps, a number of approaches [20, 21, 22, 23, 24, RGBD, GL] concatenate multi-scale features of ConvNets to increase contextual information. The recent methods FPN [9] and TDM [25] adopt a top-down pathway and conduct skip connections in their architectures to enhance the power of features. DSSD [10] applies deconvolution layers to the top of SSD to realize upsampling and then connects the results with convolutional feature maps; predictions are made on these new fusion feature maps with contextual information. Such bottom-up and top-down architectures are also utilized in semantic segmentation [26] and human pose estimation [27].

Inspired by these works, we propose MDSSD for small object detection. It combines high-level and low-level features to add contextual information for small object detection. The difference is that our deconvolution layers are applied not to the top-most layer of the ConvNet, but to multiple top features at different scales simultaneously. We then merge them with some bottom layers to form new, more informative features.

III Multi-scale Deconvolutional Network

In this section, we first review the powerful SSD framework briefly. Then we introduce the principle of the proposed Multi-scale Deconvolutional Single Shot Detector. Afterwards, we analyze how to design the Fusion Block. Finally, we discuss the training policy.

III-A SSD

Please refer to [7] for the overall architecture of SSD with 300×300 input. It takes the standard VGG16 [28] as the feature extractor and adds extra convolution layers to the truncated backbone network. SSD utilizes the pyramidal feature hierarchy within a ConvNet to predict objects at different scales. Predictions at multiple scales improve the mAP, and the single-shot architecture meets real-time requirements. However, it is hard for SSD to classify small objects owing to the weak semantic information on the shallow features. Therefore, it is imperative to yield more semantic feature maps for small object detection.

Fig. 3: Detecting small objects such as the sheep in the image requires fine details from the shallow layers. The sheep occupies only a small area in the original image, and only a tiny footprint remains on conv8_2 due to severe downsampling. For even smaller objects, the representation of fine details becomes weaker and weaker after the conv8_2 layer.

III-B MDSSD Architecture

Fusion Module: Module 1 | Module 2 | Module 3
Connection Layers: conv3_3 + conv8_2 | conv4_3 + conv9_2 | conv7 + conv10_2

Structure (300×300 input):
  Module 1: low branch: 3×3×256 Conv, L2 Norm; high branch: [3×3×256 Deconv, 3×3×256 Conv, L2 Norm] ×2
  Module 2: low branch: 3×3×512 Conv, L2 Norm; high branch: [2×2×512 Deconv, 3×3×512 Conv, L2 Norm] ×2
  Module 3: low branch: 3×3×1024 Conv, L2 Norm; high branch: [3×3×1024 Deconv, 3×3×1024 Conv, L2 Norm] ×2

Structure (512×512 input):
  Module 1: low branch: 3×3×256 Conv, L2 Norm; high branch: [2×2×256 Deconv, 3×3×256 Conv, L2 Norm] ×2
  Module 2: low branch: 3×3×512 Conv, L2 Norm; high branch: [2×2×512 Deconv, 3×3×512 Conv, L2 Norm] ×2
  Module 3: low branch: 3×3×1024 Conv, L2 Norm; high branch: [2×2×1024 Deconv, 3×3×1024 Conv, L2 Norm] ×2

Fusion:
  Module 1: Eltw-sum, ReLU, 3×3×256 Conv, ReLU
  Module 2: Eltw-sum, ReLU, 3×3×512 Conv, ReLU
  Module 3: Eltw-sum, ReLU, 3×3×1024 Conv, ReLU

TABLE I: The structure of the three Fusion Blocks with 300×300 and 512×512 input. [·]×2 indicates two identical operations. The stride is 2 for all deconvolution layers, and 1 for all convolution layers.

Empirically, we regard an object as small when the area it occupies in the image is below a small threshold (the area is measured as the number of pixels in the segmentation mask). As shown in Fig. 3, the sheep occupies only a small area, and we can obtain its fine details only on the shallow layers within the ConvNet (conv3_3 – conv7). The representation of fine details for the sheep becomes weaker and weaker over the following layers and is totally lost on the coarse, semantic deepest layer (conv11_2). We intend to make full use of the shallow layers with rich fine details and of the relatively deep layers that carry semantic information as well as some fine details of small objects.
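To make the downsampling argument concrete, the following minimal sketch (ours, not part of the paper) projects an object of a given pixel size onto the SSD feature maps; the listed resolutions (75, 38, 19, 10, 5, 3, 1) are the standard values for a 300×300 input and should be read as assumptions. A 32×32-pixel object, for instance, covers roughly one cell on conv8_2 and effectively vanishes on the deeper layers.

```python
# Illustrative sketch (not from the paper): project an object's pixel size onto
# the SSD300 feature maps to see how quickly its footprint shrinks.
FEATURE_MAPS = {
    "conv3_3": 75, "conv4_3": 38, "conv7": 19,
    "conv8_2": 10, "conv9_2": 5, "conv10_2": 3, "conv11_2": 1,
}
INPUT_SIZE = 300

def footprint(object_pixels: int, fmap_size: int, input_size: int = INPUT_SIZE) -> float:
    """Approximate side length (in cells) that an object of `object_pixels` px occupies."""
    return object_pixels * fmap_size / input_size

if __name__ == "__main__":
    obj = 32  # e.g. a roughly 32x32-pixel object such as the sheep in Fig. 3
    for name, size in FEATURE_MAPS.items():
        cells = footprint(obj, size)
        print(f"{name:9s} ({size:2d}x{size:2d}): ~{cells:.1f} x {cells:.1f} cells")
```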

Fig. 2 shows the overall framework of MDSSD for 300×300 input. As we have analyzed earlier, the shallow feature maps (conv3_3 – conv7) inherently have small receptive fields, and they are mainly responsible for small object detection. We add the high-level features to these low-level features through Fusion Blocks to make them more semantic and informative. In order to share the structure of the Fusion Block, we carefully choose the deep layers to form symmetric connections with these shallow layers. That is, each chosen deep feature map has the same spatial downsampling factor relative to its paired shallow feature map. Specifically, conv9_2 and conv10_2 are upsampled through Fusion Blocks and then merged with conv4_3 and conv7, respectively. The new fusion feature maps, named Module 2 and Module 3, replace the original conv4_3 and conv7 of SSD. To further improve the performance on small object detection, it is necessary to take full advantage of the underlying feature maps. Therefore, we add Module 1, which connects the lower feature (conv3_3) and the high-level feature (conv8_2), to make predictions.
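As a reading aid, the following hypothetical sketch summarizes how the fusion modules plug into the prediction layers of Fig. 2. The layer names are the paper's, while the `fuse` callable and the `prediction_sources` helper are illustrative names of our own.

```python
# Hypothetical sketch (not the authors' code) of the MDSSD layer pairings in Fig. 2.
FUSION_PAIRS = {                  # fusion module: (low-level layer, high-level layer)
    "Module 1": ("conv3_3", "conv8_2"),
    "Module 2": ("conv4_3", "conv9_2"),
    "Module 3": ("conv7",   "conv10_2"),
}

def prediction_sources(features, fuse):
    """Assemble the 7 feature maps fed to the detection heads.

    `features` maps layer names to tensors; `fuse(low, high)` stands for the
    Fusion Block of Section III-C (a sketch is given there).
    """
    fused = [fuse(features[low], features[high]) for low, high in FUSION_PAIRS.values()]
    kept = [features[name] for name in ("conv8_2", "conv9_2", "conv10_2", "conv11_2")]
    return fused + kept           # Modules 1-3 replace conv3_3, conv4_3, conv7 as sources
```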

In summary, we have 7 prediction layers at different depths in total, including 3 fusion modules (Module 1, Module 2, and Module 3) and 4 original SSD prediction layers (conv8_2, conv9_2, conv10_2, and conv11_2). We then apply small 3×3×p kernels (where p is the number of channels of the feature layer) to produce the score and shape offsets for a specific bounding box. Non-maximum suppression (NMS) with a confidence threshold of 0.01 and a jaccard overlap of 0.45 is performed to filter out most of the bounding boxes during inference. Finally, we retain the top 200 detections.
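The post-processing just described can be summarized by the following minimal NumPy sketch (ours, not the authors' Caffe implementation); per-class handling and the decoding of box offsets from the default boxes are omitted for brevity.

```python
import numpy as np

# Minimal sketch of the inference-time filtering described above:
# confidence threshold 0.01, NMS IoU (jaccard) threshold 0.45, keep at most 200 boxes.
def nms(boxes: np.ndarray, scores: np.ndarray,
        conf_thresh: float = 0.01, iou_thresh: float = 0.45, top_k: int = 200):
    """boxes: (N, 4) as [x1, y1, x2, y2]; scores: (N,). Returns indices of kept boxes."""
    order = np.argsort(-scores)                       # highest score first
    order = order[scores[order] > conf_thresh]        # drop low-confidence boxes
    keep = []
    while order.size > 0 and len(keep) < top_k:
        i = order[0]
        keep.append(i)
        # IoU of the current best box against the remaining candidates
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        order = order[1:][iou <= iou_thresh]          # suppress heavily overlapping boxes
    return keep
```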

III-C Fusion Block

There are three fusion modules at different depths in Fig. 2. We take Module 1 as an example here; Fig. 4 shows an illustration for the 300×300 input model. The feature maps must have the same size and number of channels if we merge them by element-wise product or summation. Therefore, in order to fuse conv3_3 and conv8_2, we need to upsample the spatial resolution of conv8_2 to match that of conv3_3.

Fig. 4: Deconvolution Fusion Block.

Specifically, for conv8_2 shown in Fig. 4, we implement three deconvolution layers with stride 2 to achieve upsampling, producing output maps of the same size as conv3_3. The kernel size of the deconvolution layers is 3×3 or 2×2 with 256 outputs. The deconvolution layers are followed by convolution layers, L2 normalization layers, and ReLU layers. Conv3_3 undergoes one 3×3 convolution layer followed by an L2 normalization layer. We merge the two branches by element-wise summation after the normalization layers. Then we add one convolution layer to ensure the discriminability of the features for detection. Finally, we obtain the fusion features (Module 1) after one ReLU layer.

As we mentioned before, the symmetric connections enable Module 2 and Module 3 to follow the identical principle. The dimensions of the three modules are 256, 512, and 1024, respectively. For the 512×512 input model, there are some tiny modifications. Table I sketches the structural details of the Fusion Blocks with 300×300 and 512×512 input.
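For concreteness, below is a minimal PyTorch sketch of one Fusion Block following Table I and Fig. 4. The original implementation is in Caffe; the number of upsampling stages (parameterized here, since it depends on the resolutions of the paired layers), the padding, the L2Norm scale initialization of 20 as in SSD, and the bilinear resize used to reconcile any residual size mismatch are our assumptions rather than the authors' exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class L2Norm(nn.Module):
    """Channel-wise L2 normalization with a learnable per-channel scale (as in SSD)."""
    def __init__(self, channels: int, init_scale: float = 20.0):
        super().__init__()
        self.scale = nn.Parameter(torch.full((channels,), init_scale))

    def forward(self, x):
        x = F.normalize(x, p=2, dim=1)
        return x * self.scale.view(1, -1, 1, 1)

class FusionBlock(nn.Module):
    """Sketch of one deconvolution Fusion Block (Table I / Fig. 4).

    The high-level map passes through repeated {Deconv, Conv, L2Norm} stages,
    the low-level map through {Conv, L2Norm}, and the two are merged by
    element-wise summation followed by ReLU, Conv, and ReLU.
    """
    def __init__(self, low_ch: int, high_ch: int, out_ch: int, num_up: int = 2):
        super().__init__()
        self.low_branch = nn.Sequential(
            nn.Conv2d(low_ch, out_ch, kernel_size=3, padding=1),
            L2Norm(out_ch),
        )
        stages, in_ch = [], high_ch
        for _ in range(num_up):   # each stage doubles the spatial resolution
            stages += [
                nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2),
                nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
                L2Norm(out_ch),
            ]
            in_ch = out_ch
        self.high_branch = nn.Sequential(*stages)
        self.post = nn.Sequential(
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, low, high):
        low = self.low_branch(low)
        high = self.high_branch(high)
        if high.shape[-2:] != low.shape[-2:]:   # reconcile any leftover size mismatch
            high = F.interpolate(high, size=low.shape[-2:], mode="bilinear",
                                 align_corners=False)
        return self.post(low + high)            # Eltw-sum -> ReLU -> Conv -> ReLU
```

For the conv4_3/conv9_2 pair of the 300×300 model this would be instantiated roughly as `FusionBlock(low_ch=512, high_ch=256, out_ch=512)`, assuming the standard SSD channel widths.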

Method network mAP aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv
Faster[5] VGG 73.2 76.5 79.0 70.9 65.5 52.1 83.1 84.7 86.4 52.0 81.9 65.7 84.8 84.6 77.5 76.7 38.8 73.6 73.9 83.0 72.6
ION[21] VGG 75.6 79.2 83.1 77.6 65.6 54.9 85.4 85.1 87 54.4 80.6 73.8 85.3 82.2 82.2 74.4 47.1 75.8 72.7 84.2 80.4
Faster[11] Residual-101 76.4 79.8 80.7 76.2 68.3 55.9 85.1 85.3 89.8 56.7 87.8 69.4 88.3 88.9 80.9 78.4 41.7 78.6 79.8 85.3 72.0
MR-CNN[8] VGG 78.2 80.3 84.1 78.5 70.8 68.5 88.0 85.9 87.8 60.3 85.2 73.7 87.2 86.5 85.0 76.4 48.5 76.3 75.5 85.0 81.0
R-FCN[29] Residual-101 80.5 79.9 87.2 81.5 72.0 69.8 86.8 88.5 89.8 67.0 88.1 74.5 89.8 90.6 79.9 81.2 53.7 81.8 81.5 85.9 79.9
SSD300*[7] VGG 77.5 79.5 83.9 76.0 69.6 50.5 87.0 85.7 88.1 60.3 81.5 77.0 86.1 87.5 83.97 79.4 52.3 77.9 79.5 87.6 76.8
SSD512*[7] VGG 79.5 84.8 85.1 81.5 73.0 57.8 87.8 88.3 87.4 63.5 85.4 73.2 86.2 86.7 83.9 82.5 55.6 81.7 79.0 86.6 80.0
DSSD321[10] Residual-101 78.6 81.9 84.9 80.5 68.4 53.9 85.6 86.2 88.9 61.1 83.5 78.7 86.7 88.7 86.7 79.7 51.7 78.0 80.9 87.2 79.4
DSSD513[10] Residual-101 81.5 86.6 86.2 82.6 74.9 62.5 89.0 88.7 88.8 65.2 87.0 78.7 88.2 89.0 87.5 83.7 51.1 86.3 81.6 85.7 83.7
MDSSD300 VGG 78.6 86.5 87.6 78.9 70.6 55.0 86.9 87.0 88.1 58.5 84.8 73.4 84.8 89.2 88.1 78.0 52.3 78.6 74.5 86.8 80.7
MDSSD512 VGG 80.3 88.8 88.7 83.2 73.7 58.3 88.2 89.3 87.4 62.4 85.1 75.1 84.7 89.7 88.3 83.2 56.7 84.0 77.4 83.9 77.6
TABLE II: Detection results on PASCAL VOC2007 test set. SSD300* and SSD512* indicate the latest version updated by the authors. All the methods are trained on VOC2007 and VOC2012 trainval, and tested on VOC2007 test

III-D Training

Data Augmentation. The data augmentation strategies used in SSD are also applied in our framework to build a robust model. In the latest version of SSD, a “zoom out” operation is implemented to improve the performance on small objects. We use both the original images and samples generated by random expansion and cropping for training. Please refer to SSD for more details.
Default Boxes. For Module 2 and Module 3 of MDSSD in Fig. 2, the scales and aspect ratios of the default boxes are consistent with conv4_3 and conv7 in SSD, respectively. The scale of Module 2 is set to 0.2, and the scale of the highest layer is 0.9. Default boxes with aspect ratios of 1, 2, 3, 1/2, and 1/3 are generated to match different objects. For Module 2, conv10_2, and conv11_2, each cell of the feature maps predicts four default boxes; the others have six default boxes per location. For the newly added Module 1, we keep the same scale and aspect ratios as Module 2. Following the strategy in SSD, we add an extra conv12_2 layer for the 512×512 input model to make predictions.
Matching and Hard Negative Mining. We first match each ground truth box to the default box with the best jaccard overlap. Then we match the remaining default boxes to any ground truth with jaccard overlap higher than a threshold (0.5). This strategy is beneficial for predicting multiple high-scoring bounding boxes for overlapped objects. The negative samples with the highest loss values are selected from the non-matched default boxes so that the ratio of positive to negative samples is 1:3 (a sketch of this step is given after these training details).
Loss Function. The training objective is the weighted sum of the localization loss (Smooth L1 [4]) and the confidence loss (Softmax). More details can be found in [7].
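The following compact sketch (ours) puts the hard negative mining and the weighted loss together in PyTorch. The 1:3 ratio follows the description above, while the α = 1 weighting of the localization term follows SSD [7]; both should be treated as assumptions rather than the authors' exact Caffe settings.

```python
import torch
import torch.nn.functional as F

# Sketch of an SSD-style objective: Smooth L1 localization loss on positive boxes
# plus softmax confidence loss with hard negative mining (negatives <= 3x positives).
def multibox_loss(loc_pred, conf_pred, loc_target, labels, neg_pos_ratio=3, alpha=1.0):
    """loc_pred: (N, P, 4); conf_pred: (N, P, C); loc_target: (N, P, 4);
    labels: (N, P) integer class indices with 0 = background."""
    pos = labels > 0                                   # matched (positive) default boxes
    num_pos = pos.sum().clamp(min=1)

    loc_loss = F.smooth_l1_loss(loc_pred[pos], loc_target[pos], reduction="sum")

    # Per-box confidence loss, used both for mining and for the final loss.
    conf_loss_all = F.cross_entropy(conf_pred.flatten(0, 1), labels.flatten(),
                                    reduction="none").view_as(labels.float())

    # Hard negative mining: keep only the highest-loss negatives, at most ratio * #pos.
    neg_loss = conf_loss_all.clone()
    neg_loss[pos] = 0.0
    _, idx = neg_loss.sort(dim=1, descending=True)
    _, rank = idx.sort(dim=1)
    num_neg = (neg_pos_ratio * pos.sum(dim=1, keepdim=True)).clamp(max=labels.size(1) - 1)
    neg = rank < num_neg

    conf_loss = conf_loss_all[pos | neg].sum()
    return (conf_loss + alpha * loc_loss) / num_pos
```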

IV Experimental Results

We evaluate MDSSD on the benchmark datasets PASCAL VOC2007 [30] and MS COCO [31]. All experiments are implemented in Caffe [32] on a machine with two 1080Ti GPUs. We use the well-trained SSD model as the pre-trained model for MDSSD training, and then fine-tune our model on PASCAL VOC and MS COCO. The performance is measured by mean average precision (mAP) on the VOC2007 test and COCO test-dev2015 datasets. We compare the results with state-of-the-art deep convolutional networks in terms of mAP and inference speed.

IV-A PASCAL VOC2007

We train MDSSD on the PASCAL VOC2007 and VOC2012 trainval sets (16551 images) and test on the VOC2007 test set (4952 images). The batch size is set to 32 for 300×300 input and 16 for 512×512 input. We train with SGD using a step learning-rate schedule, decreasing the learning rate twice over the course of training; the momentum and weight decay are set to 0.9 and 0.0005, respectively.

(a) images containing bottles
(b) images containing cows
(c) images containing airplanes
Fig. 5: Most of the bottles and cows in the dataset occupy a small area in the images, as shown in (a) and (b), and the airplanes appear against the specific background of the sky, as shown in (c).

Table II shows our detection results on VOC2007 test compared with other state-of-the-art architectures. Our model with 300×300 input achieves 78.6% mAP. It exceeds the latest SSD300* by 1.1 points and is comparable to DSSD321 with 321×321 input. By increasing the image size to 512×512, MDSSD achieves better performance, improving the mAP from 79.5 to 80.3. For some specific categories, such as bottle and cow, which occupy a small area in most images (see Fig. 5(a) and 5(b)), the AP of MDSSD300 improves over SSD300* by a remarkable 3–5 points, outperforming most of the other deep networks. The mAP of DSSD513 is higher than that of MDSSD512, but we argue that this is because DSSD utilizes ResNet-101 as the backbone network. It should be noted that MDSSD512 is much faster than DSSD513, as can be observed in Table V.

Fig. 6: Sensitivity and impact of object size on the VOC2007 test set using [33]. The top row shows the latest SSD results of BBox Area per category for the 300×300 and 512×512 input models, and the bottom row shows our results. Key: BBox Area: XS=extra-small; S=small.
Method data network AP(IoU=0.5:0.95) AP(IoU=0.5) AP(IoU=0.75) AP(Area:S) AR(#Dets=1) AR(#Dets=10) AR(#Dets=100) AR(Area:S)
Faster [5] trainval VGGNet 21.9 42.7 - - - - - -
ION [21] train VGGNet 23.6 43.2 23.6 6.4 23.2 32.7 33.5 10.1
Faster [11] trainval Residual-101 34.9 55.7 37.4 15.6 - - - -
R-FCN [29] trainval Residual-101 29.9 51.9 - 10.8 - - - -
YOLOv2 [23] trainval35k Darknet-19 21.6 44.0 19.2 5.0 20.7 31.6 33.3 9.8
SSD300* [7] trainval35k VGGNet 25.1 43.1 25.8 6.6 23.7 35.1 37.2 11.2
SSD512* [7] trainval35k VGGNet 28.8 48.5 30.3 10.9 26.1 39.5 42.0 16.5
DSSD321 [10] trainval35k Residual-101 28.0 46.1 29.2 7.4 25.5 37.1 39.4 12.7
DSSD513 [10] trainval35k Residual-101 33.2 53.3 35.2 13.0 28.9 43.5 46.2 21.8
DSOD300 [18] trainval DS/64-192-48-1 29.3 47.3 30.6 9.4 27.3 40.7 43.0 16.7
MDSSD300 trainval35k VGGNet 26.8 46.0 27.7 10.8 24.3 36.6 38.8 15.8
MDSSD512 trainval35k VGGNet 30.1 50.5 31.4 13.9 26.3 40.3 42.9 22.4
TABLE III: Detection results on COCO test-dev2015
Method XS mAP(%) S mAP(%)
SSD300* 49 77
SSD512* 63 81
MDSSD300 56 79
MDSSD512 66 81
TABLE IV: Comparison of mAP (%) for the XS and S categories over the 7 object categories shown in Fig. 6

In order to verify the performance of MDSSD on small objects, we also utilize the detection analysis tool from [33]. Following the definition of [33], each object is assigned to a size category depending on the object's percentile size within its category: extra-small (XS: bottom 10%); small (S: next 20%); medium (M: next 40%); large (L: next 20%); extra-large (XL: next 10%). In fact, the object size within category XS under this definition is approximately consistent with the small object area defined in Section III-B. To clearly demonstrate the improvement on small object detection, we only show the results for categories XS and S.
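For reference, a small sketch (ours) of the percentile-based size-category assignment described above; the actual analysis tool of [33] may implement it differently.

```python
import numpy as np

# Sketch of the size-category assignment from [33]: XS = bottom 10%, S = next 20%,
# M = next 40%, L = next 20%, XL = top 10% of object sizes within each category.
def size_categories(areas: np.ndarray) -> np.ndarray:
    """areas: object areas (pixels) for one object category. Returns a label per object."""
    pct = np.argsort(np.argsort(areas)) / max(len(areas) - 1, 1)   # rank mapped to [0, 1]
    bins = [0.10, 0.30, 0.70, 0.90]                                # cumulative boundaries
    names = np.array(["XS", "S", "M", "L", "XL"])
    return names[np.digitize(pct, bins)]

# e.g. size_categories(np.array([100, 400, 900, 2500, 10000]))
# -> array(['XS', 'S', 'M', 'L', 'XL'], dtype='<U2')
```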

Fig. 6 shows the comparison between our methods and SSD for sensitivity and impact of object size over 7 object categories. As shown in Table IV, MDSSD300 achieves 56% mAP and 79% mAP for categories XS and S respectively, exceeding the baseline SSD300* (49% and 77%) by 7 and 2 points. The mAPs for categories XS and S are 66% and 81% for MDSSD512, versus 63% and 81% for SSD512*. That is, our MDSSD300 model shows a significant improvement over SSD300*, while there is a smaller gain for MDSSD512 over SSD512*. This demonstrates the effectiveness of our model for small objects. The AP of some specific classes improves significantly as well, such as airplane with its background of sky (see Fig. 5(c)). This may benefit from the fusion modules with contextual information.

We also tried initializing MDSSD training from weights pre-trained on the ILSVRC CLS-LOC dataset [34]; however, we did not see any accuracy improvement, only more training time.

Method backbone network GPU Input Size speed(FPS) mAP(%) on VOC2007
Faster[5] VGG16 Titan X 1000×600 7 73.2
Faster[11] Residual-101 K40 1000×600 2.4 76.4
R-FCN[29] Residual-101 Titan X 1000×600 9 80.5
SSD300*[7] VGG16 Titan X 300×300 46 77.5
SSD512*[7] VGG16 Titan X 512×512 19 79.8
SSD300*[7] VGG16 1080Ti 300×300 64.5 77.5
SSD512*[7] VGG16 1080Ti 512×512 33.8 79.5
DSSD321[10] Residual-101 Titan X 321×321 9.5 78.6
DSSD513[10] Residual-101 Titan X 513×513 5.5 81.5
DSOD300[18] DS/64-192-48-1 Titan X 300×300 17.4 77.7
MDSSD300 VGG16 1080Ti 300×300 38.5 78.6
MDSSD512 VGG16 1080Ti 512×512 17.3 80.3
TABLE V: Comparison of Speed and Accuracy on PASCAL VOC2007 dataset. All the methods are trained on the union of VOC2007 and VOC2012 trainval and tested on VOC2007 test

IV-B MS COCO

To further validate our model, we train MDSSD300 and MDSSD512 on MS COCO [31]. We use trainval35k [21] (118287 images) for training and evaluate the results on the standard test-dev2015 split (20288 images). The batch size is set to 32 for 300×300 input and 16 for 512×512 input. We train with a step learning-rate schedule, decreasing the learning rate twice during training.

MS COCO defines objects as small (area < 32²), medium (32² < area < 96²), or large (area > 96²), where the area is measured as the number of pixels in the segmentation mask. To obtain results on COCO test-dev2015, for which the ground-truth annotations are hidden, we upload the generated results to the evaluation server. In Table III, we observe that MDSSD300 achieves 26.8% AP@[0.5:0.95], 46.0% AP@0.5, and 27.7% AP@0.75, which improves over the conventional SSD300* by 1.7, 2.9, and 1.9 points respectively. MDSSD512 also outperforms the baseline SSD512* by 1.3, 2.0, and 1.1 points respectively. Even though our model does not perform as well as DSSD, it should be noted that the backbone network of MDSSD is VGG16 and MDSSD is about 4 times faster than DSSD. Compared with the other VGG16-based detectors such as Faster R-CNN [5] and ION [21], which use larger inputs, MDSSD achieves the best results.

It is noticeable that our MDSSD300 and MDSSD512 models achieve 10.8% AP and 13.9% AP for small objects (area < 32²) respectively, which improves over SSD (6.6% and 10.9%), DSSD (7.4% and 13.0%), and DSOD (9.4%/-) by a large margin. MDSSD outperforms all one-stage networks based on either VGG16 or Residual-101. Our method achieves a higher AR (average recall) for small objects as well, which shows that MDSSD is more powerful at detecting small objects.

IV-C Inference Time

New parameters need to be learned due to the additional layers in MDSSD, so the inference speed of the network is somewhat reduced. We use 2000 images with batch size 1 to evaluate the inference speed of MDSSD on a machine with a single 1080Ti GPU. The results are presented in the 5th column of Table V, together with other state-of-the-art methods. For a fair comparison, we benchmark SSD on the same single Nvidia 1080Ti GPU as well. Our model runs at 38.5 FPS with 300×300 input and 17.3 FPS with 512×512 input. Although the speed is a little lower than SSD, it still meets real-time requirements. Our method exceeds the two-stage networks by a large margin in speed, and it also outperforms the one-stage methods DSSD and DSOD. This is mainly because we only implement connections for the shallow prediction modules instead of every prediction layer.
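For reproducibility, the timing protocol can be sketched as follows (our illustration, not the original Caffe benchmark): batch size 1, many images, and explicit GPU synchronization so asynchronous execution does not distort the measurement; `model` and `images` are placeholders.

```python
import time
import torch

# Rough sketch of per-image inference-speed measurement on a single GPU.
@torch.no_grad()
def measure_fps(model, images, warmup: int = 50) -> float:
    """`images` is a list of CHW tensors already on the GPU; len(images) must exceed warmup."""
    assert len(images) > warmup
    model.eval()
    for img in images[:warmup]:          # warm up cuDNN autotuning / lazy allocations
        model(img.unsqueeze(0))
    torch.cuda.synchronize()
    start = time.perf_counter()
    for img in images[warmup:]:
        model(img.unsqueeze(0))          # batch size 1, as in the evaluation above
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return (len(images) - warmup) / elapsed
```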

IV-D Visualization Results

Fig. 7 and Fig. 8 show some results on PASCAL VOC2007 test and COCO test-dev2015. We only display the bounding boxes with the score higher than 0.6. Different colors of the bounding boxes indicate different object categories.

Our model performs better than the conventional SSD in two cases: scenes containing small or occluded objects, as shown in Fig. 7, and scenes containing contextual information, as shown in Fig. 8. The detection results of different categories with specific relationships can be improved; for example, the detection of a motorbike can benefit the detection of a person, as can be observed in Fig. 8(a). We believe the improvement derives from the multi-scale Fusion Modules with semantic information designed in Fig. 2.

V Conclusions and Future Work

This paper proposes a Multi-scale Deconvolutional Single Shot Detector for small objects. We use multiple feature maps to better match objects with different scales and aspect ratios. The skip connections add contextual information to the low-level feature maps and make them more descriptive. Experiments conducted on benchmark datasets demonstrate the effectiveness of MDSSD for small objects. While we only take SSD as the base architecture in our method, the principle can also be applied to other object detectors, such as Faster R-CNN [5].

To further improve detection performance, it is worth replacing VGG with more effective backbones, such as ResNet [11] and DenseNet [12]; how to maintain the inference speed with these deep backbones will be part of our future work. In addition, there are still some false and missing detections in our visualized results; some examples are given in Fig. 9 and Fig. 10. These may be caused by object truncation and blurred images. We will investigate these issues in future work as well.

(a) The detection results in scenes containing small or occluded objects on PASCAL VOC2007 test set
(b) The detection results in scenes containing small or occluded objects on COCO test-dev2015 set
Fig. 7: The detection results of MDSSD (column 2, column 4, and column 6) compared with SSD (column 1, column 3, and column 5) in scenes containing small or occluded objects. We can see that MDSSD yields better performance on small and occluded objects both in (a) and (b).
(a) The detection results in scenes containing contextual information on PASCAL VOC2007 test set
(b) The detection results in scenes containing contextual information on COCO test-dev2015 set
Fig. 8: The detection results of MDSSD (columns 2, 4, and 6) compared with SSD (columns 1, 3, and 5) in scenes containing contextual information. The results of classes with specific relationships can be improved: kid and chair, dog and sofa, motorbike and person on motorbike in (a); football and football player, surfboard and surfer, baseball and baseball player in (b).
Fig. 9: The false detections of MDSSD on PASCAL VOC2007 test set with the score higher than 0.6.
Fig. 10: The missing detections of MDSSD on PASCAL VOC2007 test set with the score higher than 0.6.

References

  • [1] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005.
  • [2] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
  • [3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in International Conference on Neural Information Processing Systems, 2012, pp. 1097–1105.
  • [4] R. Girshick, “Fast r-cnn,” in International Conference on Computer Vision, 2015.
  • [5] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: towards real-time object detection with region proposal networks,” in International Conference on Neural Information Processing Systems, 2015.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in European Conference on Computer Vision, 2014.
  • [7] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European Conference on Computer Vision, 2016.
  • [8] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos, “A unified multi-scale deep convolutional neural network for fast object detection,” in European Conference on Computer Vision, 2016.
  • [9] T. Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Computer Vision and Pattern Recognition, 2017.
  • [10] C. Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, “Dssd : Deconvolutional single shot detector,” arXiv preprint arXiv:1701.06659, 2017.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Computer Vision and Pattern Recognition, 2016.
  • [12] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” CoRR, vol. abs/1608.06993, 2016.
  • [13] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Computer Vision and Pattern Recognition, 2001.
  • [14] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
  • [15] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. Lecun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” in International Conference on Learning Representations, 2014.
  • [16] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Computer Vision and Pattern Recognition, 2014.
  • [17] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Computer Vision and Pattern Recognition, 2016.
  • [18] Z. Shen, Z. Liu, J. Li, Y. G. Jiang, Y. Chen, and X. Xue, “Dsod: Learning deeply supervised object detectors from scratch,” in International Conference on Computer Vision, 2017.
  • [19] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta, “Beyond skip connections: Top-down modulation for object detection,” CoRR, vol. abs/1612.06851, 2016. [Online]. Available: http://arxiv.org/abs/1612.06851
  • [20] T. Kong, A. Yao, Y. Chen, and F. Sun, “Hypernet: Towards accurate region proposal generation and joint object detection,” in Computer Vision and Pattern Recognition, 2016.
  • [21] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick, “Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks,” in Computer Vision and Pattern Recognition, 2016.
  • [22] W. Liu, A. Rabinovich, and A. C. Berg, “Parsenet: Looking wider to see better,” arXiv preprint arXiv:1506.04579, 2015.
  • [23] J. Redmon and A. Farhadi, “Yolo9000: Better, faster, stronger,” in Computer Vision and Pattern Recognition, 2017.
  • [24] S. Honari, J. Yosinski, P. Vincent, and C. Pal, “Recombinator networks: Learning coarse-to-fine feature aggregation,” in Computer Vision and Pattern Recognition, 2016.
  • [25] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta, “Beyond skip connections: Top-down modulation for object detection,” CoRR, vol. abs/1612.06851, 2016.
  • [26] P. O. Pinheiro, T. Y. Lin, R. Collobert, and P. Dollár, “Learning to refine object segments,” in European Conference on Computer Vision, 2016.
  • [27] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in European Conference on Computer Vision, 2016.
  • [28] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations, 2015.
  • [29] J. Dai, Y. Li, K. He, and J. Sun, “R-fcn: Object detection via region-based fully convolutional networks,” in Advances in Neural Information Processing Systems 29, 2016.
  • [30] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
  • [31] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European Conference on Computer Vision, 2014.
  • [32] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Acm International Conference on Multimedia, 2014.
  • [33] D. Hoiem, Y. Chodpathumwan, and Q. Dai, “Diagnosing error in object detectors,” in European Conference on Computer Vision, 2013.
  • [34] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, and M. Bernstein, “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.