Object detection is one of the key topics in computer vision, whose goals are to find the bounding boxes of objects in a given image and to classify them. In recent years, there have been huge improvements in accuracy and speed, led by deep learning technology: Faster R-CNN [ren2015faster] achieved 73.2% mAP, YOLOv2 [redmon2017yolo9000] achieved 76.8% mAP, and SSD [liu2016ssd] achieved 77.5% mAP. However, important challenges remain in detecting small objects. For example, SSD achieves only 20.7% mAP on small object targets. Figure 1 shows failure cases in which SSD cannot detect small objects. There is still a lot of room for improvement in small object detection.
Small object detection is difficult because of low resolution and limited pixels. For example, looking only at the object in Figure 2, it is difficult even for a human to recognize it. However, the object can be recognized as a bird by considering the context that it is located in the sky. Therefore, we believe that the key to solving this problem is how to include context as extra information to help detect small objects.
In this paper, we propose to use context information to tackle the challenging problem of detecting small objects. First, to provide enough information about small objects, we extract context information from the pixels surrounding a small object, utilizing more abstract features from higher layers as the object's context. By concatenating the features of a small object with the features of its context, we augment the information available for small objects so that the detector can detect them better. Second, to focus on small objects, we use an attention mechanism in the early layers. This also helps reduce unnecessary shallow feature information from the background. We select the Single Shot Multibox Detector (SSD) [liu2016ssd] as the baseline in our experiments; however, the idea can be generalized to other networks. To evaluate the performance of the proposed model, we train it on PASCAL VOC2007 and VOC2012 [everingham2010pascal] and compare it with the baseline and state-of-the-art methods on VOC2007.
2 Related Works
Object detection with deep learning The advancement of deep learning technology has greatly improved the accuracy of object detection. The first attempt at object detection with deep learning was R-CNN [girshick2014rich]. R-CNN applies a Convolutional Neural Network (CNN) to region proposals generated by selective search [uijlings2013selective]. It is, however, too slow for real-time applications since each proposed region goes through the CNN sequentially. Fast R-CNN [girshick2015fast] is faster than R-CNN because it performs the feature extraction stage only once for all region proposals. But those two works still use a separate stage for region proposals, which Faster R-CNN [ren2015faster] addresses by combining the region proposal and classification phases into one model, allowing so-called end-to-end learning. Object detection has been further accelerated by YOLO [redmon2016you] and SSD [liu2016ssd], which perform well enough for real-time object detection. However, they still do not perform well on small objects.
Small object detection Recently, several ideas have been proposed for detecting small objects [liu2016ssd, fu2017dssd, jeong2017enhancement, li2017perceptual]. Liu et al. [liu2016ssd] augmented small object data by reducing the size of large objects to overcome the lack of data. Besides data augmentation, there have been efforts to augment the required information without augmenting the dataset per se. DSSD [fu2017dssd] applies deconvolution to all the feature maps of SSD to obtain scaled-up feature maps. However, it suffers from increased model complexity and slower speed because the deconvolution module is applied to all feature maps. R-SSD [jeong2017enhancement] combines features of different scales through pooling and deconvolution, obtaining improved accuracy and speed compared to DSSD. Li et al. [li2017perceptual] use a Generative Adversarial Network (GAN) [goodfellow2014generative] to generate high-resolution features from low-resolution features.
Visual attention network Attention mechanisms in deep learning can be broadly understood as focusing on part of the input to solve a specific task, rather than seeing the entire input. Thus, attention is quite similar to what humans do when we see or hear something. Xu et al. [xu2015show] use visual attention to generate image captions: to generate a caption corresponding to an image, they use a Long Short-Term Memory (LSTM) network that attends to the relevant part of the given image. Sharma et al. [sharma2015action] applied an attention mechanism to recognize actions in video. Wang et al. [wang2017residual] improved classification performance on the ImageNet dataset by stacking residual attention modules.
This section first reviews the baseline SSD, followed by the components we propose to improve small object detection capability: first, SSD with feature fusion to obtain context information, named F-SSD; second, SSD with an attention module that gives the network the ability to focus on important parts, named A-SSD; third, the combination of both feature fusion and the attention module, named FA-SSD.
3.1 Single Shot Multibox Detector (SSD)
In this section, we review the Single Shot Multibox Detector (SSD) [liu2016ssd], whose small object detection capability we aim to improve. Like YOLO [redmon2016you], it is a one-stage detector whose goal is speed, while also improving detection at different scales by processing different levels of feature maps, as seen in Fig. 2(a). The idea is to utilize the higher resolution of the early feature maps to detect smaller objects, while the deeper, lower-resolution features handle larger objects.
SSD is based on a VGG16 [simonyan2014very] backbone with additional layers that create feature maps of different resolutions, as seen in Fig. 2(a). From each of these features, with one additional convolution layer to match the output channels, the network predicts outputs consisting of both bounding box regression and object classification.
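The per-scale prediction described above can be sketched as follows. This is a hedged illustration, not the repository's code: the class name, anchor count, and channel counts are illustrative assumptions (512 channels and a 38×38 map roughly match a conv4_3-like feature for a 300×300 input).

```python
import torch
import torch.nn as nn

class SSDHead(nn.Module):
    """Per-feature-map prediction head: one conv for box regression,
    one for classification (illustrative sketch, not SSD's exact code)."""
    def __init__(self, in_channels, num_anchors, num_classes):
        super().__init__()
        # 4 box offsets per default box
        self.loc = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=3, padding=1)
        # one score per class per default box
        self.cls = nn.Conv2d(in_channels, num_anchors * num_classes, kernel_size=3, padding=1)

    def forward(self, feat):
        # returns (N, A*4, H, W) offsets and (N, A*C, H, W) class scores
        return self.loc(feat), self.cls(feat)

head = SSDHead(in_channels=512, num_anchors=4, num_classes=21)
loc, cls = head(torch.randn(1, 512, 38, 38))
# loc: (1, 16, 38, 38), cls: (1, 84, 38, 38)
```

Each selected feature map gets its own such head; the outputs are flattened and concatenated across scales before the loss is computed.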
However, the performance on small objects is still low, 20.7% on VOC2007, hence there is still much room for improvement. We believe there are two main reasons. First, there is a lack of context information for detecting small objects. On top of that, the features used for small object detection are taken from shallow layers, which lack semantic information. Our goal is to improve SSD by adding feature fusion to solve these two problems. To improve it further, we also add an attention module so that the network focuses only on the important parts.
3.2 F-SSD: SSD with context by feature fusion
In order to provide context for a given feature map (the target feature) in which we want to detect objects, we fuse it with feature maps (context features) from layers higher than the layer of the target feature. For example in SSD, given conv4_3 as our target feature, the context features come from higher layers such as conv8_2, as seen in Fig. 3.2. Our feature fusion can, however, be generalized to any target feature and any of its higher layers.
However, those feature maps have different spatial sizes, so we propose the fusion method described in Fig. 4. Before fusing the features by concatenation, we apply deconvolution to the context features so that they have the same spatial size as the target feature. We set the number of context feature channels to half that of the target features so that the context information does not overwhelm the target features themselves. For F-SSD only, we also add one extra convolution layer to the target features that changes neither the spatial size nor the number of channels. Furthermore, a normalization step before concatenation is very important because feature values in different layers have different scales; therefore, we apply batch normalization and ReLU after each layer. Finally, we concatenate the target and context features by stacking them along the channel dimension.
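The fusion step can be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions (module names, kernel sizes, and the 2× deconvolution stride are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Sketch of the F-SSD fusion step: deconvolve a context feature to the
    target's spatial size with half the target's channels, apply BN + ReLU
    on both paths, then concatenate along the channel dimension."""
    def __init__(self, target_ch, context_ch, scale=2):
        super().__init__()
        half = target_ch // 2  # context gets half the target's channels
        # context path: deconvolution up-samples to the target's spatial size
        self.context = nn.Sequential(
            nn.ConvTranspose2d(context_ch, half, kernel_size=scale, stride=scale),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True))
        # target path: one extra conv keeping size and channels (F-SSD only)
        self.target = nn.Sequential(
            nn.Conv2d(target_ch, target_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(target_ch), nn.ReLU(inplace=True))

    def forward(self, target_feat, context_feat):
        return torch.cat([self.target(target_feat),
                          self.context(context_feat)], dim=1)

fuse = FeatureFusion(target_ch=512, context_ch=1024)
# a 38x38 target feature fused with a 19x19 context feature
out = fuse(torch.randn(1, 512, 38, 38), torch.randn(1, 1024, 19, 19))
# out: (1, 768, 38, 38) -- 512 target channels + 256 context channels
```

The half-channel choice keeps the concatenated tensor dominated by the target feature, matching the motivation above.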
3.3 A-SSD: SSD with attention module
A visual attention mechanism allows focusing on part of an image rather than seeing the entire area.
Inspired by the success of residual attention module proposed by Wang et al [wang2017residual], we adopt the residual attention module for object detection.
For our A-SSD (Fig. 3.3), we place a two-stage residual attention module after conv7, although it can be generalized to any layer.
Each residual attention stage is described in Fig. 5. It consists of a trunk branch and a mask branch. The trunk branch has two residual blocks, each with 3 convolution layers, as in Fig. 4(d). The mask branch outputs attention maps by performing down-sampling and up-sampling with residual connections (Fig. 4(b) for the first stage and Fig. 4(c) for the second stage), finalized with a sigmoid activation. The residual connections ensure that the features from the down-sampling phase are maintained. The attention maps from the mask branch are then multiplied with the output of the trunk branch, producing attended features. Finally, the attended features go through another residual block, L2 normalization, and ReLU.
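A single stage can be sketched as below. This is a deliberately simplified sketch: the paper's trunk uses residual blocks and its mask branch is deeper with skip connections, whereas here each branch is reduced to a few layers to show the trunk-times-mask structure.

```python
import torch
import torch.nn as nn

class AttentionStage(nn.Module):
    """Simplified residual-attention stage: a trunk branch produces
    features, a mask branch produces a sigmoid attention map of the
    same shape, and the two are multiplied elementwise."""
    def __init__(self, ch):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))
        self.mask = nn.Sequential(
            nn.MaxPool2d(2),                               # down-sample
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear',   # up-sample back
                        align_corners=False),
            nn.Conv2d(ch, ch, 1),
            nn.Sigmoid())                                  # attention in [0, 1]

    def forward(self, x):
        return self.trunk(x) * self.mask(x)                # attended features

stage = AttentionStage(ch=512)
y = stage(torch.randn(1, 512, 38, 38))   # same shape in and out
```

The multiplication suppresses trunk activations wherever the mask is near zero, which is what lets the network down-weight background regions.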
3.4 FA-SSD: Combining feature fusion and attention in SSD
We propose to combine the two components introduced in Sections 3.2 and 3.3, so that the model can consider context information both from the target layer and from different layers. Compared with F-SSD, instead of applying one convolution layer to the target feature, we place a one-stage attention module, as seen in Fig. 2(d). The feature fusion method (Fig. 4) remains the same.
4.1 Experimental setup
We applied the proposed method to SSD [liu2016ssd] with the same data augmentation (we use models from https://github.com/amdegroot/ssd.pytorch and weights from https://s3.amazonaws.com/amdegroot-models/ssd300_mAP_77.43_v2.pth for our baseline SSD model). We use SSD with a VGG16 backbone and a 300×300 input, unless specified otherwise.
For FA-SSD, we applied the feature fusion method to conv4_3 and conv7 of SSD: with conv4_3 as the target, conv8_2 is used as a context layer, and with conv7 as the target, conv9_2 is used as a context layer. We apply the attention module on the two lower layers used for detecting small objects. The output of the attention module has the same size as the target features.
We trained our models on the PASCAL VOC2007 and VOC2012 trainval datasets. The learning rate was held at its initial value for the first 80k iterations and then decreased twice, for the iterations up to 100k and up to 120k; the batch size was 16.
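The exact learning-rate values are not recoverable from this text; as an illustration only, a step schedule of this shape can be written as follows, with a hypothetical base rate of 1e-3 and a decay factor of 0.1:

```python
def step_lr(iteration, base_lr=1e-3, gamma=0.1, steps=(80_000, 100_000)):
    """Step schedule: multiply the rate by gamma at each boundary.
    base_lr and gamma are hypothetical values, not from the paper."""
    lr = base_lr
    for s in steps:
        if iteration >= s:
            lr *= gamma
    return lr

# held for 80k iterations, then decayed twice over the remaining training
lrs = [step_lr(i) for i in (0, 90_000, 110_000)]
```
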
All test results are reported on the VOC2007 test dataset, and we follow COCO [lin2014microsoft] for object size classification: small objects have an area of less than 32*32 pixels and large objects have an area greater than 96*96 pixels. We train and test using PyTorch on a Titan Xp machine.
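The COCO-style size buckets used above amount to a simple area check on each ground-truth box (the function name is ours, for illustration):

```python
def size_category(w, h):
    """COCO-style size bucket from a box's width and height in pixels."""
    area = w * h
    if area < 32 * 32:
        return "small"
    if area > 96 * 96:
        return "large"
    return "medium"

print(size_category(20, 20), size_category(50, 50), size_category(100, 100))
# small medium large
```
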
4.2 Ablation studies
To test the importance of the feature fusion and attention components against the SSD baseline, we compare the performance of SSD, F-SSD, A-SSD, and FA-SSD. Table 1 shows that both F-SSD and A-SSD are better than SSD, which means each component improves the baseline. Although combining fusion and attention in FA-SSD does not yield better overall performance than F-SSD, FA-SSD shows the best performance and a significant improvement on small object detection.
4.3 Inference time
One interesting observation from the results in Table 1 is that the speed is not always slower with more components. This motivates us to examine the inference time in more detail. Inference time in detection divides into two parts: the network forward pass and the post-processing, which includes Non-Maximum Suppression (NMS). Based on Table 2, although SSD has the fastest forward pass, it is the slowest during post-processing; hence, in total it is still slower than F-SSD and A-SSD.
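For reference, the NMS step counted in the post-processing above can be sketched in plain Python; this is a greedy-NMS sketch for illustration, not the repository's implementation:

```python
def nms(boxes, scores, iou_thresh=0.45):
    """Greedy Non-Maximum Suppression on [x1, y1, x2, y2] boxes;
    returns the indices of the kept boxes, highest score first."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    keep = []
    while order:
        best = order.pop(0)           # highest-scoring remaining box
        keep.append(best)
        # drop every box overlapping it above the IoU threshold
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]]
print(nms(boxes, [0.9, 0.8, 0.7]))  # [0, 2]
```

Because the number of surviving candidate boxes drives this loop, a detector that emits fewer confident background boxes can spend less time here, which is consistent with the timing breakdown above.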
4.4 Qualitative results
Figure 7 shows a qualitative comparison between SSD and FA-SSD, in cases where SSD fails to detect small objects while FA-SSD succeeds.
4.5 Attention visualization
In order to better understand the attention module, we visualize the attention masks from FA-SSD. The attention mask is taken after the sigmoid function in Fig. 4(a). The attention mask has many channels: 512 from conv4_3 and 1024 from conv7. Each channel focuses on different things, both the object and the context. We visualize some samples of the attention masks in Fig. 8.
4.6 Generalization on ResNet backbones
In order to test generalization to different SSD backbones, we experiment with ResNet [he2016deep] architectures, specifically ResNet18, ResNet34, and ResNet50. To make the feature sizes the same as in the original SSD with the VGG16 backbone, we take the features from the layer-2 outputs (Fig. 5(a)). F-SSD (Fig. 5(b)), A-SSD (Fig. 5(c)), and FA-SSD (Fig. 5(d)) then follow the VGG16 backbone version. As seen in Table 3, everything follows the trend of the VGG16 backbone version in Table 1, except that the ResNet34 backbone version does not achieve the best performance on small objects.
4.7 Results on VOC2007 test
For comparison with other works, see Table 4. All of the compared methods are trained on the VOC2007 trainval and VOC2012 trainval datasets. Although we have lower performance than DSSD [fu2017dssd], our approach runs at 30 FPS while DSSD runs at 12 FPS.
| Method | mAP (%) |
| Faster R-CNN [ren2015faster] | 73.2 |
In this paper, to improve accuracy in detecting small objects, we presented a method for adding context information to the Single Shot Multibox Detector. With this method, we capture context information appearing in different layers by fusing multi-scale features, and context within the target layer by applying an attention mechanism. Our experiments show improved object detection accuracy compared to the conventional SSD, with especially significant enhancement for small objects.
Appendix A Detailed inference time on ResNet backbones
Table 5 shows the detailed inference times for the ResNet backbone architectures.
Appendix B VOC2012 test results
Table 6 shows that FA-SSD does not improve over SSD on VOC2012. The reason needs further investigation, for example by examining the distribution of object sizes in VOC2012. Notably, based on Table 1, FA-SSD actually shows degradation on medium-size objects compared to SSD.
Appendix C Detailed per-class results on VOC2007
Table 7 shows the AP on the VOC2007 test data for each class of every architecture.
| Network | Potted plant | Sheep | Sofa | Train | TV monitor |