With the development of deep convolutional neural networks (CNNs), many CNN-based methods have been proposed for object detection since the emergence of representative networks such as Faster RCNN [fasterrcnn], YOLO [redmon2018yolov3], SSD [SSD], and RetinaNet [retina]. However, object detection still suffers from several problems, a key one being the information imbalance between features at different scales. This imbalance arises because convolutional networks were originally designed to produce a single output for classification rather than for multi-scale tasks.
Some works have tried to fix this imbalance, such as the widely used Feature Pyramid Network (FPN), which mainly addresses the lack of high-level semantic information in shallow layers. Although FPN supplies semantic information to shallow features, feature misalignment and information loss remain in the deeper features. Feature misalignment means that there are offsets between anchors and convolutional features.
In this paper, we argue that a good feature extractor for detection should have two properties: i) enough shallow image information for bounding box regression, because localization in object detection is a regression task; ii) enough semantic information for classification, which means the output features should come from deep layers. To satisfy both properties, we introduce a novel network designed specifically for object detection, the Image Pyramid Guidance Network (IPG-Net). IPG-Net includes two main parts: the image pyramid guidance sub-network and the feature pyramid of ResNet. Fig. 1 compares a standard ResNet with our IPG-Net. IPG-Net is designed to extract better features by alleviating the information imbalance problem.
A deep convolutional network loses location (spatial) information as the layers become deeper. This may not be a problem for classification, but it matters for detection, where bounding box regression is important. The loss of spatial information results in feature misalignment in object detection, i.e., offsets between anchors and convolutional features. Besides the loss of spatial information, small objects are easily lost in the deeper convolutional layers. We argue that all these problems in object detection stem from the limits of existing convolutional network structures and cannot be fixed by simply modifying typical networks.
Here, we introduce an image pyramid to supply more spatial information to each stage of the feature pyramid of the backbone network, which reduces the problems mentioned above. For each stage of the backbone network, we compute the image pyramid feature at the corresponding level of the image pyramid. The image pyramid feature is obtained from a shallow sub-network, i.e., the image pyramid guidance sub-network, which retains more spatial information, especially for small objects. We then design a fusing module to fuse the new image pyramid feature into the backbone network.
The fusing module fuses the two kinds of features in two steps. First, we transform the original features to align their sizes and project them into a hidden space. Second, we combine the two features with a common mathematical operation. Sum, product, and concatenation are all used in our experiments, and each yields an improvement of a different degree.
Before describing the proposed method in detail, we summarize our contributions as follows:
We propose a new image pyramid guidance (IPG) network to address the loss of spatial information and of small objects' features in deep layers.
We design a new shallow image pyramid guidance sub-network to extract image pyramid features, which is flexible and lightweight.
We also design a flexible fusing module, which is simple but effective.
Object detection is a basic task for deeper visual reasoning and visual understanding. State-of-the-art deep learning detectors can be classified into two-stage models (Faster RCNN [fasterrcnn], Cascade RCNN [cai2018cascade], SNIP [snip], SNIPER [singh2018sniper], etc.) and one-stage models, and one-stage models can be further divided into anchor-based methods (RetinaNet [retina], YOLOv3 [redmon2018yolov3], etc.) and anchor-free methods (CenterNet [duan2019centernet], FSAF [fsaf], etc.). All SOTA models fall into these three branches; two-stage methods tend to achieve slightly better results, while one-stage methods are faster in practice. There are also works that design backbone networks specifically for object detection, as we do here; DetNet [li2018detnet] is one of them.
Two stage detector
Two-stage algorithms hold the state-of-the-art results on the most popular data sets, such as MS COCO [coco] and Pascal VOC [pascalvoc]. However, they suffer from limited speed and high model complexity. Information imbalance is also a hard problem for two-stage algorithms. Although some works, such as the feature pyramid network [fpn], reduce its impact to some degree, it remains unsolved.
One stage detector
To achieve faster inference, many one-stage algorithms have been proposed, and some achieve performance comparable to two-stage models. The early SOTA one-stage models are based on the anchor mechanism, but more efficient anchor-free algorithms have been proposed recently. Typical works include CenterNet, which is motivated by keypoint detection [duan2019centernet], WSMA-Seg, which is motivated by segmentation [cheng2019segmentation], and FSAF [fsaf]. Unfortunately, information imbalance and feature misalignment also affect the performance of one-stage methods, especially anchor-based detectors.
Information imbalance and Feature alignment
Several works address the imbalance problem at the feature level. PANet [panet] adds a bottom-up path to FPN to shorten the information propagation path between lower features and the topmost feature. Pang et al. propose Libra R-CNN [pang2019libra], which contains a balanced feature pyramid to reduce the imbalance at the feature level, i.e., in the outputs of the feature pyramid network (FPN). EFIP [EFIP] is a lightweight detector that first explored using an image pyramid to build a feature pyramid for detecting very small and very large objects. All of the works above focus on the imbalance and misalignment problems, but none of them solves these problems completely. Here we propose a novel network, IPG-Net, which is based on an image pyramid; introducing an image pyramid to address information imbalance is a new direction.
Image Pyramid Guidance Network (IPG-Net)
Challenges to be Solved
As mentioned in the last subsection, FPN reduces the information imbalance between features of different scales to some degree, but several challenges remain. We summarize these challenges in this part.
Deep CNNs blur the feature.
Deeper convolutional networks extract better semantic features for classification, which does not need to localize the object. However, depth is harmful for object detection, because the location of an object in deep features is not aligned with its location in the original image. Anchor-based detection algorithms rely heavily on the assumption that, for any feature, the location of an object is aligned with the original image. As a result, there is serious misalignment between the anchors and the features, and it becomes more serious as depth increases.
FPN suffers the misalignment.
The feature pyramid network fuses deep features with shallow features, resulting in better detection performance. However, because deep features are blurred, there must be misalignment between the deep features and the shallow features. For example, the spatial position that corresponds to an object in the shallow layer is not equal to the spatial position that corresponds to the same object in the deep layer.
Deep CNNs lose small objects.
Deep CNNs achieve high performance in classification partly thanks to the large stride of 32 with respect to the input image size. However, a large stride also discards detailed information in the input image, such as the information of small objects. Detecting small objects depends on this detail, so preserving it is essential for the backbone network. Small objects are usually detected on shallow features, which lack high-level semantic information. The feature pyramid network is often used to build a top-down path that supplies semantic information to the shallow layers' features. Although FPN introduces semantic information, the features of small objects have already been lost in the deeper layers, so FPN cannot recover the missing small objects.
The overall structure of our network is shown in Fig. 1. We use ResNet [resnet] as the baseline to build our new backbone network, the image pyramid guidance net, which consists of the image pyramid guidance sub-network, the backbone network, and the fusing module. Building on ResNet provides a fair comparison with existing methods.
The image pyramid guidance sub-network accepts a set of images from the image pyramid and extracts the image pyramid features for fusing. Its function is to extract shallow features that supply spatial and detail information. The image pyramid features guide the backbone network to keep the spatial information and small objects' features. We use a fusing module to perform the guidance. The fusing module fuses the deep features in the backbone network with the shallow features from the image pyramid guidance sub-network; its formulation and variants are discussed in the following subsections. The idea of the fusing module is to transform the two types of features and then combine them to enhance the features for object detection, especially small object detection.
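The effect of stride on small objects can be made concrete with a little arithmetic. The sketch below assumes the usual ResNet output strides of 4, 8, 16, and 32 for the four residual stages; the helper names `footprint_in_cells` and `footprints` are hypothetical. It only illustrates how few feature cells a small object occupies at each stage, not how any detector is implemented.

```python
# Output strides of the four residual stages of a standard ResNet
# (assumed here: res2=4, res3=8, res4=16, res5=32).
RESNET_STRIDES = (4, 8, 16, 32)


def footprint_in_cells(object_size_px, stride):
    """Side length, in feature-map cells, covered by a square object."""
    return object_size_px / stride


def footprints(object_size_px, strides=RESNET_STRIDES):
    """Footprint of one object on every stage, keyed by stride."""
    return {s: footprint_in_cells(object_size_px, s) for s in strides}


if __name__ == "__main__":
    # A 16x16 px object, well inside COCO's "small" range (area < 32^2).
    for stride, cells in footprints(16).items():
        print(f"stride {stride:2d}: {cells:.2f} cells per side")
```

At stride 32 such an object covers less than one cell per side, which is one way to see why its features are hard to recover from the deepest stage alone.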
Image Pyramid Guidance Sub-Network
Traditionally, the image pyramid is used to obtain more scales and reduce the impact of image scale, because convolutional networks are not scale-invariant. This can improve performance significantly, but the computation is too expensive for training deep neural networks. Unlike this traditional purpose, here we use the image pyramid to guide the backbone network to learn better features for detection. Better features means that features at all scales have abundant spatial information and enough semantic information, i.e., there is no feature misalignment or information imbalance.
The input of the image pyramid guidance sub-network is a simple image pyramid, i.e., a set of progressively down-sampled copies of the input image. The image at the base level has the same height and width as the common input image in object detection. We set the number of levels in the image pyramid to be consistent with the depth of the standard ResNet.
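As a minimal sketch of how the level sizes of such a pyramid could be computed, the function below halves the height and width at every level. The function name `pyramid_sizes`, the choice of level 0 as the full-resolution image, the use of integer division, and the example input size are all assumptions made for illustration.

```python
def pyramid_sizes(height, width, num_levels, factor=2):
    """Return the (height, width) of every pyramid level, largest first.

    Level 0 is the input image itself; each later level is the previous
    one down-sampled by `factor` (integer division is an assumption).
    """
    if num_levels < 1:
        raise ValueError("num_levels must be at least 1")
    sizes = []
    h, w = height, width
    for _ in range(num_levels):
        sizes.append((h, w))
        h, w = max(1, h // factor), max(1, w // factor)
    return sizes


if __name__ == "__main__":
    # 800 x 1333 is a hypothetical input size chosen only for illustration.
    for level, (h, w) in enumerate(pyramid_sizes(800, 1333, num_levels=4)):
        print(f"level {level}: {h} x {w}")
```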
Next, we describe the image pyramid guidance sub-network, which is shown in Fig. 2. The sub-network is composed of two parts: a convolution followed by max pooling, and a residual block that follows the design in [resnet]. The residual block accepts features with the same dimensions at every level and outputs features whose dimensions match those of the corresponding features in the backbone network. We use a shallow network to extract image pyramid features for two reasons. On the one hand, the function of IPG is to obtain spatial and detail information, which deep convolution would lose. On the other hand, the lightweight design keeps the additional computational cost small.
The image pyramid guidance sub-network shown in Fig. 2 is applied to every level of the image pyramid, and its output at a given level is the image pyramid feature of that level. The features from all levels of the image pyramid together form the image pyramid features.
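One way to write these definitions compactly is sketched below. The symbols are assumptions introduced for illustration and are not taken from the original equations: $I_i$ is the image at level $i$ of an $N$-level pyramid built from an $h \times w$ input with a down-sampling factor of 2, $g$ is the image pyramid guidance sub-network with parameters shared across levels, and $p_i$ is the image pyramid feature of level $i$. Treating level 0 as the full-resolution image is also an assumption.

```latex
\begin{align}
  I_{\text{pyramid}} &= \{ I_0, I_1, \dots, I_{N-1} \},
  & I_i &\in \mathbb{R}^{3 \times \lfloor h / 2^{i} \rfloor \times \lfloor w / 2^{i} \rfloor}, \\
  p_i &= g(I_i),
  & P &= \{ p_0, p_1, \dots, p_{N-1} \}.
\end{align}
```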
The backbone network is modified from the standard ResNet, which contains res1 to res5. We add new stages at the end of the ResNet; each new stage contains two Bottleneck modules, the same as in ResNet. Our ablation studies suggest that adding one new stage performs better than the other settings. A backbone that is too deep harms detection; we argue that such a backbone is difficult to train.
The reason we can use a deeper convolutional network than the standard ResNet is that the image pyramid guidance sub-network supplies enough spatial and detail information to the backbone network, which reduces the impact of feature misalignment and detail loss. One advantage of the deeper backbone is that it can generate better semantic information, which benefits classification. Another is that the network can cover a larger range of object scales.
The fusing module in this paper is flexible, and we first describe it in general form. At each level, the fusing module takes two inputs: the output of the image pyramid guidance sub-network for the corresponding image in the image pyramid, and the output of the backbone network at that level. A fusing function combines the two into the output feature of the level, and this function can take different forms. The number of levels is determined by the number of images in the image pyramid.
The fusing module is shown in the blue box in Fig. 1. It has two inputs: the image pyramid features from the image pyramid guidance sub-network and the features from the backbone network. We propose several variants to demonstrate the effectiveness of image pyramid guidance: sum, product, and concatenation are the three types of fusing modules used in our experiments. We believe that other similar designs, especially attention-based ones, would also work well in IPG-Net, but we leave them to future work. Next, we describe the three variants in detail.
We designed these variants to demonstrate the robustness of our image pyramid guidance mechanism. The details are given below.
In the sum version, we regard the image pyramid information as additional information, so the aim is to sum the image pyramid features and the main features. Because the parameters of the image pyramid guidance sub-network are shared across all levels, we need to align the channel dimension of the two types of features. Here, we use linear interpolation along the channel dimension to perform the alignment. The features are passed through linear transforms before being summed; the two inputs are the outputs of the image pyramid guidance sub-network and of the backbone network.
In the product version, we use the product to represent the information lost in the main features. After adding the missing information to the main features, we apply a layer norm operation to normalize the processed main features.
In the concatenation version, we use a concatenation operation to fuse the image pyramid feature and the main feature, similar to the fusing operation in U-Net [unet].
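The three variants can be sketched on a toy example. The code below works on the channel vector at a single spatial location, using plain Python lists instead of tensors, and skips the spatial alignment and learned linear transforms. All function names are hypothetical, and the exact form of the product variant, LayerNorm(main + main * ipg), is an assumption inferred from the description rather than the original equation.

```python
import math


def align_channels(vec, out_channels):
    """Resize a channel vector by linear interpolation along the channel axis."""
    n = len(vec)
    if out_channels == 1 or n == 1:
        return [vec[0]] * out_channels
    out = []
    for j in range(out_channels):
        pos = j * (n - 1) / (out_channels - 1)  # endpoints map to endpoints
        lo = int(math.floor(pos))
        hi = min(lo + 1, n - 1)
        t = pos - lo
        out.append((1 - t) * vec[lo] + t * vec[hi])
    return out


def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def fuse_sum(main, ipg):
    """Sum variant: align channels, then add element-wise."""
    ipg = align_channels(ipg, len(main))
    return [m + p for m, p in zip(main, ipg)]


def fuse_product(main, ipg):
    """Product variant (assumed form): LayerNorm(main + main * ipg)."""
    ipg = align_channels(ipg, len(main))
    return layer_norm([m + m * p for m, p in zip(main, ipg)])


def fuse_concat(main, ipg):
    """Concatenation variant: stack the two vectors along the channel axis."""
    return list(main) + list(ipg)
```

Note that only the sum and product variants need the channel alignment step; concatenation changes the channel count instead, so a following layer would have to accept the wider input.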
|Method||Backbone||AP||AP50||AP75||APS||APM||APL|
|Two Stage Det|
|Faster RCNN w FPN[fpn]||ResNet-101||36.2||59.1||39.0||18.2||39.0||48.2|
|Anchor based One Stage Det|
|Anchor Free One Stage Det|
|Method||Backbone||Input size||mAP|
|Two Stage Det|
|R-FCN w DCN[deformable]||ResNet-101||1000x600||82.6|
|One Stage Det|
We conduct ablation experiments on two data sets, MS COCO [coco] and Pascal VOC [pascalvoc]. MS COCO is the most common benchmark for object detection; it is divided into train and validation splits and includes more than 200,000 images and 80 object categories. Following common practice, we train on COCO train2017 (i.e., trainval35k in 2014) and test on COCO val2017 (i.e., minival in 2014) for the ablation studies. Finally, we also report our state-of-the-art results on MS COCO test-dev; the evaluation is performed on the CodaLab platform (https://competitions.codalab.org/competitions/20794). We also apply our algorithm to another popular data set, Pascal VOC. Pascal VOC 2007 has 20 classes and 9,963 images containing 24,640 annotated objects, and Pascal VOC 2012 has 20 classes and 11,530 images containing 27,450 annotated objects and 6,929 segmentations. We train our model on the Pascal VOC 2007 trainval set and the Pascal VOC 2012 trainval set, and test on the Pascal VOC 2007 test set.
We follow the common training strategy for object detection: 12 epochs with a mini-batch of 4 images per GPU. All experiments are conducted on 8 NVIDIA P100 GPUs and optimized by stochastic gradient descent (SGD) with the default SGD parameters in PyTorch. The learning rate is set to 0.01 at the beginning and decreased by a factor of 0.1 at epoch 7 and epoch 11. A linear warm-up strategy is also used, with 500 warm-up iterations and a warm-up ratio of 1.0/3. All input images are resized to one fixed size for COCO and another for Pascal VOC, consistent with common practice. The image pyramid is obtained by down-sampling the input image (linear interpolation) into four levels with a factor of 2.
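The learning rate schedule described here can be sketched as a small function. The numeric defaults come from the text; the function name `learning_rate`, the zero-based epoch counting, the choice to drop the rate at the start of the listed epochs, and the exact warm-up formula (a linear ramp from `base_lr * warmup_ratio` to `base_lr`) are assumptions following a common convention.

```python
def learning_rate(iteration, epoch, base_lr=0.01, warmup_iters=500,
                  warmup_ratio=1.0 / 3, decay_epochs=(7, 11), gamma=0.1):
    """Learning rate for a given global iteration and epoch index.

    Assumptions: epochs are counted from 0, the rate drops at the start of
    each epoch in `decay_epochs`, and warm-up rises linearly from
    base_lr * warmup_ratio to base_lr over the first `warmup_iters` steps.
    """
    num_decays = sum(1 for e in decay_epochs if epoch >= e)
    lr = base_lr * (gamma ** num_decays)
    if iteration < warmup_iters:
        progress = iteration / warmup_iters
        lr *= warmup_ratio + (1.0 - warmup_ratio) * progress
    return lr


if __name__ == "__main__":
    for it, ep in [(0, 0), (250, 0), (500, 0), (20000, 7), (40000, 11)]:
        print(f"iter {it:5d}, epoch {ep:2d}: lr = {learning_rate(it, ep):.6f}")
```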
At inference, the image sizes of the image pyramid are kept the same as in the training stage. The IoU threshold of NMS is 0.5, and the score threshold for predicted bounding boxes is 0.05. The maximum number of bounding boxes per image is set to 100.
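To make these three inference settings concrete, the following is a minimal sketch of score filtering followed by greedy non-maximum suppression, using the thresholds quoted in the text. It is a plain-Python, class-agnostic illustration with hypothetical function names; detection frameworks typically apply NMS per class with optimized kernels.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def postprocess(boxes, scores, score_thr=0.05, iou_thr=0.5, max_dets=100):
    """Return indices of kept boxes, highest score first."""
    # Drop low-scoring boxes, then visit the rest in descending score order.
    order = sorted((i for i, s in enumerate(scores) if s > score_thr),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
            if len(keep) == max_dets:
                break
    return keep
```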
Which fusing strategy is better.
We propose three different strategies to fuse the features from the image pyramid with the features of the backbone network. To compare their effectiveness, we apply each strategy to the same baseline and report the AP of small, medium, and large objects separately. Table 3 shows that all three versions have similar results for small objects, but differ by a large margin for medium and large objects. Table 3 also shows that the sum operation achieves much better performance in all metrics. We argue that the sum operation is easy to optimize, while product and concatenation are harder to optimize. We therefore perform the remaining experiments with the sum fusing module.
How deep is the IPG-Net.
Table 4 shows that AP does not always increase with depth. We also notice that the improvement comes from large objects, while the result for small objects decreases slightly. We also study the effect of keeping the spatial size of the last 3 stages, as proposed in [li2018detnet]. The results show a slight improvement for small and medium objects, but the improvement in AP is not significant. Considering the computational cost and the model performance, a depth of 5 stages is the best choice for the IPG RCNN. Here, we construct the IPG RCNN with a 4-level image pyramid guidance sub-network and a Faster RCNN head; the backbone network is a ResNet-50 that ranges from stage 1 to stage 4.
Where to perform fusing.
Here we conduct ablation experiments using an IPG-Net and a ResNet with 4 stages. First, we add only one image pyramid feature to the backbone network. Second, we increase the number of image pyramid levels to find out whether more levels are better. Table 5 shows that IPG-Net in all configurations achieves a slight improvement over the baseline ResNet, and the best configuration is only marginally better than the others. We conclude that IPG-Net is not sensitive to the position at which the image pyramid features are fused, and that increasing the number of image pyramid levels also has little effect. Overall, image pyramid guidance does improve performance in this experiment.
The effect on deep layers.
As claimed in this paper, the function of image pyramid guidance is to supply spatial information and the detail information of small objects to deep features. Here, we conduct a simple comparison experiment to verify the effectiveness of IPG in deep layers. The configuration is simple but persuasive: both the IPG-Net and the ResNet have 7 stages, but we use only the outputs of the last four stages, which are all deep features without enough detail information. The detector used here is RetinaNet [retina], which relies on every level of the feature pyramid.
Table 6 shows that IPG-Net achieves higher performance than the ResNet backbone in almost all metrics. The results also suggest that IPG-Net works with RetinaNet [retina], a one-stage detector. We also notice that IPG has a more significant effect on RetinaNet than on Faster RCNN, because the two-stage model performs ROI pooling on shallow layers' features, while the one-stage model uses both shallow and deep features.
Comparison with the state of the art results in MS COCO test-dev
Finally, we test our IPG RCNN on MS COCO test-dev to compare it with state-of-the-art detectors. We construct a modified IPG RCNN with an IPG-Net-101 and a Cascade RCNN head [cai2018cascade]. The image pyramid feature is fused at stage 3, because IPG-Net is not sensitive to the position of fusing. The depth of the IPG-Net is four stages, to make full use of the ImageNet pre-trained parameters of the standard ResNet. On MS COCO test-dev, the IPG RCNN achieves a state-of-the-art result compared with other detectors under single-scale inference.
Comparison with the state of the art results in Pascal VOC.
To further validate the results, we also test the new IPG RCNN (based on Faster RCNN [fasterrcnn]) on the Pascal VOC data set. The baseline is a Faster RCNN with ResNet-50 as the backbone network; this baseline performs much better than reported in the original paper [fasterrcnn]. We then add the fusing module at stage 3, following the ablation studies, to construct an IPG RCNN with an IPG-Net-50 and a Faster RCNN head. Table 2 reports the result of IPG-Net-50 with single-scale inference, as well as the result with a multi-scale inference strategy, which we apply to further test the effect of IPG-Net-50. Furthermore, to be consistent with previous works, we also use a 101-layer IPG-Net to obtain the state-of-the-art result; the IPG-Net-101 is fine-tuned from parameters pre-trained on the COCO data set. The single-scale and multi-scale results are all tested on the Pascal VOC 2007 test set, and Table 2 reports both for IPG RCNN-101. The results on two popular benchmarks show that the IPG RCNN is robust and effective.
In this paper, we concentrate on the information imbalance problem in object detection. In previous detection backbones, there is serious information imbalance between the shallow layers and the deep layers. We propose a novel image pyramid guidance network (IPG-Net), which includes a new sub-network based on the image pyramid, a fusing module, and a backbone network based on ResNet. The new sub-network extracts features rich in spatial information and small objects' information. The image pyramid feature from the sub-network and the feature from the backbone network are fused by the fusing module to reduce the feature misalignment problem and the loss of small objects in deep layers. We conduct extensive ablation experiments to demonstrate the effectiveness of the new image pyramid guidance network. The work can also be extended to video object detection, taking advantage of the image pyramid guidance.