Pixel-Semantic Revise of Position: Learning a One-Stage Object Detector with a Shared Encoder-Decoder

01/04/2020 ∙ by Qian Li, et al. ∙ Institute of Computing Technology, Chinese Academy of Sciences

We observe that channel and position attention mechanisms yield different detection performance across object scales, and that many state-of-the-art detectors built on feature pyramids stack varied convolution variants and auxiliary mechanisms to enrich features, which increases runtime. This work addresses the problem with an anchor-free detector built around a shared module consisting of an encoder and a decoder with an attention mechanism. First, we take multi-level features from a backbone (e.g., ResNet-50) as the base features. Second, we feed these features into a single simple block rather than a series of complex operations. Location and classification are then produced by the detector head and the classifier, respectively, and semantic information is used to revise the geometric locations. We show that the resulting detector, a pixel-semantic revision of position, is universal, effective, and simple, especially for detecting large-scale objects. We also compare the performance of different channel-wise feature-pooling operations (mean, maximum, minimum). Finally, our method improves detection accuracy by 3.8 AP over the state-of-the-art MNC with ResNet-101 on the standard MS COCO benchmark.


1 Introduction

Figure 1: Illustration of four feature pyramid designs. (a) The anchor-based feature pyramid of SSD [Liu2016SSD] for multi-scale object detection. (b) Fusing features across levels, top-down and bottom-up, to detect multi-scale objects. (c) M2Det [Zhao2018M2Det], which extracts features with eight U-shape modules combined with attention mechanisms to improve detection AP; these methods cost considerably more time and memory. (d) Our multi-scale object detection, which uses a single shared encoder-decoder module to learn features common to all object scales.

Figure 2: ResNet [Fei2017Residual] integrated with CBAM. The spatial and channel attention mechanisms apply avg-pooling and max-pooling followed by a sigmoid layer to normalize the features, using two fully connected layers to transform the pooled feature space. The final output is obtained by multiplying the spatial attention result with the original input.

In recent years, convolutional neural networks have significantly pushed the performance of vision tasks (classification, detection, and segmentation) thanks to their rich representational power. The top-5 accuracy of the state of the art now exceeds 90% on ImageNet [deng2009imagenet], and [xie2019self] proposed a simple self-training method that reaches 87.4% top-1 accuracy on ImageNet at inference. Object detection, however, still faces challenges such as lighting, scale, and overlap that keep detectors from performing perfectly. To handle scale in particular, [Lin2016Feature], [SinghAn], [Liu2016SSD], and [Zhao2018M2Det] exploit feature pyramids and improve performance through many convolution variants or attention mechanisms, at considerable cost in inference time and memory. FPN [Lin2016Feature], the mainstream among them, adapts anchors across backbones and reaches 33.9 AP when combined with [Girshick2015Fast]; with ResNet-101 it attains 35.8 detection AP, about 3 AP higher than [Ren2017Faster]. SSD [Liu2016SSD] with VGG-16 achieves 25.1 AP. [Xiaowei2018SINet] proposed a context-aware ROI pooling method that achieves 89.6 AP on LSVH, and [LiPerceptual] applies a GAN to small-object detection. As shown in Figure 1, the feature pyramids of [Lin2016Feature], [Liu2016SSD], and [Zhao2018M2Det] are built top-down, bottom-up, or both, with independent parameters at each level. Inspired by them, we hypothesize that a single shared module can process multi-level features and extract what is common across the levels.

An attention mechanism extracts the information of interest and suppresses what is useless, usually in the form of probability maps or vectors; it mainly takes the form of spatial attention, channel attention, or both. [jaderberg2015spatial] transforms the spatial information of the original image into another space while retaining the key information through attention. [hu2018squeeze] works across the channel dimension and divides attention into three steps: squeeze, excitation, and scale. Compared with the standard residual module, [Fei2017Residual] uses soft attention with a mask mechanism, combining the masked current-layer information with information from the previous layer so that deeper layers do not starve for information after masking. [wang2018non] proposed non-local blocks to capture long-range relationships, e.g., weighting the current pixel against every other pixel of a 2D image, or relating all pixels of the current frame to all pixels of all frames. As shown in Figure 2, [Woo2018CBAM] builds on [hu2018squeeze] and infers attention maps along two independent dimensions (channel and spatial), then multiplies the attention maps into the input feature map for adaptive feature refinement, improving both classification and detection. CBAM [Woo2018CBAM] on ResNet-50 improves ImageNet classification accuracy by 1.76% over plain ResNet-50. We therefore hypothesize that, with respect to object size, different channel pooling operations yield different detection performance: can minimum pooling improve detection precision on small objects?
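To make the mechanism concrete, the following is a minimal PyTorch sketch of CBAM-style channel and spatial attention as described above; the reduction ratio of 16 and the 7×7 spatial kernel are common defaults assumed here, not values taken from this paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: avg- and max-pooled channel descriptors pass
    through a shared two-layer MLP, are summed, and gated by a sigmoid."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        attn = self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3)))
        return x * torch.sigmoid(attn).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """Spatial attention: channel-wise avg/max maps are concatenated,
    fused by a 7x7 convolution, and gated by a sigmoid."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, as in Figure 2."""
    def __init__(self, channels):
        super().__init__()
        self.channel = ChannelAttention(channels)
        self.spatial = SpatialAttention()

    def forward(self, x):
        return self.spatial(self.channel(x))
```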

According to their use of region proposals, object detection methods divide into two-stage detectors [girshick14CVPR], [Dai2016R], [Girshick2015Fast], [Ren2017Faster], which generate region proposals and then apply further modules to extract the information of interest, and one-stage detectors such as [Redmon2017YOLO9000], [Liu2016SSD], [fu2017dssd], and [kong2017ron]; two-stage detectors achieve higher detection AP but are much slower. According to whether they use anchors, detectors divide into anchor-based methods [Ren2017Faster], [Liu2016SSD], [Lin2016Feature], and [Zhao2018M2Det], which require anchor settings (density, range, shape), and anchor-free methods [Yu2016UnitBox], [Huang2015DenseBox], [Redmon2015You], [Redmon2018YOLOv3], [Law2018CornerNet], [Zhu2019Feature], [tian2019fcos], [kong2019foveabox], [Law2019CornerNet], and [Zhou2019Bottom], which use full convolution, object corner points, or adaptive feature-level selection and reduce inference time. Some anchor-free methods use feature pyramids to improve multi-scale detection performance; ours is likewise an anchor-free method with feature pyramids. Our contributions can be summarized as follows:

  • We propose a shared encoder-decoder module with an attention mechanism for object detection. The module extracts features common to different feature levels and object scales, so the parameters across levels of the feature pyramid are shared.

  • We propose a semantic revising method for geometric locations: detection guided by semantic features adapts to objects more flexibly than purely geometric prediction, making our method better suited to multi-scale objects in real scenes.

  • We study the impact of maximum, average, and minimum channel pooling on small- and large-object detection. Combining a minimum pooling with [Woo2018CBAM] improves detection AP on small objects.

  • Built on ResNet-50, our detector achieves 49.8% AP@0.5 on the standard MS COCO benchmark, 3.1 AP higher than the same model without the attention-equipped encoder-decoder. More importantly, it gains 1 AP on small objects.

2 Related Works

Feature Pyramid: Traditional models handle multi-scale object detection by building feature pyramids with different algorithms. SSD [Liu2016SSD] predicts directly from different feature levels, which alleviates the multi-scale challenge to an extent. [Zhao2018M2Det] extracts high-level features with a U-shape module and combines them with the next input to feed the next U-shape module; the weights of each level are independent, so inference costs more time and memory. [BaeObject] addresses occlusion by decomposing feature maps into modules that learn separately and then learning the relations between the original feature map and each sub-module; because the relations among the sub-modules of each object are complex, learning them is itself a challenge. To deal with this difficulty, we use a single shared module to learn the features; specifically, we obtain features common to all levels.

Encoder-Decoder: There are many ways to learn high-level semantic features. Traditional models such as [Simonyan2014Very] and [Fei2017Residual] learn more discriminative features by deepening the middle of the network; [Fei2017Residual] introduces the residual module, in which one path passes the input feature through directly while the other applies two or three convolutions to produce a residual, and the two paths are combined to form the output. In 2014, Cho et al. first proposed an encoder-decoder for text sequence problems [sutskever2014sequence]. [badrinarayanan2017segnet] uses an encoder-decoder for classification: each decoder stage mirrors the spatial size and channel count of its encoder counterpart, the encoder learns more discriminative features, and the decoder upsamples back to the input size. We exploit a shared encoder-decoder to implement the detection task.

Attention Mechanism:

Deep learning methods, CNNs in particular, have achieved strong performance in object classification. Generally, deepening the network or adding an attention mechanism improves classification accuracy: [Fei2017Residual] improves it by fusing original features with higher-level features, [hu2018squeeze] by modeling correlations between feature channels and strengthening the important ones, and [Woo2018CBAM] by adopting channel and position attention in a lightweight module that can be embedded in any architecture. We pair the channel attention of CBAM with a minimum pooling and apply the resulting module to small-object detection on the standard MS COCO benchmark.

3 Our Approach

Figure 3: Overview of the proposed anchor-free Pixel-Semantic Revise of Position detector. The architecture exploits the backbone and the shared encoder-decoder module with channel attention to extract features from the input image and recover detail for localization. The regression branch predicts four distances (top boundary to center, center to bottom, right to center, center to left), and the semantic center prediction branch produces a center position tied to semantic information that revises the pixel-level positions. Every convolutional module consists of a convolution layer, group normalization, and a ReLU activation, which stabilizes training and improves generalization.

Figure 4: Illustration of our attention mechanism module. Average pooling and maximum pooling are each followed by a fully connected layer that maps the pooled features back to the input feature space; the two extracted features are merged and finally multiplied with the original features to form the output.

In this section, as shown in Figure 3, we first present our ResNet-50-based detection structure, which exploits a shared encoder-decoder module with channel attention to learn the feature pyramid, and a shared detector head containing a classification prediction branch, a detection branch, and a predictor associated with the semantic center. We introduce a pixel revise branch to make the detector more suitable for practical application scenarios. For the feature pyramid, the shared encoder downsamples with stride-2 convolutions, each followed by group normalization, which reduces the impact of batch size during training, and a non-linear activation; we detail the design below.

3.1 Shared Encoder-Decoder Module with Attention Mechanism (SEDAM)

As shown in Figure 1, we propose a new shared module for handling multi-level feature pyramids. Since objects of the same class share similar semantic features, the shared module learns the features common to objects of different sizes, which improves generalization. The module is symmetric. In the encoder, features are downsampled by convolutions with stride 2 and padding 1, each followed by a 32-group normalization and a non-linear activation. More layers extract more discriminative features, but also lose more detail. In the decoder, features are upsampled by bilinear interpolation and then refined by a convolution with a 1×1 kernel and a non-linear function. Additionally, to extract more useful information, we add channel attention inside the shared encoder-decoder, and we compare the effect of position attention with different pooling operations, channel attention, or both.
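The following is a minimal PyTorch sketch of such a shared encoder-decoder under the stated design (three stride-2 downsampling stages with 32-group GroupNorm and ReLU, then three bilinear-upsampling stages with 1×1 convolutions); the class name and channel width are illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoderDecoder(nn.Module):
    """Shared across all pyramid levels. Encoder: three stride-2 3x3
    convolutions, each with 32-group GroupNorm and ReLU. Decoder: three
    stages of bilinear upsampling followed by a 1x1 convolution and ReLU."""
    def __init__(self, channels=256, depth=3):
        super().__init__()
        self.down = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, stride=2, padding=1),
                nn.GroupNorm(32, channels),
                nn.ReLU(inplace=True),
            ) for _ in range(depth))
        self.up = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, 1),
                nn.ReLU(inplace=True),
            ) for _ in range(depth))

    def forward(self, x):
        for stage in self.down:            # encode: lose resolution,
            x = stage(x)                   # gain discriminative power
        for stage in self.up:              # decode: restore resolution
            x = F.interpolate(x, scale_factor=2, mode='bilinear',
                              align_corners=False)
            x = stage(x)
        return x
```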

CBAM [Woo2018CBAM] contains channel attention and spatial attention for the classification task; we use its channel attention to improve detection AP. As shown in Figure 2, [Woo2018CBAM] acquires discriminative features through a channel attention with two fully connected layers and a spatial attention consisting of an average pooling and a maximum pooling followed by a fully connected layer; multiplying the extracted attention features with the original information yields more discriminative features in the same feature space. As shown in Figure 4, our method keeps only the channel attention sub-module.
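A sketch of that channel attention sub-module follows, with an optional minimum-pooling branch standing in for the variant that method C adds (Section 4); whether the fully connected mapping is shared across pooling branches is our assumption.

```python
import torch
import torch.nn as nn

class SEDAMChannelAttention(nn.Module):
    """Channel attention of Figure 4: average- and max-pooled channel
    descriptors are mapped back to the input feature space and merged,
    then multiplied with the input. use_min adds the extra
    minimum-pooling branch of method C (an assumption of this sketch)."""
    def __init__(self, channels, reduction=16, use_min=False):
        super().__init__()
        self.use_min = use_min
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        attn = self.fc(x.mean(dim=(2, 3))) + self.fc(x.amax(dim=(2, 3)))
        if self.use_min:  # method C: minimum pooling over the map
            attn = attn + self.fc(x.amin(dim=(2, 3)))
        return x * torch.sigmoid(attn).view(b, c, 1, 1)
```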

3.2 Shared Detector Head

To improve inference efficiency, we apply a shared detector head, taking as input the fusion of the shared encoder-decoder output with the original features so that more positional detail is preserved. When different feature levels use the same detector head, detection AP on small objects improves. As shown in Figure 3, the semantics-related tasks, classification prediction, semantic center prediction, and center score prediction, use semantic features, producing for each bounding box an 80-D, 2-D, and 1-D vector, respectively, while the geometric feature maps produce a 4-D vector per bounding box. Learning the center-point task from the object's semantic feature map ensures that the position prediction stays tied to semantics.
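A hedged sketch of such a shared head; the tower depth of 4 and the split into a semantic tower and a geometric tower are assumptions for illustration.

```python
import torch.nn as nn

class SharedDetectorHead(nn.Module):
    """One head reused by every pyramid level. A semantic tower feeds
    the 80-way classification, 2-D semantic center, and 1-D center
    score; a geometric tower feeds the 4-D distance regression."""
    def __init__(self, channels=256, num_classes=80, depth=4):
        super().__init__()
        def tower():
            layers = []
            for _ in range(depth):
                layers += [nn.Conv2d(channels, channels, 3, padding=1),
                           nn.GroupNorm(32, channels),
                           nn.ReLU(inplace=True)]
            return nn.Sequential(*layers)
        self.sem_tower = tower()
        self.geo_tower = tower()
        self.cls_pred = nn.Conv2d(channels, num_classes, 3, padding=1)
        self.center_pred = nn.Conv2d(channels, 2, 3, padding=1)
        self.score_pred = nn.Conv2d(channels, 1, 3, padding=1)
        self.bbox_pred = nn.Conv2d(channels, 4, 3, padding=1)

    def forward(self, x):
        s, g = self.sem_tower(x), self.geo_tower(x)
        return (self.cls_pred(s), self.center_pred(s),
                self.score_pred(s), self.bbox_pred(g))
```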

3.3 Margin Regression

In the feature pyramid, many candidate bounding boxes are obtained on the i-th feature map. Each candidate carries a 4-D ground-truth box vector and a ground-truth class label, which we define as B_i = (x0_i, y0_i, x1_i, y1_i, c_i), where c_i ∈ {1, 2, 3, …, C}. C is the number of class labels (we set it to 80 on the COCO dataset), and c_i is the class label of the bounding box. (x0_i, y0_i) is the minimum coordinate of the candidate box and (x1_i, y1_i) the maximum coordinate; in our experiments they are the top-left and bottom-right corners of the box, respectively. For the semantic center, s_ij = (x_ij, y_ij) denotes the j-th semantic center of the bounding boxes at the i-th feature level, and the number of semantic centers equals the number of candidate boxes. For the detector, we set a 4-D real vector t* = (l*, t*, r*, b*) as the regression target of each sample, where l*, r*, t*, and b* are the distances from the sample position to the left, right, top, and bottom sides of the bounding box, respectively. Besides the position module, there is a classification module. If a position (x, y) falls inside a ground-truth box, it is a positive sample and its class label is c_i; otherwise, it is a negative sample with class label 0 (background). When a position falls into several bounding boxes, it is an ambiguous sample, and we simply select the smallest box as its regression target. If the position (x, y) is associated with a bounding box B_i, its training target is given in Equation (1). Unlike anchor-based detectors, we directly treat each position of a feature map as a sample rather than an anchor box.

l* = x − x0_i,   t* = y − y0_i,   r* = x1_i − x,   b* = y1_i − y.    (1)
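A small sketch of how the regression targets of Equation (1) can be computed for all feature-map locations at once; the function name and tensor shapes are illustrative.

```python
import torch

def regression_targets(points, boxes):
    """Compute the (l*, t*, r*, b*) of Equation (1) for every location.

    points: (N, 2) tensor of (x, y) feature-map locations.
    boxes:  (M, 4) tensor of ground-truth boxes (x0, y0, x1, y1).
    Returns an (N, M, 4) tensor; a location is a positive sample for a
    box exactly when all four of its distances are positive."""
    x, y = points[:, None, 0], points[:, None, 1]
    l = x - boxes[None, :, 0]
    t = y - boxes[None, :, 1]
    r = boxes[None, :, 2] - x
    b = boxes[None, :, 3] - y
    return torch.stack([l, t, r, b], dim=-1)
```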

3.4 Network Configuration

The structure is built on a ResNet-50 backbone. The encoder uses three downsampling layers; the decoder, symmetrically, uses three upsampling layers plus further layers (convolution, group normalization, and ReLU), followed by the channel attention mechanism. The base features from ResNet-50 have 256 channels, so setting far more output channels in SEDAM (e.g., 1024 or 4096) would greatly exceed the base feature maps and cause unnecessary computation; we set it to 640. As shown in Figure 3, the input size is 800×800.

3.4.1 Loss Function

The structure contains a center prediction loss, a regression loss, and a classification loss. We treat center prediction as a binary classification task and use binary cross entropy: the closer a position is to the center of its regression target, the closer the predicted probability should be to 1.0. The classification loss is a focal loss with alpha 0.25 and gamma 2, which balances positive and negative samples and mines hard ones. Finally, we use a cross-correlation loss with a correlation coefficient for regression, which avoids NaNs during training when boxes do not overlap. The loss function is given in Equation (2); in our experiments we set the loss balance factor λ to 1. It is formulated as:

L = L_cls + λ · L_reg + L_center,   λ = 1.    (2)
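A sketch of Equation (2) under the stated settings (focal loss with alpha 0.25 and gamma 2, binary cross entropy on the center score). Since the exact form of the cross-correlation regression term is not specified, a standard IoU loss on the four predicted distances stands in for it here; the normalization by positive count is also an assumption.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, cls_targets, reg_pred, reg_target,
                   ctr_logits, ctr_targets, alpha=0.25, gamma=2.0, lam=1.0):
    """cls_logits/cls_targets: (N, 80) logits and one-hot labels;
    reg_pred/reg_target: (P, 4) distances (l, t, r, b) for positives;
    ctr_logits/ctr_targets: (N,) center scores in [0, 1]."""
    # Focal loss with alpha 0.25 and gamma 2 for classification.
    p = torch.sigmoid(cls_logits)
    ce = F.binary_cross_entropy_with_logits(cls_logits, cls_targets,
                                            reduction='none')
    p_t = p * cls_targets + (1 - p) * (1 - cls_targets)
    a_t = alpha * cls_targets + (1 - alpha) * (1 - cls_targets)
    cls_loss = (a_t * (1 - p_t) ** gamma * ce).sum()

    # IoU-style regression loss on the four distances; the epsilon
    # keeps the log finite when predicted and true boxes do not overlap.
    pl, pt, pr, pb = reg_pred.unbind(-1)
    gl, gt, gr, gb = reg_target.unbind(-1)
    inter = (torch.min(pl, gl) + torch.min(pr, gr)) * \
            (torch.min(pt, gt) + torch.min(pb, gb))
    union = (pl + pr) * (pt + pb) + (gl + gr) * (gt + gb) - inter
    reg_loss = -torch.log((inter + 1e-6) / (union + 1e-6)).sum()

    # Binary cross entropy on the center score.
    ctr_loss = F.binary_cross_entropy_with_logits(ctr_logits, ctr_targets,
                                                  reduction='sum')
    n_pos = cls_targets.sum().clamp(min=1.0)
    return (cls_loss + lam * reg_loss + ctr_loss) / n_pos
```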

4 Experiments

In this section, we evaluate the proposed method on the large-scale COCO benchmark, using 80k images for training and 45k for inference. To compare with the state of the art, we analyze different settings and compare against traditional methods based on FPN [Lin2016Feature]. We evaluate four methods: method A without a shared encoder-decoder; method B with a shared encoder-decoder plus CBAM; method C with a shared encoder-decoder plus CBAM combined with minimum pooling; and our final method with a shared encoder-decoder plus channel attention. This section consists of three parts: (i) implementation settings, (ii) ablation study, (iii) comparison with other detectors.

Method SED CBAM IOU Area person airplane bus train fire hydrant stop sign cat elephant bear zebra giraffe toilet clock
A - - 0.5:0.95 S 18.8 23.7 6.66 6.96 20.7 11.0 11.3 21.6 4.9 29.1 24.9 11.0 22.4
B ✓ ✓ 0.5:0.95 S 19.4 23.2 9.02 7.07 22.5 12.0 10.1 24.0 8.11 28.6 26.2 12.0 22.9
C ✓ * 0.5:0.95 S 19.2 25.4 8.76 7.4 20.8 12.1 11.6 23.4 8.17 28.2 25.9 16.7 24.7
Ours ✓ - 0.5:0.95 S 19.6 23.8 8.26 7.23 21.9 12.3 9.68 23.4 9.41 29.8 26.6 13.6 24.0
A - - 0.5:0.95 M 44.3 40.4 31.9 25.3 52.4 55.6 43.4 44.5 58.9 50.8 54.9 41.7 48.6
B ✓ ✓ 0.5:0.95 M 45.2 41.4 34.5 25.5 55.2 56.3 44.5 47.4 62.0 50.6 54.6 44.9 50.2
C ✓ * 0.5:0.95 M 45.5 43.7 34.1 28.0 57.4 57.6 43.7 46.4 58.4 51.8 56.0 43.7 49.9
Ours ✓ - 0.5:0.95 M 45.2 43.1 34.8 23.8 57.5 56.5 43.6 47.3 59.9 51.5 54.8 44.3 48.8
A - - 0.5:0.95 L 52.5 51.1 63.1 54.3 62.0 77.7 49.4 57.0 59.4 56.4 54.0 49.0 50.5
B ✓ ✓ 0.5:0.95 L 55.4 56.4 67.9 57.8 67.8 80.5 55.3 63.3 63.0 58.0 59.8 54.3 53.0
C ✓ * 0.5:0.95 L 56.3 58.6 68.2 59.9 69.1 81.4 57.3 63.6 64.6 60.8 60.2 56.7 52.3
Ours ✓ - 0.5:0.95 L 58.4 59.7 69.3 59.5 69.3 80.6 57.1 64.5 64.8 61.7 62.5 57.1 53.0
A - - 0.5 - 64.1 70.0 69.1 76.9 74.2 66.8 77.7 76.4 81.8 81.5 80.7 70.3 67.2
B ✓ ✓ 0.5 - 68.9 74.0 73.2 80.0 77.5 69.4 81.7 81.2 84.9 83.9 84.8 74.1 69.0
C ✓ * 0.5 - 68.5 75.2 72.8 80.7 78.8 69.1 82.3 80.9 84.7 85.1 84.4 76.1 69.1
Ours ✓ - 0.5 - 69.3 73.5 72.9 79.1 79.0 69.9 81.9 81.0 84.5 83.9 85.0 75.1 68.1
A - - 0.75 - 33.7 42.6 55.1 54.0 56.2 57.7 52.6 51.9 65.9 52.4 53.6 49.0 36.8
B ✓ ✓ 0.75 - 34.5 45.2 59.0 57.2 61.0 59.1 58.1 56.9 72.2 53.2 56.1 53.8 38.4
C ✓ * 0.75 - 35.1 47.7 58.9 59.4 61.2 59.4 59.8 56.3 70.2 53.7 57.8 56.0 39.6
Ours ✓ - 0.75 - 36.3 47.7 59.4 58.7 61.9 58.7 59.8 57.9 70.9 56.1 58.3 56.9 38.9
A - - 0.5:0.95 - 35.1 40.5 47.9 48.5 48.6 50.4 47.5 47.9 57.7 49.8 50.5 44.6 37.3
B ✓ ✓ 0.5:0.95 - 36.9 43.4 52.1 51.6 53.0 52.2 52.4 52.8 61.2 50.7 53.8 49.2 38.7
C ✓ * 0.5:0.95 - 37.2 45.5 52.2 53.8 54.0 53.0 53.9 52.4 61.7 52.5 54.5 50.6 39.1
Ours ✓ - 0.5:0.95 - 38.0 45.4 52.7 52.7 54.0 52.4 53.8 53.6 62.2 53.0 55.8 51.0 38.7
Table 1: Per-class detection AP comparisons for different attention mechanisms: method A without a shared encoder-decoder, method B with a shared encoder-decoder plus CBAM, method C with a shared encoder-decoder plus CBAM combined with minimum pooling (*), and our method with a shared encoder-decoder plus channel attention.
Method Backbone Revise Avg.Precision, IOU: Avg.Precision, Area:
0.5:0.95 0.5 0.75 S M L
Faster R-CNN [Ren2017Faster] VGG-16 - 21.9 42.7 - - - -
OHEM++ [Shrivastava] VGG-16 - 25.5 45.9 26.1 7.4 27.7 40.3
SSD [Liu2016SSD] VGG-16 - 25.1 43.1 25.8 6.6 25.9 41.4
SSD MobileNet-v2 - 22.1 - - - - -
DSSD321 [fu2017dssd] ResNet-101 - 28.0 46.1 29.2 7.4 28.1 47.6
R-FCN [Dai2016R] ResNet-50 - 27.0 48.7 26.9 9.8 30.9 40.3
MNC [DaiInstance] ResNet-101 - 24.6 44.3 24.8 4.7 25.9 43.6
A ResNet-50 - 25.1 45.4 24.6 10.5 29.3 32.6
A ResNet-50 ✓ 25.3 45.4 24.9 10.8 29.2 33.0
B ResNet-50 - 27.3 49.4 26.5 11.1 30.7 36.8
B ResNet-50 ✓ 27.4 49.2 26.7 11.5 30.6 36.6
C ResNet-50 - 27.5 49.5 26.9 11.3 30.9 37.4
C ResNet-50 ✓ 27.8 49.5 27.3 11.9 31.1 37.3
Ours ResNet-50 - 28.4 49.9 28.1 11.5 31.2 39.0
Ours ResNet-50 ✓ 28.4 49.8 28.1 11.8 31.1 38.9
Table 2: Detection accuracy comparisons in terms of AP percentage on the MS COCO benchmark.

4.1 Implementation Details

All experiments use a ResNet-50 backbone. The network is trained with stochastic gradient descent for 300k iterations, with an initial learning rate of 0.01, a weight decay of 0.0005, and a momentum of 0.9. ResNet-50 is initialized with ImageNet weights; for both the shared encoder-decoder and the detector head, parameters are initialized with a Gaussian. Wherever a convolution in the shared encoder-decoder or detector head has more than 32 channels, we apply Group Normalization to make training more stable. We train distributed on 2 TITAN Xp GPUs with a batch size of 8.
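A minimal sketch of the stated training configuration; the function name and the modules passed in are hypothetical, and the Gaussian std of 0.01 is an assumption not given in the paper.

```python
import torch
import torch.nn as nn

def configure_training(backbone, new_modules):
    """backbone keeps its ImageNet weights; new_modules (e.g., the
    shared encoder-decoder and detector head) get Gaussian-initialized
    convolutions before SGD training with the stated hyperparameters."""
    params = list(backbone.parameters())
    for module in new_modules:
        for m in module.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.normal_(m.weight, mean=0.0, std=0.01)
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
        params += list(module.parameters())
    return torch.optim.SGD(params, lr=0.01, momentum=0.9,
                           weight_decay=0.0005)
```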

4.2 Ablation Study

4.2.1 With or Without Shared Encoder-Decoder

As mentioned before, the feature pyramid enables multi-scale object detection. As shown in Table 1, method A, which has neither the shared encoder-decoder module nor CBAM, generally performs poorly on small objects: for example, the clock, the stop sign, and the bear reach APs of 22.4%, 11.0%, and 4.9%, respectively, while our method with SEDAM is 1.6%, 1.3%, and 4.51% better on these classes. Likewise, the feature pyramid with the shared encoder-decoder and attention outperforms the one without it on large and medium objects: for large objects, the person, airplane, fire hydrant, and toilet classes are 5.9%, 8.6%, 7.3%, and 8.1% AP better than without the module. As shown in Table 2, with the semantic center revise, our final method is 1.0% higher on small-object detection than method A, and without the semantic center revise it is 6.4% higher on large objects than the variant without SEDAM. The shared encoder-decoder with attention is therefore necessary for multi-scale object detection.

Figure 5: Test examples from the different shared encoder-decoder / attention variants. The first two columns show detections using the semantic center revise at inference; the last two columns show detections using only the geometric position. Rows one to four show methods A, B, C, and the best variant (shared encoder-decoder with channel attention), respectively.

4.2.2 Comparison Using Different Attention Mechanisms

Comparison of different attention mechanisms: we expect any attention mechanism to improve detection performance. As shown in Table 2, the four methods, method A without the attention module, method B with a shared encoder-decoder plus CBAM, method C with a shared encoder-decoder plus CBAM combined with minimum pooling, and the final method with a shared encoder-decoder plus channel attention, achieve detection APs of 25.3%, 27.4%, 27.8%, and 28.4%, so the gain from minimum pooling alone is modest. Across IOU thresholds they reach AP@0.75 of 24.9%, 26.7%, 27.3%, and 28.1%, and AP@0.5 of 45.4%, 49.2%, 49.5%, and 49.8%. We observe that the shared encoder-decoder with attention clearly improves detection, that minimum pooling is more sensitive to the IOU threshold, and that channel attention gives the largest AP advantage. As shown in Table 1, comparing method B (original CBAM) with method C (CBAM plus minimum pooling), the latter is better on small objects: the clock, toilet, and airplane classes gain 1.8%, 4.7%, and 2.2% AP, respectively. From Table 2 we observe that minimum pooling performs best on small objects, while channel attention performs better on multi-scale detection overall.

4.2.3 Inference With or Without Semantic Revise

The branch we propose here is the semantic center revise. When the network runs inference without it, the ResNet-50-based detector performs worse on small objects: methods A, B, C, and the final channel-attention method lose 0.3%, 0.4%, 0.6%, and 0.3% AP, respectively, compared with using the semantic center revise module. Conversely, all four methods improve when the module is used. In summary, the semantic center revise branch makes the network more adaptive for multi-scale object detection.

4.3 Comparison with State-of-the-art Detectors

To further show that the shared encoder-decoder module with attention learning the feature pyramid and the semantic center revise branch enhance multi-scale detection AP: as Table 2 shows, our ResNet-50-based method outperforms [fu2017dssd] with ResNet-101 and MNC [DaiInstance] with ResNet-101 by 0.4 and 3.8 AP, respectively, while consuming less time and memory thanks to the smaller backbone. The four methods (A, B, C, and the final one) all outperform traditional detectors on the MS COCO benchmark, especially on small and medium objects, achieving higher detection AP than [Ren2017Faster], [Shrivastava], [fu2017dssd], [Dai2016R], [Liu2016SSD], and [DaiInstance].

5 Discussion

As shown in Table 1 and Table 2, detection AP depends strongly on the attention mechanism. We observe that minimum pooling performs better on small objects: compared with large objects, it extracts more discriminative features and pushes the model to optimize toward features from small objects, which is why method C does best on small objects. For objects of all sizes, however, the shared encoder-decoder with channel attention is the most effective. Table 1 also shows that the shared encoder-decoder can learn the same semantic features for detecting multi-scale objects. Since semantic information can revise geometric position, geometric position could conversely be exploited to learn more discriminative semantic information. As shown in Figure 5, we compare the two inference modes, with and without semantic revise, to illustrate how the module affects detection; integrating the two could improve performance further. More importantly, our encoder-decoder module extracts the semantic features common to objects of different sizes, though differences between class semantic distributions may reduce average precision. Across these experiments, the channel attention mechanism proves more effective than the spatial attention mechanism for the object detection task.

6 Conclusion

We have proposed a one-stage anchor-free detector based on ResNet-50 with a shared encoder-decoder with attention (SEDAM) that extracts feature pyramids quickly. The method shares parameters across multi-level feature maps to exploit the semantic features common to multi-scale objects, letting the model detect objects adaptively. More importantly, we propose using semantics-related information to revise the geometric position prediction, which improves small-object detection on the MS COCO benchmark; combining channel attention and spatial attention with minimum pooling performs better on small objects. In summary, a shared encoder-decoder structure with channel attention improves detection AP; our approach not only reduces time and memory but also beats traditional methods on multi-scale object detection. We believe the approach can also be applied to other backbone networks for the multi-scale detection task.

References